
Java Memory Model From a Programmer's Point-of-View

07.27.2010

1  INTRODUCTION


Programmers at large often find it difficult to learn the Java Memory Model (JMM for short) from either the Java Language Specification, Third Edition (JLSv3 for short)[11] or the Java Virtual Machine Specification, Second Edition (JVMSv2 for short)[12], for several reasons.


One is that he/she lacks a basic understanding of out-of-order execution in modern CPUs and compilers.
Next, there are many academic terms in the JLSv3 and JVMSv2 documentation; informal explanations along with detailed examples are often needed before going further.
Lastly, a programmer really doesn’t need to know all of the JMM in order to write thread-safe programs.
This article tackles JMM from these three angles.

Here is the JMM definition quoted from JLSv3: “A memory model describes, given a program and an execution trace of that program, whether the execution trace is a legal execution of the program. The Java programming language memory model works by examining each read in an execution trace and checking that the write observed by that read is valid according to certain rules.”
Don’t worry if you don’t get it for now. Hopefully you will understand it by the end of this article.

2  OUT-OF-ORDER EXECUTION IN CPUS AND COMPILERS

2.1  OUT-OF-ORDER EXECUTION IN CPUS

For demonstration purposes we break down a CPU instruction cycle, without loss of generality, as follows:

  1. Fetch the instruction.
  2. Decode the instruction, which loads any required data from main memory, among other things.
  3. Execute the instruction.
  4. Write-Back the result, which stores any generated results to main memory, among other things.

(Each CPU can have a different cycle based on its instruction set. For example, the classic RISC cycle also has a Memory Access stage right after Execute, dedicated to main memory loads and stores, and its Write-Back writes results into registers only[1]. Our simplified four stages have similar functions at a high level and are easier to reason about.)

The CPU can serially perform all stages for one instruction at a time, or parallelize the stages across multiple independent instructions, which improves overall instruction throughput.
In the latter case the CPU as a whole operates in an assembly-line fashion, with instructions coming in one side and results out the other. Hence we call it an instruction pipeline, which is best illustrated by Figure 1, extracted from Wikipedia[1].

The horizontal axis from left to right in Figure 1 represents time, while the vertical axis from top to bottom represents the “chained” instruction stages. Figure 1 shows an ideal case where an independent instruction enters the pipeline and goes through all the stages as the clock ticks. Overall there are always four concurrent instructions being executed.

Because the CPU issues instructions at the processing rate of the slowest stage, instead of at the rate of all the stages combined as in the non-pipelined case, the CPU frequency can be increased as much as the slow stages can be optimized.
Stages 2 and 4 involve main memory load and store operations, respectively, and they are the slowest, because modern CPUs are much faster than modern memory systems (by several orders of magnitude).

In order to reduce the latency of load and store operations, CPUs use registers and caches[4] (their latency is equal or very close to the clock cycle, and they map to the “working memory” mentioned in section 8.1 of JVMSv2) and buffers / queues (they facilitate asynchronous execution, which is why section 8.1 of JVMSv2 says “the transfer of data between the main memory and a thread's working memory is loosely coupled”).

Figure 1
Specifically, the decode module can have a dispatch queue where fetched instructions remain until their requested data has been loaded from main memory into cache, or until the instructions they depend on have finished (see Data Dependency below). While some instructions are waiting (stalled), ready instructions are simultaneously decoded and pushed down the pipeline.

The Write-Back module puts a store request into a store buffer if the old data is not yet in cache (the cache controller stores and loads data on a cache-line basis; each cache line is usually larger than an individual memory access, a mechanism similar to disk I/O block access) and begins to process the next independent instruction. After the old data has been brought into cache, or if it is already in cache, the instruction overwrites the cache line with the new result. The new data is eventually flushed to main memory asynchronously, depending on different policies[4] (such as when the data has to be evicted from cache for a new cache line, or along with other data in batch mode).

When a programmer writes the code in List 1, he/she may assume line 7 will be executed / completed before line 8. This assumption is right in the ideal case of Figure 1. However, it can be wrong once you take the CPU registers, caches and buffers into consideration.
For example, if field B is already in cache and A is not, B may be stored into main memory earlier than A. Even when both A and B are in cache, B can still be stored into main memory earlier than A if B’s cache line is evicted first. You can argue similarly for loads (for example, A is loaded before B from main memory) or a combination of stores and loads (for example, A is loaded before B is stored).

Simply put, the order in which statements appear in the original code, as in method writer(), is called program order; the order in which individual memory references (loads or stores) complete is called execution order.
Because CPU caches, buffers and speculative execution (see below) add so much asynchrony to instruction completion times, the execution order is not necessarily the same as the program order, which is how out-of-order execution / reordering happens in CPUs.

class ReadWriteExample {
    int A = 0;
    boolean B = false;

    //CPU1 (thread1) runs this method
    void writer() {
        A = 10;   //stores 10 to memory location A
        B = true; //stores true to memory location B
    }

    //CPU2 (thread2) runs this method
    void reader() {
        while (!B) continue; //loads data from memory location B
        // I do care about the A and B store order in method writer()
        assert A == 10; //loads data from memory location A
    }
}
List 1

If your program is single-threaded, or fields A and B in method writer() are only accessed by one thread, you really don’t care about reordering, because the two stores in method writer() are independent and program semantics are still maintained even if the two stores are reordered.
However, if your program is multi-threaded, you do sometimes care about execution order.
For example, CPU1 executes method writer() while CPU2 executes method reader().
(Because threads communicate through shared main memory, and cache memory is transparent to accesses thanks to the CPU cache coherence protocol, when we say “loads data from memory” it means “from main memory” if the data has never been loaded by any CPU, or “from another CPU’s cache” if that CPU owns the data, or “from its own cache” if this CPU owns the data. Please read reference [6] for details.)

The assert in line 15 will fail if CPU1 executed method writer() out of order. Even if CPU1 executed method writer() in program order, line 15 can still fail, because CPU2 can execute line 15 before line 13. You may say no, because line 15 logically shouldn’t be executed until line 13 is done (this is called control dependency). However, CPU2 is free to speculatively execute line 15 first[3].
Basically, when a CPU sees a branch such as an if or while statement, it doesn’t know where to fetch the next instruction until the branch instruction finishes. However, Figure 1 shows that CPU performance decreases if the CPU waits on the branch without being able to find enough independent instructions. So CPU2 may speculatively execute line 15 based on its branch prediction. It commits the execution when it can later prove that its prediction path was right. In the reader() case, that means B == true was found in line 13 by CPU2 after line 15, which is possible.

Because CPUs don’t know that you care about the order of A and B (see below for cases where they do honor code-ordering semantics), you must tell them the order using a so-called memory barrier or fence (Java has no such construct; you must use a high-level synchronizer that calls it implicitly) in order to enforce your ordering semantics. Here is the revised version:

class ReadWriteExample {
    int A = 0;
    boolean B = false;

    //CPU1 (thread1) runs this method
    void writer() {
        A = 10;   //stores 10 to memory location A
        membar;   //pseudo memory barrier to make sure
                  //line 7 is executed before the next one
        B = true; //stores true to memory location B
    }

    //CPU2 (thread2) runs this method
    void reader() {
        while (!B) continue; //loads from memory location B
        // I do care about the A and B store order in method writer()
        membar;   //pseudo memory barrier to make sure
                  //line 15 is executed before the next one
        assert A == 10; //loads from memory location A
    }
}
List 2

The membar in writer() only enforces the store order on CPU1; you still need to enforce the load order in reader(), which generalizes to “membars must be used in pairs”.

Readers are strongly recommended to read reference [6] for memory barrier details. Reference [6] shows that both the popular X86 (32 and 64 bit) and SPARC TSO actually enforce all memory reference orders except the STORE-LOAD reorder (usually a LOAD is not semantically dependent on the previous STORE). So the membars in List 2 are actually no-ops on those processors.
It also shows that all CPUs except ALPHA honor data dependency. In List 3, field B’s assignment in line 2 will wait in the dispatch queue until line 1 has executed (completed):

A = 10;
B = A * 2; //B’s value depends on the previous value of A.
List 3

Lastly, no CPU will reorder a given operation with a store if both reference the same memory location (so they have a data dependency); otherwise your program semantics would be violated.
(A curious reader should ask what membars and data dependency mean for the above pipeline figure. They mean the next dependent instruction has to wait for the result of its previous instruction, which puts stalls or bubbles into the pipeline. So the number of concurrent instructions in the pipeline will be less than the ideal four. Instruction scheduling by CPUs and compilers (see Section 2.2) can eliminate or reduce stalls.)

Based on the above analyses, we conclude that single-threaded programs run under the illusion of as-if-serial semantics. Effects of reordering are only observable to multi-threaded programs (or: reordering in one thread is only observable to, and only matters to, other threads). When the CPU doesn’t intrinsically know your intended ordering semantics, you must use a pair of membars (implicitly invoked by high-level Java synchronizers).

2.2  OUT-OF-ORDER EXECUTION IN COMPILERS

The compiler here is not the bytecode compiler; we mean the JIT compiler. A compiler is free to reorder your code either physically or logically as part of its optimizations, as long as it doesn’t violate your program semantics. Modern compilers have many powerful code transformations. Here are a few examples.

2.2.1  Instruction Scheduling

Simply put, it avoids or reduces pipeline stalls by clustering independent instructions together.
 

A = 10;
B = A + 10; //it has to wait for the previous statement to finish
C = 20;     //it has no dependency on the previous 2 statements
List 4

Suppose the compiler finds, through complex analyses, that A is not yet in cache while C is. Line 1 will then trigger a multi-cycle data load while line 3 can be done in a single cycle. The compiler can physically move line 3 in between lines 1 and 2 to reduce the stalls by one. If the compiler can find more independent instructions, it can do the same reordering to remove more stalls[8].

2.2.2  Avoid redundancy / Forward substitution

Reuse results that are already computed or loaded, and store them for later use, instead of re-computing or re-loading them.

//suppose A and B are some objects; r variables are all local
//thread1 runs this method
void foo() {
    r1 = A;
    r2 = r1.x;
    r3 = B;
    r4 = r3.x;
    r5 = r1.x;
    r6 = r1.y;
}

//thread2 runs this method
void bar() {
    r7 = A;
    r7.x = 3;
    r7.y = 5;
}
List 5

The compiler is allowed to reuse the value read by r2 in line 5 for r5 in line 8 (probably through a register), because they are both reads of r1.x with no intervening write. So line 8 can be optimized to r5 = r2.
Now suppose that initially A.x == 0 and A.y == 0, and that lines 15 and 16 in bar() happen between lines 5 and 8 in method foo(). Then r5 has value 0 and r6 has value 5, which shows that the two assignments in lines 15 and 16 appear to be reordered in foo().

Both of the above transformations maintain your code semantics. However, if you do care about the ordering in either case, you must use Java synchronizers (details in Section 3) to tell the compiler to stop reordering, which is similar to the CPU membar usage.

3  JAVA SYNCHRONIZER CONSTRUCTS

Section 2 shows that both CPUs and compilers add many uncertainties to code execution. So without a membar or Java synchronizer in List 1, you can imagine there are quite a few execution traces (ways the statements in writer() and reader() interleave with each other). It usually doesn’t make sense to reason about the assert in reader() in this case, even though its execution trace is still legal under the JMM.
Although multi-threading is so non-deterministic, you can still use a Java synchronizer (which implicitly calls membar) to establish your intended ordering, and predict that the assert in reader() will succeed.

3.1  SYNCHRONIZED STATEMENT (LOCK AND UNLOCK ACTIONS)

There is a lock action on the synchronized object’s monitor upon entry to a synchronized method or block, and correspondingly an unlock action upon exit from the synchronized method or block. We will use lock and unlock instead of synchronized statements in the explanations that follow.
The unlock action comprises a membar (a no-op if the CAS implicitly performs a membar on some processors) followed by a CAS (atomically Compare-And-Swap a flag to indicate that the monitor lock was just released). It also has an unpark function to unblock a thread waiting for the monitor.
The lock action comprises a CAS (atomically Compare-And-Swap a flag to indicate that the monitor lock was acquired) followed by a membar (a no-op if the CAS implicitly performs a membar on some processors). It also has a park function to block the calling thread if the monitor is not available.
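To make the CAS-plus-membar structure concrete, here is a toy spinlock sketch built on java.util.concurrent.atomic.AtomicBoolean (an illustration only, not how HotSpot actually implements monitors; the class and names are ours, and compareAndSet supplies the implicit membar effect described above):

import java.util.concurrent.atomic.AtomicBoolean;

//A toy spinlock, NOT HotSpot's real monitor implementation.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        //CAS followed by an implicit membar: spin until we
        //atomically flip the flag from "free" to "held"
        while (!locked.compareAndSet(false, true)) {
            Thread.yield(); //a real lock would park the thread instead
        }
    }

    void unlock() {
        //the volatile write in set() has the membar-then-release
        //effect: it publishes everything done while holding the lock
        locked.set(false);
    }
}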

In List 2 the programmer first stores a new value, 10, to A and then sets the ready flag B to true in writer() so that reader() can read the new value. Lines 8 and 10 approximate the unlock action while lines 15 and 17 approximate the lock action (actually line 15 approximates the spin lock Sun’s HotSpot VM uses for contended synchronization[13]; otherwise the lock and unlock actions also have to call OS routines to park and unpark threads). Here is the new listing using Java pseudo synchronization:

class ReadWriteExample {
    int A = 0;

    //thread1 runs this method
    void writer() {
        lock monitor1;   //a new value will be stored
        A = 10;          //stores 10 to memory location A
        unlock monitor1; //a new value is ready for reader to read
    }

    //thread2 runs this method
    void reader() {
        lock monitor1;   //a new value will be read
        assert A == 10;  //loads from memory location A
        unlock monitor1; //a new value was just read
    }
}
List 6

Suppose writer() gets the monitor first. Because line 8 synchronizes-with line 13, we say line 8 happens-before (or executes-before, or completes-before) line 13, and an edge is established from line 8 to line 13.
We also know that line 7 happens-before line 8 (it is actually the membar in line 8 that ensures line 7 happens-before the CAS in line 8), and line 13 happens-before line 14 (it is actually the membar in line 13 that ensures the CAS in line 13 happens-before line 14). From the transitivity of the ordering, we conclude that line 7 happens-before line 14. This execution trace is legal under the JMM.
What if reader() gets the monitor first? In order to enforce your program semantics, you need to call wait() between lines 13 and 14 and notify() between lines 7 and 8. This is left to the reader as an exercise (hint: wait() is associated with both lock and unlock actions); one possible answer is sketched below.
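Here is a minimal sketch of that exercise (one possible answer, with a ready flag of our own added; it is not the only correct solution):

class ReadWriteExample {
    int A = 0;
    boolean ready = false; //guarded by this object's monitor

    //thread1 runs this method
    synchronized void writer() {
        A = 10;       //stores 10 to memory location A
        ready = true; //publishes the fact that A is ready
        notifyAll();  //wakes any reader parked in wait()
    }

    //thread2 runs this method
    synchronized void reader() throws InterruptedException {
        while (!ready) {
            wait(); //releases the monitor (unlock), reacquires it (lock) when notified
        }
        assert A == 10; //guaranteed by the monitor's happens-before edges
    }
}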

We use the happens-before relationship for this argument. Generally, if one action happens-before another, then the first is visible to and ordered before the second.
We can also argue using flush and invalidate, which turns out to be easier to understand. In particular, the unlock action in List 6 “conceptually” flushes all variables from its CPU cache to main memory, and the following lock action “conceptually” invalidates all variables in its CPU cache so that subsequent variable uses load from main memory. You know from reference [6] that this doesn’t necessarily happen physically.

In List 6 you can produce other field values besides A before line 8 so that reader() can consume them after line 13. These field value assignments are not ordered among themselves, but they are ordered with respect to line 8. Similarly, these field value uses are not ordered among themselves either, but they are ordered with respect to line 13. This is usually what you want. But if you want to enforce ordering among the field variables themselves, you have to use volatile variables.

3.2  VOLATILE FIELDS

Accesses (reads or writes) to volatile variables can’t be reordered with each other, nor can they be reordered with normal field accesses around them. Here is List 2 rewritten using a volatile variable:

class ReadWriteExample {
    int A = 0;                  //normal field
    volatile boolean B = false; //a data-ready flag

    //thread1 runs this method
    void writer() {
        A = 10;   //stores 10 to memory location A
        B = true; //stores true to memory location B
    }

    //thread2 runs this method
    void reader() {
        //we use a while statement to spin the CPU for demo purposes;
        //using wait and notify may be more efficient
        while (!B) continue; //loads from memory location B
        assert A == 10;      //loads from memory location A
    }
}
List 7

Line 8 synchronizes-with line 15 (actually with the subsequent read of B in line 15). Based on the above rules, line 8 happens-before line 15 and an edge is established from line 8 to line 15. We also know that line 7 happens-before line 8 and line 15 happens-before line 16. From the transitivity of the ordering, we conclude that line 7 happens-before line 16. This execution trace is also legal under the JMM.
You’ve probably already guessed that the volatile write in line 8 and the volatile read in line 15 both implicitly call membar, directly or indirectly. You are probably right (some JVMs may implement volatile using a CPU atomic instruction which embeds a membar).
A volatile write has the same memory effect as a monitor unlock, and a volatile read has the same memory effect as a monitor lock. Please read reference [9] for more details.
Again, you can argue using the flush / invalidate mechanism.
Finally, volatile variables obviously can’t be allocated in CPU registers, which should sound familiar to C/C++ developers.

3.3  FINAL FIELDS

An important guarantee on final fields is quoted from Section 17.5 of JLSv3: “An object is considered to be completely initialized when its constructor finishes. A thread that can only see a reference to an object after that object has been completely initialized is guaranteed to see the correctly initialized values for that object's final fields.”
Here is the example from the same section 17.5, with additional comments:

class FinalFieldExample {
    final int x;
    int y;
    static FinalFieldExample f;

    public FinalFieldExample() {
        x = 3;
        y = 4;
        // f = this; //don’t do this
    }

    //one thread executes this method
    static void writer() {
        f = new FinalFieldExample();
    }

    //another thread executes this method
    static void reader() {
        if (f != null) {
            int i = f.x; //guaranteed to see 3
            int j = f.y; //could see 0
        }
    }
}
List 8


Because the compiler will insert a membar between lines 7 and 14, line 7 happens-before line 14, and reader() is guaranteed to see the properly initialized value for f.x if line 14 also happens-before line 19 (line 19 happens-before line 20 due to data dependency). This execution trace is also legal under the JMM.
If you uncomment line 9, the above guarantee no longer holds, because there is no ordering guarantee between lines 7 and 9.

An important application thanks to the above guarantee is quoted from Section 17.5 of JLSv3 “Final fields allow programmers to implement thread-safe immutable objects without synchronization. A thread-safe immutable object is seen as immutable by all threads, even if a data race is used to pass references to the immutable object between threads. This can provide safety guarantees against misuse of an immutable class by incorrect or malicious code. Final fields must be used correctly to provide a guarantee of immutability.”

If we remove field y from List 8, class FinalFieldExample becomes immutable. However, without the guarantee on final fields, line 20 could see the default value 0 before seeing the value 3, which means the FinalFieldExample object would be mutable to the reader() thread due to data races.
On the internet you can find many examples of immutable objects’ safety guarantees for multi-threaded programs.
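As a minimal sketch of the idiom (the class and its names are our own illustration), a thread-safe immutable object makes every field final and never lets this escape the constructor:

//All fields final, no setters, and "this" never escapes the
//constructor, so the final-field guarantee makes the object safe
//to publish even through a data race.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int getX() { return x; }
    int getY() { return y; }
}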

3.4  OTHERS

Here are other synchronizer constructs that are easier to understand, so we will not elaborate on them:

  • Each action in a thread happens-before every action in that thread that comes later in the program's order.
  • A call to start on a thread happens-before any action in the started thread.
  • The write of the default value (zero, false or null) to each variable happens-before the first action in every thread. Conceptually every object is created at the start of the program with its default initialized values.
  • All actions in a thread T1 happen-before any action in another thread T2 that detects that T1 has terminated. T2 may accomplish this by calling T1.isAlive() or T1.join() (see the sketch after this list).
  • If thread T1 interrupts thread T2, the interrupt by T1 happens-before any point where any other thread (including T2) determines that T2 has been interrupted (by having an InterruptedException thrown or by invoking Thread.interrupted or Thread.isInterrupted).
  • The completion of an object’s constructor happens-before the execution of its finalizer method.
  • More synchronizers and happens-before relationships are introduced in java.util.concurrent package. Please read the package’s javadoc description for more details.
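For instance, here is a minimal sketch of the thread-termination rule (the class and names are illustrative): once join() returns, every write made by the worker thread is visible.

class JoinExample {
    static int result = 0; //a plain field: no volatile or lock needed here

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(new Runnable() {
            public void run() { result = 42; }
        });
        worker.start(); //start() happens-before any action in worker
        worker.join();  //all of worker's actions happen-before join() returning
        assert result == 42; //guaranteed to be visible
    }
}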


Finally, we say happens-before is a partial order over all actions in a program’s execution trace, because CPUs or compilers can freely execute action A before action B, or vice versa, or execute both concurrently (they have no ordering relationship) if they are not constrained by any combination of the above happens-before rules. This freedom is the key to good performance, based on our analyses in Section 2.
We can think of it another way: a thread is free to run in isolation, without caring about any other threads, if it doesn’t see any Java synchronizer; otherwise it is obligated to show its effects to other threads through Java synchronizers.

4  SAFE MULTI-THREADING

Although the JMM is quite complex to implement and the underlying reordering is difficult to understand, a programmer only needs to ensure that his/her program is correctly synchronized for safe multi-threading. He/she doesn’t need to worry that reordering will affect the code.
This is a strong guarantee for programmers from Java’s sound memory model.

4.1  DATA RACE AND CORRECTLY SYNCHRONIZED

Before we explain how to synchronize our programs correctly, we first borrow two concepts from JLSv3: conflicting accesses and data races.
Two accesses to (reads of or writes to) the same variable are said to be conflicting if at least one of the accesses is a write.
When a program contains two conflicting accesses that are not ordered by a happens-before relationship, it is said to contain a data race.

A program is correctly synchronized if and only if all its sequential execution traces (program orders) are free of data races (the authors disagree with the phrase “sequentially consistent executions” used by JLSv3, because that term already rules out data races based on JLSv3’s own definition of sequential consistency).
An incorrectly synchronized program often exhibits counterintuitive behaviors, as shown in many examples in the JLS and JVMS. It usually doesn’t make sense to reason about an incorrectly synchronized program.

In List 1 there are both a reader and a writer for fields A and B, so the reader and writer conflict on both A and B. Because there is no Java synchronizer to establish any happens-before relationship, every execution trace of the program has a data race on both A and B. Accordingly, the program is incorrectly synchronized.
Its new version in List 6 has two sequential execution traces: one is writer() followed by reader(); the other is reader() followed by writer(). Because of the synchronized statement (lock and unlock actions), neither execution trace has a data race on field A. Accordingly, the program is correctly synchronized.
List 7 is also correctly synchronized. Its analysis is left to the reader.
Another very good example is discussed in section 17.4.5 of JLSv3. Readers are strongly recommended to read it.

4.2  SYNCHRONIZE ON BOTH READ AND WRITE ACCESSES

Some developers naively think they only need to synchronize on writes, because the synchronization will flush the new value to main memory so that a later read will find it.
Going back to List 6, a developer may want to remove the synchronization in lines 13 and 15 of method reader(). But then line 7 is no longer guaranteed to happen-before line 14, based on our analysis in Section 3.1.
You can also use flush and invalidate for reasoning: although the synchronization in the writer() thread flushes the new value to main memory, the reader() thread may still read its old cached value without synchronization.

Based on the definitions in Section 4.1, there is a data race and the program is not correctly synchronized if you only synchronize on the write access.
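A minimal sketch of the rule (a hypothetical holder class of our own): both the setter and the getter must synchronize on the same monitor, or the happens-before edge never forms:

class SharedValue {
    private int value; //guarded by "this"

    //writer thread: the unlock on exit publishes the write
    synchronized void set(int v) {
        value = v;
    }

    //reader thread: it must ALSO synchronize; a bare read of "value"
    //would race and could keep seeing a stale cached value
    synchronized int get() {
        return value;
    }
}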

4.3  ATOMIC OPERATIONS

The JMM guarantees that accesses to all Java primitive types and references (not the referenced objects), except double and long, are atomic. Although a single write to a non-volatile long or double value is not atomic, writes and reads of volatile long and double values are always atomic.

Sometimes the atomicity of operations on primitive types leads programmers to wrong assumptions, because they think the only purpose of using a Java synchronizer is to create atomic operations that prevent shared data structures from being corrupted. But from our analyses you know they have missed its other important application: the ordering guarantee.
Going back to List 7, some programmers want to remove the volatile keyword because they think fields A and B are both of primitive types and all accesses to them are atomic. However, without volatile you already know line 7 can’t be guaranteed to happen-before line 16. More specifically, the compiler is free to read B just once into a CPU register and reuse the cached value repeatedly in line 15. This would mean line 15 never terminates, even after another thread sets B to true in method writer().
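As a small sketch of the long / double caveat (an illustrative class of our own): without volatile, a 64-bit write is allowed to tear into two 32-bit halves on some JVMs, so a concurrent reader could observe a value neither thread ever wrote:

class TickCounter {
    //without "volatile", a write to this 64-bit field may be performed
    //as two separate 32-bit writes on some JVMs, so a reader could see
    //half of an update; volatile makes the access atomic (and ordered)
    private volatile long ticks = 0L;

    void set(long t) { ticks = t; } //atomic because ticks is volatile
    long get()       { return ticks; }
}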

4.4  DOUBLE-CHECKED LOCKING

Many programmers have seen the following double-checked locking idiom:

class SingletonDemo {
    private int f; //instance field
    private static SingletonDemo instance = null;

    private SingletonDemo() {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        if (instance == null) { //1st check for lazy initialization
            synchronized (SingletonDemo.class) {
                if (instance == null) { //2nd check for lazy initialization
                    instance = new SingletonDemo();
                }
            }
        }
        return instance;
    }
}
List 9

Line 13 triggers two writes: one is the initialization of the SingletonDemo object (the private constructor at line 5 is called); the other is the write of the reference to the class variable in line 3. Because the two writes can be reordered by either the CPU or the compiler, some thread may get a partially constructed SingletonDemo object.
There are at least three solutions. The first is to use the volatile keyword in line 3, which will ensure the initialization of the SingletonDemo object happens-before the write of the reference to the class variable in line 13 (a sketch follows List 10 below).
The second is to use the final keyword in line 2. This is left to the reader for further analysis.
The third is to rely on the JVM’s synchronized initialization of static fields, as follows. Readers are recommended to read section 12.4.2 of JLSv3 for the JVM’s detailed thread-safe initialization procedure.

class SingletonDemo {
    private int f; //instance field
    //the JVM guarantees line 4 is thread safe and happens-before line 10
    private static SingletonDemo instance = new SingletonDemo();

    private SingletonDemo() {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        return instance;
    }
}
List 10
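For completeness, here is a sketch of the first (volatile) solution; per the volatile semantics in Section 3.2, the constructor’s writes can no longer be reordered past the write of the reference:

class SingletonDemo {
    private int f; //instance field
    //volatile forbids reordering the reference write with the
    //constructor's field initialization
    private static volatile SingletonDemo instance = null;

    private SingletonDemo() {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        if (instance == null) {              //1st check (volatile read)
            synchronized (SingletonDemo.class) {
                if (instance == null) {      //2nd check
                    instance = new SingletonDemo(); //volatile write publishes f
                }
            }
        }
        return instance;
    }
}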

4.5  CAUSALITY

This is a hard one. We pull out the same example, with more comments, from section 17.4.8 of JLSv3:

//initially both fields x and y are zero
//Thread 1
r1 = x;
if (r1 != 0) y = 1;

//Thread 2
r2 = y;
if (r2 != 0) x = 1;
List 11

The code in List 11 is correctly synchronized, because neither of the writes in lines 4 and 8 will occur (so there are no data races) if the code is executed in a sequential program order.
Since the code is correctly synchronized, the only behaviors we can allow are sequentially consistent behaviors. However, under the happens-before or release-consistency memory model[5], both the compiler and the CPU are allowed to speculatively execute the write in line 4 or 8 before line 3 or 7, respectively, as long as it can later prove that its if condition is true.
Suppose thread 2 speculatively executes x = 1 first. Here is the execution trace:

x = 1;  //speculatively executed by thread 2
r1 = x; //thread 1 sees the write of x = 1 by thread 2
y = 1;  //executed by thread 1; caused by the read in line 2
r2 = y; //thread 2 sees the write of y = 1 by thread 1 and proves its speculation in line 1
List 12

Because line 3 can cause line 4 to execute (if x’s value is 1) and line 7 can cause line 8 to execute (if y’s value is 1) in List 11, the above execution trace forms a causality loop. It is illegal under the JMM.

We refer to the issue of when reads (line 2 in List 12) can see future writes (line 1 in List 12) as causality, which sometimes causes unacceptable out-of-thin-air behaviors like the one above.
Here is the good news: because the code is correctly synchronized, the JVM will ensure its sequentially consistent behavior without the programmer doing anything. Please read section 17.4.8 of JLSv3 if you really want to know how the JVM does so.

5  LIGHT-WEIGHT TASK FRAMEWORKS BEYOND JAVA'S HEAVYWEIGHT THREADS

A reader may have noticed two contradictory directions so far. On one hand, we allow CPUs and compilers to reorder code as much as possible in order to achieve higher instruction-level parallelism; on the other hand, a programmer must use Java synchronizers to prevent reordering by the CPU or compiler for conflicting data accesses, which lowers instruction-level parallelism. The more Java synchronizers, the more program performance degrades.
Also, as more and more threads are created to handle requested tasks, the cost of thread lifecycle management and context switches (because Java threads map one-to-one to OS native threads, these involve system calls) can be greater than the computation time of the tasks themselves.

So a light-weight task framework built on top of Java's heavyweight threads is often preferred. The Java executor framework was the first step forward. It allows users to submit fine-grained tasks, and it basically fixes the thread lifecycle overhead.
The upcoming release of Java SE 7 will include a parallel programming model called Fork/Join[15], built on the executor framework. Its lightweight task-scheduling mechanism, based on work-stealing from Cilk, greatly reduces Java synchronizer contention and facilitates more fine-grained task submission through recursive divide-and-conquer, as sketched below.
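Here is a sketch of that recursive divide-and-conquer style (a summing task of our own invention; the API shown follows the JSR-166 preview of Fork/Join and may differ in the final Java SE 7 release):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

//Sums an array by recursive divide-and-conquer on a work-stealing pool.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) { //small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          //make the left half stealable
        return right.compute() + left.join(); //do the right half ourselves
    }
}

//usage: long sum = new ForkJoinPool().invoke(new SumTask(array, 0, array.length));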

All the multi-threading frameworks we’ve talked about so far are based on shared-memory architectures such as SMP, which are generally prone to data races and even deadlocks. Ensuring data-race-free (correctly synchronized) code is often difficult for programmers at large. On the hardware side, shared-memory architectures only scale linearly to a small number of processors (usually 32) due to their cache coherence protocol overhead.
The Actor model can overcome these shortcomings through its native fit with distributed computing. Please refer to [16] for an introduction.

6  RESOURCES

The following five references give you an introduction to out-of-order execution on modern CPUs:
[1] Instruction Pipeline by Wikipedia
[2] Out-of-Order Execution by Wikipedia
[3] Speculative Execution by Wikipedia
[4] CPU Cache by Wikipedia
[5] Release Consistency by Wikipedia
This reference gives a hardware view of how mainstream processors handle out-of-order memory references:
[6] Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney
This reference tells you about current memory model developments in Java and other mainstream languages:
[7] Memory Models: A Case for Rethinking Parallel Languages and Hardware by Sarita V. Adve et al.
This reference describes an important compiler reordering mechanism:
[8] Instruction scheduling By Wikipedia
The following four references tell you everything about Java memory model (JMM) JSR-133:
[9] The JSR-133 Cookbook by Doug Lea
[10] The Java Memory Model by William Pugh
[11] Chapter 17 “Thread and Locks” in the Java Language Specification, Third Edition
[12] Chapter 8 “Threads and Locks” in the Java Virtual Machine Specification, Second Edition
This reference introduces some of Sun's current JIT compiler optimizations:
[13] The Java HotSpot Performance Engine Architecture
The following two references introduce the Java thread executor and fork/join frameworks:
[14] Concurrency JSR-166 Interest Site
[15] A Java Fork/Join Framework by Doug Lea
This reference introduces distributed computing with the actor model:
[16] Actor Model by Wikipedia

ABOUT THE AUTHORS

Yongjun Jiao has been a professional software developer for the past 10 years. His expertise covers Java SE, Java EE, Oracle, application tuning and high-performance, low-latency computing. He is a technical manager with SunGard Consulting Services.
Lili Zhou has been conducting academic research using C/C++ and Matlab in cluster environments for over 5 years. Her expertise is medical image processing. She holds a PhD from the Department of Electrical and Computer Engineering, Stony Brook University.


Comments

Karthik Balasub... replied on Wed, 2010/08/04 - 12:42pm

Brilliant. This is the most succinct explanation of the Java memory model that I've read. Many thanks! Karthik

Yongjun Jiao replied on Mon, 2010/08/09 - 12:04pm

Regarding the "Causality" in Section 4.5, it turns out that no processors support it. Here is the email response from Professor Adve, along with my original request:

Dear Yong,

There are no processors today that do this (otherwise, we would have trouble with the C++ and Java memory models prohibiting this). The goal here is to ensure (1) future processors never allow this outcome, and (2) to make it part of the formal semantics so programmers can rely on it. The data-race-free model clearly prohibits this outcome. Sarita

 

From: JIAO, YONGJUN [mailto:YONGJUN.JIAO@nexteraenergy.com]
Sent: Monday, August 02, 2010 5:06 PM
To: Adve, Sarita V
Subject:

Dear professor,

I learned a lot from your article "Memory Models: A Case for Rethinking Parallel Languages and Hardware".

However I do want to consult you for case (a) in Figure 4 on page 4:

what microprocessors support such a "causality loop"?

My understanding is when core 1 speculatively executes "Y = 1", it puts the new value in a temporary store (core 2 will not see it) until it detects the predicate in the "if" statement becomes true; I have similar argument for core 2.  So neither core 1 nor core 2 will make its speculative write operation visible to the other core.

Regards,

Yong Jiao

Darryl Miles replied on Sat, 2010/08/21 - 9:08pm

4.4 Double Checked Locking.

If I understand your claim correctly you are citing that the sequence of events (as seen in memory from another thread) might be:

  • Thread1 accesses SingletonDemo.GetInstance(); this causes the assignment of the null value (line 03) to occur before execution gets inside the GetInstance() method.
  • Thread1, Enters the synchronized() block.
  • Thread1 assigns the new instance value to the static field (line 13); memory is written. Crucially, this memory write occurs before the (memory write) assignment of the integer value 10 which happens later (see below); crucially, this is the major claim you are making with your example.
  • A context switch occurs, this slows/suspends Thread1 but allows Thread2 to run
  • Thread2, Calls GetInstance() and returns with a non-null instance, the one created above.
  • Thread2, Accesses SingletonDemo.f in some way to read a zero value
  • A context switch occurs, this wakes Thread1
  • Thread1 assigns value 10 to field 'f' (line 10); memory is written
  • Thread1, Exits the synchronized() block.
  • Thread1, Return value from method, reads null (line 17)

If I understand correctly, the crux of the claim is that a JVM is free to reorder memory writes even across block boundaries (or specifically across method invocation boundaries, and this method happens to be a constructor). Do you have a test case, and can you cite a publicly available 1.4.x or later JVM release from across the top 8 brands that proves this?

There was a situation a few years back with the GCC C language compiler where an optimization in relation to reordering writes was made; this broke a lot of code and, from recollection, although the C language specification as written could be interpreted to allow for this, it was just never done because it was non-intuitive to the users of the language. The de-facto standard born out of the implementations of the language says that although it was possible, it wasn't a good idea to do so. I would like to know if the same may be true for Java.

I would really appreciate it if someone could blog / explain very specifically the interpretation of the relevant sections of the specification used to achieve the possible outcome you propose. I just can not believe that there aren't already memory barriers in place for:

  • Crossing a method boundary (even if/when an actual function call was optimized away, so if a boundary exists in the code as written the JVM never forgets the programmers intention)
  • That all static initializers are instated before any new instance of a class can be created (before the first invocation of a constructor, the initializers have run and committed their memory writes). I am under the impression that static initializers are serialized by the JVM by type (i.e. by type by classloader) while class constructors are not serialized.

There maybe some language requirement in java to have a pseudo method call to enforce flushes (write) and reloads (read) of memory.

Now to come back on that example I would like to highlight this problem:
class SingletonDemo {
    private int f; //instance field
    private static SingletonDemo instance = null;

    private SingletonDemo() {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        SingletonDemo tmpInstance = instance;
        if (tmpInstance == null) { //1st check for lazy initialization
            synchronized (SingletonDemo.class) {
                if (instance == null) { //2nd check for lazy initialization
                    tmpInstance = new SingletonDemo();
                    instance = tmpInstance;
                } else {
                    tmpInstance = instance; //another thread won the race
                }
            }
        }
        return tmpInstance;
    }
}

Note by using the local variable tmpInstance, I am forcing the evaluation of the load from static variable to happen exactly once (in the use case where locking is not in force).  It is my understanding there is nothing stopping a lame JVM from evaluating the load from static variable 'instance' twice (when using your example).

Once in the outer 'if(instance == null)' test and once for the 'return instance;'.  But in my example by copying it to a local variable I am making it clear to the JVM that is _MUST_ ensure the value it tested for ==null is the same value it returns when !=null.

Obviously if 'instance' is never written to outside of the example code in the lifetime of the class/type then there is nothing to worry about in your example.  Since once a non-null value has been assigned it is never changed.  But there is code using this kind of snippet where 'instance' is reassigned to a different instance of type SingletonDemo() for whatever reason over the lifetime of the application.

A variation I use on this (where some piece of code needs to build a class tree and once built atomically make it visible to other users when they lookup this kind of global):

public static SingletonDemo getInstance() {
    SingletonDemo tmpInstance = instance;
    if (tmpInstance == null) { //1st check for lazy initialization
        tmpInstance = new SingletonDemo(); // construction of the class has no visible external side effects
        synchronized (SingletonDemo.class) {
            if (instance == null) { //2nd check for lazy initialization
                instance = tmpInstance; // we won
            } else {
                tmpInstance = instance; // we lost
            }
        }
    }
    return tmpInstance;
}

Two threads will race to set it; both will achieve construction, and only the last-minute assignment of the variable needs to lock out the other thread(s). The above is my idiom for your example. You cite three solutions, but I propose the above as a fourth, and as a model snippet for anyone to copy.

Darryl Miles replied on Sat, 2010/08/21 - 9:38pm

I would agree with Yongjun Jiao on 4.5 CAUSALITY.

It should be noted that there is a difference between a speculative execution being executed and visible only to the CPU which decided to speculate, and going on to make that speculation public by having the memory write live beyond the level 1 cache before the speculation has been proved correct.

I use the term level 1 cache to mean a cache which only a single CPU has visibility of. That CPU is free to speculate all it wants into it, so long as no other CPU can see it and the memory bus doesn't get a real write out of it.

If the speculation was wrong, that single CPU would invalidate the level 1 cache memory location, forcing it to recover the previous correct value for future operations, probably from the level 2 cache.

This is how I would envisage such things being done, should a CPU want to implement it.

 

The question about JVM memory model is really about what things are observable from another thread in respect of what is going on.  Much of what happens in the CPU that "goes wrong"  (branch prediction/speculation) stays contained in there with no visible external side effects.

You have to consider that while many parts of the Java specification "document explicit requirements", other parts that are "open to interpretation" really allow for future advancement and experimentation to occur without any major breaches in Java design. This doesn't necessarily mean:

  • Those advancements will be made (can money be made from the idea?)
  • Those advancements improve anything (after R&D we did not find the performance gains from this approach and took another one instead; the CPU world changes a lot in a few years)
  • Those advancements, if made, make programming easier (after R&D we found it better to contain the problem in silicon than expose it to developers to work around)
  • Those advancements, if made, benefit the majority of code and so become a standard optimization (this kind of thing may be useful but only in limited use cases, so to benefit the majority of code only the users wanting the feature will be required to jump through hoops to use it, rather than cluttering up all code and making everyone jump through hoops to work around it).

 

 

Yongjun Jiao replied on Tue, 2010/08/24 - 1:03am in response to: Darryl Miles

Regarding "Thread1, Return value from method, reads null (line 17)", I don't think thread1 can return NULL because if thread1 returns null in line 17, it means line 17 was reordered ahead of line 13 in my List 9. But I don't think the compiler will do such reordering for a return statement.

Regarding "If I understand correctly the crux of the claim is that a JVM is free to reorder memory writes even across block boundaries", I don't think the compiler will reorder operations across block / method boundaries. Actually, inlining methods to eliminate block / method boundaries is exactly what makes it possible to reorder operations that were originally in different methods (http://java.sun.com/products/hotspot/whitepaper.html).

Regarding "The de-facto standard born out of the implementations of the language says although it was possible it wasn't a good idea to do so, I would like to know if such maybe true for Java.", I totally concur. Although the current JMM has improved a lot and eliminated many counterintuitive behaviors such as out-of-thin-air values, it still has some way to go before most Java developers can easily understand it.

Regarding "That all static initializers are instated before any new instance of a class can be created (before the first invocation to a constructor is called, the initializers have run and committed their memory writes).", I don't think you can find such a rule in Doug Lea's JSR-133 cookbook. I know this is very counter-intuitive. More comments please!

Regarding "There maybe some language requirement in java to have a pseudo method call to enforce flushes (write) and reloads (read) of memory.", Doug Lea actually proposed a Fence API for the upcoming Java 7 release for this purpose.

Regarding "It is my understanding there is nothing stopping a lame JVM from evaluating the load from static variable 'instance' twice (when using your example).", I am not quite clear about this. But if a thread loads a second time in the return statement in line 17 of my List 9, it must have a compelling reason to do so. Since the return statement is not in any synchronized block, the thread should assume it is the only one running at that time.

Regarding your 2nd listing, I don't think it can fix the out-of-order write problem, because line 4 still incurs 2 memory writes: one is the write of tmpInstance; the other is SingletonDemo's field initialization. The 2 writes can still be reordered by either the compiler or the CPU. You may already know the web site with DCL discussions by thought leaders.

 

 

Darryl Miles replied on Thu, 2010/08/26 - 5:32am

"Thread1, Return value from method, reads null (line 17)" yes my mistake well spotted, my intention was "Thread1, Return value from method returns non-null value (line 17)", I shall edit the post if I can to fix this, thanks for pointing it out.

 

My point about not reordering memory writes across block boundaries should also include inlining. Just because a JVM/compiler chose to inline, I am claiming the optimizer should not forget the boundaries that do exist in the code as written in the Java language by the programmer, even if the optimizer removes some of those boundaries to perform allowed optimizations. What stops a JVM from claiming that everything is inlined, so it never needs to store anything back to memory, just so long as the thread using the data sees a consistent view?

Having read the JSR-133 cookbook link provided, the first paragraph under Memory Barrier concurs with my last paragraph (or at least with my interpretation of the cookbook).

 

Re: static initialization: Another point not made clear enough in my original comment was the ordering of static variable initializers with respect to static method invocation and constructor invocation. You requested "More Comments Please!" on this point. I believe these are all well defined by the specification:

  • The Class type is created; in doing this the static fields are initialized and any static code blocks are run. This is serialized by the JVM automatically. This also occurs outside the context of the application program, since it is a JVM infrastructure requirement during the class loading process. I take it as a given that all writes to memory caused by static field initializers and static code block execution have been committed to memory before a static method or constructor can be invoked. This is because other threads can invoke those methods in parallel, since once the Class type has been instated by the JVM it is free for use by any thread using the same ClassLoader. The thread that caused the initial class loading to take place has no special treatment with regard to memory write ordering; all threads are equal in this regard.
  • Now the static method getInstance() may be executed.
  • Which in turn creates a new instance of the class and calls the constructor.
I believe the class loading / creation process is all well-defined behavior with strict arbitration provided by the JVM (as I am sure you know already). If multiple threads go to use the same class at the same time, for the first time since JVM startup, typically the first thread in will do all the work (largely arbitrated by the JVM implementation); the other threads will be temporarily locked out and queued up until that work has completed, and then they will be able to act according to the results, typically locating the Class in the ClassLoader registry and reusing it right away. You must understand that in this situation it is possible for Thread1 to do all the work and instate the Class, then get put to sleep (before it gets a chance to use the Class), and Thread2 ends up being the first thread to invoke a static method or constructor on that class.

 

Re: the lame/simple JVM implementation performing two loads from memory: You are making a presumption that an optimization will occur. Optimization is never a requirement of a Java compiler implementation. This is to come back on your "must have a compelling reason" stance. Yes, I agree any decent compiler would load/copy the static field 'instance' value into a register and reuse the value in both statements, but the code as written does not require it to do this. However, it is really easy to ensure that this is the case. This is the point I make.

The code as written indicates that two read accesses on the memory location holding the variable 'instance' may occur. However, due to the invention of threading, it is possible for that memory location to be modified by another thread between the two reads. But writing it to tmpInstance provides the expected guarantees about the consistency of the variable being tested and returned, since no other thread can access my thread's stack and local data.

Your example is a single-assignment initialization from a null to a non-null value; this is not a real issue for your demo, as no other value is ever assigned. But some readers may be considering variations of it for lockless reads based on the atomic-read-of-object-reference rules.

 

Re: 2nd Listing: yes, it does incur two writes, but both writes are to private data areas that only the current thread can access. We then enter a synchronized() block. I say on entry we should get a global read barrier, and the JVM must drop any local copy of non-local variables that the current thread is using (this forces a reload from memory), but what does the JMM say? If what I claim is true, then those two writes get committed. Then the static variable is updated and written to memory, and then we exit the synchronized() block. It is customary for the C language and POSIX threading primitives to have exactly this behavior around lock and unlock points (which are the entry and exit points of a synchronized() block).

Yes, I have seen the reference in your last paragraph, but it looks like it was written circa 2001 (i.e. before 1.4.x) and before the recent advancements in CPU technology; i.e. the CPU optimizations that were foreseen in 2001 may not have played out exactly as expected, as CPU technology went in a different direction. At that link there are examples of how not to do it, but I did not see:

class SingletonDemo {
    private int f; //instance field
    private static SingletonDemo instance = null;
    private static Object lockObject = new Object();

    private SingletonDemo() {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        SingletonDemo tmpInstance = instance;
        if (tmpInstance == null) { //1st check for lazy initialization
            synchronized (lockObject) {
                tmpInstance = new SingletonDemo();
            }
            synchronized (SingletonDemo.class) {
                if (instance == null) { //2nd check for lazy initialization
                    instance = tmpInstance; // we won
                } else {
                    tmpInstance = instance; // we lost
                }
            }
        }
        return tmpInstance;
    }
}

Which is a minor variation on my 2nd listing. I surmise the extra synchronized() block is not needed.

 

Also one last point. As a programmer, the most useful behavior of a synchronized() block is that on entry into the block a read memory barrier is performed and on exit of the block a write memory barrier is performed. I am not sure if this is what the specification says. A read memory barrier means that all values read and referenced (from inside the block) must be reloaded from memory again; this would also implicitly mean that any unwritten data has to be flushed to memory as well, since you can't reload it from memory if you have not yet stored it to memory because it is a pending write to memory. It is ambiguous what to do about pending writes on data which is not referenced from inside the synchronized block, which is the angle of your double-checked locking example problem.

Maybe a pretty solution to this concern is to add additional argument(s) to synchronized() calls to describe the scope (only referenced vars inside the block, LOCAL / all data, GLOBAL), the type (read/write membar) and the timing (on entry / on exit of the synchronized block). Then have JVMs that do not support all modes of operation always degrade to the safest conservative mode providing the guarantee the programmer wanted. This way support for the new API can quickly be provided by all JVMs, but over time, as certain permutations of optimization prove their worth, JVM patch updates can implement those particular rules more efficiently.   synchronized(Object object, SynchronizedMode onEntry, SynchronizedMode onExit) { }   synchronized(object, SynchronizedMode.GLOBAL_READ, SynchronizedMode.LOCAL_WRITE) { }   where "GLOBAL_READ, GLOBAL_READ" would be the most conservative fallback mode, not forgetting that a read supersets/implies a write. I would like to see the existing synchronized(Object object) { } mean this conservative mode; it will not break anything but might make it less optimal, and this can be fixed in existing code by adding 2 appropriate arguments via search and replace.

 

Thank You for your article and getting people thinking again on the subject of the Java Memory Model.
