Martin is a high-performance and low-latency specialist, with experience gained over two decades working with large scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. He believes Mechanical Sympathy - applying an understanding of the hardware to the creation of software - is fundamental to delivering elegant, high-performance, solutions. Martin was the co-founder and CTO of LMAX, until he left to specialise in helping other people achieve great performance with their software. The Disruptor concurrent programming framework is just one example of what his mechanical sympathy has created. He blogs at mechanical-sympathy.blogspot.com Martin is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Java Lock Implementations

11.22.2011
| 9387 views |
  • submit to reddit
We all use 3rd party libraries as a normal part of development.  Generally, we have no control over their internals.  The libraries provided with the JDK are a typical example.   Many of these libraries employ locks to manage contention.

JDK locks come with two implementations.  One uses atomic CAS style instructions to manage the claim process.  CAS instructions tend to be the most expensive type of CPU instructions and on x86 have memory ordering semantics.  Often locks are un-contended which gives rise to a possible optimisation whereby a lock can be biased to the un-contended thread using techniques to avoid the use of atomic instructions.  This biasing allows a lock in theory to be quickly reacquired by the same thread.  If the lock turns out to be contended by multiple threads the algorithm with revert from being biased and fall back to the standard approach using atomic instructions.  Biased locking became the default lock implementation with Java 6.

When respecting the single writer principle, biased locking should be your friend.  Lately, when using the sockets API, I decided to measure the lock costs and was surprised by the results.  I found that my un-contended thread was incurring a bit more cost than I expected from the lock.  I put together the following test to compare the cost of the current lock implementations available in Java 6.

The Test

For the test I shall increment a counter within a lock, and increase the number of contending threads on the lock.  This test will be repeated for the 3 major lock implementations available to Java:
  1. Atomic locking on Java language monitors
  2. Biased locking on Java language monitors
  3. ReentrantLock introduced with the java.util.concurrent package in Java 5.
I'll also run the tests on the 3 most recent generations of the Intel CPU.  For each CPU I'll execute the tests up to the maximum number of concurrent threads the core count will support.

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

The Code
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long ITERATIONS = 500L * 1000L *1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;
    private static CyclicBarrier barrier;

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);
       
        runTest(numThreads); // warm up
        counter = 0L;

        final long start = System.nanoTime();
        runTest(numThreads);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads) throws Exception
    {
        barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}


Script the tests:


set -x
for i in {1..8}; do java -XX:-UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java -XX:+UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java TestLocks JUC $i; done

The Results

Figure 1.

Figure 2.

Figure 3.

Observations
  1. Biased locking, in the un-contended case, is ~10% more expensive than the atomic locking.  It seems that for recent CPU generations the cost of atomic instructions is less than the necessary housekeeping for biased locks.  Previous to Nehalem, lock instructions would assert a lock on the memory bus to perform the these atomic operations, each would cost more than 100 cycles.  Since Nehalem, atomic instructions can be handled local to a CPU core, and typically cost only 10-20 cycles if they do not need to wait on the store buffer to empty while enforcing memory ordering semantics.
  2. As contention increases, language monitor locks quickly reach a throughput limit regardless of thread count.
  3. ReentrantLock gives the best un-contended performance and scales significantly better with increasing contention compared to language monitors using synchronized.
  4. ReentrantLock has an odd characteristic of reduced performance when 2 threads are contending.  This deserves further investigation.
  5. Sandybridge suffers from the increased latency of atomic instructions I detailed in a previous article when contended thread count is low.  As contended thread count continues to increase, the cost of the kernel arbitration tends to dominate and Sandybridge shows its strength with increased memory throughput.
Conclusion

Biased locking should no longer be the default lock implementation on modern Intel processors.  I recommend you measure your applications and experiement with the -XX:-UseBiasedLocking JVM option to determine if you can benefit from using atomic lock based algorithm for the un-contended case.

When developing your own concurrent libraries I would recommend ReentrantLock rather than using the synchronized keyword due to the significantly better performance on x86, if a lock-free alternative algorithm is not a viable option.

Dave Dice has pointed out that biased locking is not implemented for the locks created in the first few seconds of the JVM startup.  I'll re-run my tests this week and post the results.

 

From http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html

Published at DZone with permission of Martin Thompson, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Endre Varga replied on Tue, 2011/11/22 - 4:04am

Umm, you have not used the switch -XX:BiasedLockingStartupDelay=0, so biased locking won't have the time to start up... This is the same mistake Peter Lawrey did in the Lock vs synchronized benchmark, and it gives about a magnitude difference when the lock is not contended too much. I wanted to write an article about it, but forgot completely.

Endre Varga replied on Tue, 2011/11/22 - 4:05am

Oh, sorry, I missed the last sentence... I am very interested in the updated results!

Martin Thompson replied on Wed, 2011/11/23 - 4:03am

New post with updated results.

 http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html

Endre Varga replied on Wed, 2011/11/23 - 4:33am in response to: Martin Thompson

Thank you for the update!

Btw, there is a wonderful overview image about synchronization here:
http://wikis.sun.com/display/HotSpotInternals/Synchronization

Aniceto Garcia replied on Thu, 2012/01/26 - 4:40am

A lock is a tool for controlling access to a shared resource by multiple threads. Commonly, a lock provides exclusive access to a shared resource: only one thread at a time can acquire the lock and all access to the shared resource requires that the lock be acquired first. However, some locks may allow concurrent access to a shared resource, such as the read lock of a ReadWriteLock.

The use of synchronized methods or statements provides access to the implicit monitor lock associated with every object, but forces all lock acquisition and release to occur in a block-structured way: when multiple locks are acquired they must be released in the opposite order, and all locks must be released in the same lexical scope in which they were acquired.

Java Collection

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.