Martin is a high-performance and low-latency specialist, with experience gained over two decades working with large scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. He believes Mechanical Sympathy - applying an understanding of the hardware to the creation of software - is fundamental to delivering elegant, high-performance, solutions. Martin was the co-founder and CTO of LMAX, until he left to specialise in helping other people achieve great performance with their software. The Disruptor concurrent programming framework is just one example of what his mechanical sympathy has created. He blogs at mechanical-sympathy.blogspot.com Martin is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Biased Locking, OSR, and Benchmarking Fun

11.27.2011
| 3839 views |
  • submit to reddit

After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach.  As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before.  Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.

In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs.  While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take account of some JVM warm up behaviour that I was unaware of.

In this post I will re-run the experiment taking into account the feedback and present some new results.  I shall also expand on the changes I've made to the test and why it is important to consider the JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start up time.

On Stack Replacement (OSR)

Java virtual machines will compile code to achieve greater performance based on runtime profiling.  Some VMs run an interpreter for the majority of code and replace hot areas with compiled code following the 80/20 rule.  Other VMs compile all code simply at first then replace the simple code with more optimised code based on profiling.  Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.

Oracle Hotspot will count invocations of a method return plus branch backs for loops in that method, and if this exceeds 10K in server mode the method will be compiled.  The compiled code on normal JIT'ing can be used when the method is next called.  However if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go.  OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.

I was under the impression that normal JIT'ing and OSR would result in similar code.  Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if nested.  For example, bounds checking within the loop may not be possible to eliminate. Cliff will blog in more detail on this shortly.

What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one.  You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.

Biased Locking

Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant.  Other VMs such as Azul use biased locking right from the start which is not an issue because their revocation is cheap.

All objects by default are created with biased locking enabled in Oracle Hotspot after the first few seconds of start-up delay, and can be configured with -XX:BiasedLockingStartupDelay=0.

This point, combined with knowing more about OSR, is important for micro-benchmarks.  It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.

The Code

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}



Script to run tests:

set -x
for i in {1..8}
do 
    java -server -XX:-UseBiasedLocking TestLocks JVM $i
done

for i in {1..8}
do 
    java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done

for i in {1..8}
do 
    java -server TestLocks JUC $i
done 



Results

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

Nehalem 2.8GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
153,283,461
450,950,969
62,876,566
218,519,295
18,108,615
10,217,186
313,349,605
13,416,198
14,108,622
48,120,172
8,040,773
14,207,310
54,725,114
4,551,766
14,302,683
65,133,706
5,246,548
14,676,616
75,473,652
5,585,666
18,145,525
85,514,056
5,414,171
19,010,725


Sandybridge 2.0GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
1
34,500,407
396,511,324
43,148,808
2
20,899,076
19,742,639
6,038,923
3
9,288,039
11,957,032
24,147,807
4
5,618,862
5,589,289
9,082,961
5
5,609,932
5,592,574
9,389,243
6
5,742,907
5,760,558
12,518,728
7
6,699,201
6,641,886
13,684,475
8
6,957,824
6,925,410
14,819,005

Observations
  1. Biased locking has a huge benefit in the un-contended single threaded case.
  2. Biased locking when un-contended, and not revoked, only adds 4-5 cycles of cost.  This is the cost when having a cache hit for the lock structures, on top of the code protected in the critical section.
  3. -XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
  4. Avoiding OSR does not make a material difference to this set of test results.  This is likely to be because the loop is so simple or other costs are dominating.
  5. For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.
Conclusion

My tests in the last post are invalid for the testing of an un-contended biased lock, because the lock was not actually biased.  If you are designing code following the single writer principle, and therefore having un-contended locks when using 3rd party libraries, then having biased locking enabled is a significant performance boost.

 

From http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html

Published at DZone with permission of Martin Thompson, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Amara Amjad replied on Sun, 2012/03/25 - 1:02am

Great post!

I 'd like to ask a couple of questions. You mention that sometime after 10k iterations, compilation of the method happens, and that if you repeat this warmup phase more times, better optimization of the method's code is likely to occur. In your code you do 10 iterations, 100k each. Is there any neat trick behind that?:P

Regarding the values on the tables, are they from a single run or are they the average/min/max values of a number of runs?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.