Martin is a high-performance and low-latency specialist, with experience gained over two decades working with large scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. He believes Mechanical Sympathy - applying an understanding of the hardware to the creation of software - is fundamental to delivering elegant, high-performance, solutions. Martin was the co-founder and CTO of LMAX, until he left to specialise in helping other people achieve great performance with their software. The Disruptor concurrent programming framework is just one example of what his mechanical sympathy has created. He blogs at mechanical-sympathy.blogspot.com Martin is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Adventures with AtomicLong

09.18.2011
| 8005 views |
  • submit to reddit

Sequencing events between threads is a common operation for many multi-threaded algorithms.  These sequences could be used for assigning identity to orders, trades, transactions, messages, events, etc.  Within the Disruptor we use a monotonic sequence for all events which is implemented as AtomicLong incrementAndGet for the multi-threaded publishing scenario.

While working on the latest version of the Disruptor I made some changes which I was convinced would improve performance, however the results surprised me.  I had removed some potentially megamorphic method calls and the performance got worse rather than better.  After a lot of investigation, I discovered that the megamorphic method calls were hiding a performance issue with the latest Intel Sandybridge processors.  With the megamorphic calls out of the way, the contention on the atomic sequence generation increased exposing the issue.  I've also observed this performance issue with other Java concurrent structures such as ArrayBlockingQueue.

I’ve been running various benchmarks on Sandybridge and have so far been impressed with performance improvements over Nehalem, especially for memory intensive applications due to the changes in its front-end.  However with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions.

Atomic instructions enable read-modify-write actions to be combined into an atomic operation.  A good example is incrementing a counter.  To complete the increment operation a thread must read the current value, increment it, and then write back the results.  In a multi-threaded environment these distinct operations could interleave with other threads doing the same with corrupt results as a consequence.  The normal way to avoid this interleaving is to take out a lock for mutual exclusion while performing the steps.  Locks are very expensive and often require kernel arbitration between threads.  Modern CPUs provide a number of atomic instructions which allow operations such as atomically incrementing a counter, or the ability to conditional set a pointer reference if the value is still as expected.  These operations are commonly referred to as CAS (Compare And Swap) instructions.  A good way to think of these CAS instructions is like optimistic locks, similar to what you experience when using a version control system like Subversion or CVS.  You try to make a change and if the version is what you expect then you succeed, otherwise the action aborts.

On x86/x64 these instructions are known as “lock” instructions.  The "lock" name comes from how a processor, after setting its lock signal, would lock the front-side/memory bus (FSB) for serialising memory access while the three steps of the operation took place atomically.  On more recent processors the lock instruction is simply implemented by getting an exclusive lock on the cache-line for modification.

These instructions are the basic building blocks used for implementing higher-level locks and semaphores.  This is, as will be explained shorty, why I've seen performing issues on Sandybridge for ArrayBlockingQueue in some of the Disruptor comparative performance tests.

Back to my benchmark.  The test was spending significantly more time in AtomicLong.incrementAndGet() than I had previously observed.  Initially, I suspected an issue with JDK 1.6.0_27 which I had just installed.  I ran the following test with various JVMs, including 1.7.0, and kept getting the same results.  I then booted different operating systems (Ubuntu, Fedora, Windows 7 - all 64-bit), again the same results.  This lead me to write an isolated test which I ran on Nehalem (2.8 GHz Core i7 860) and Sandybridge (2.2Ghz Core i7-2720QM).

import java.util.concurrent.atomic.AtomicLong;

public final class TestAtomicIncrement
    implements Runnable
{
    public static final long COUNT = 500L * 1000L * 1000L;
    public static final AtomicLong counter = new AtomicLong(0L);

    public static void main(final String[] args) throws Exception
    {
        final int numThreads = Integer.parseInt(args[0]);
        final long start = System.nanoTime();
        runTest(numThreads);
        System.out.println("duration = " + (System.nanoTime() - start));
        System.out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads)
        throws InterruptedException
    {
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestAtomicIncrement());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = 0L;
        while (i < COUNT)
        {
            i = counter.incrementAndGet();
        }
    }
}

 

Figure 1.

After running this test on 4 different Sandybridge processors with a range of clock speeds, I concluded that using LOCK CMPXCHG, under contention with increasing numbers of threads, is much less scalable than the previous Nehalem generation of processors.  Figure 1. above charts the results in nanoseconds duration to complete 500 million increments of a counter with increasing thread count.  Less is better.

I confirmed the JVM was generating the correct instructions for the CAS loop by getting Hotspot to print the assembler it generated.  I also confirmed that Hotspot generated identical assembler instructions for both Nehalem and Sandybridge.

I then decided to investigate further and write the following C++ program to test the relevant lock instructions to compare Nehalem and Sandybridge.  I know from using “objdump -d” on the binary that the GNU Atomic Builtins generate the lock instructions for ADD, XADD, and CMPXCHG, for the respectively named functions below. 

#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500LL * 1000LL * 1000LL;
volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }
}

void* run_xadd(void*)
{
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }
}

void* run_cas(void*)
{
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }
}

int main (int argc, char* argv[])
{
    const int NUM_THREADS = atoi(argv[1]);

    pthread_t threads[NUM_THREADS];
    void* status;
    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = "  << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "counter = "  << counter << std::endl;

    return 0;
}

Figure 2.

It is clear from Figure 2. that Nehalem performs nearly an order of magnitude better for atomic operations as contention increases with threads.  I found LOCK ADD and LOCK XADD to be similar so I've only charted XADD for clarity.  The CAS operations for C++ and Java are comparable.

It is also very interesting how XADD greatly outperforms CAS and gives a nice scalable profile.  For 3 threads and above, XADD does not degrade further and simply performs at the rate at which the processor can keep the caches coherent.  Nehalem and Sandybridge level out respectively at ~100m and ~20m XADD operations per second for 3+ concurrent threads, whereas CAS continues to degrade with increasing thread count because of contention.  Naturally, performance degrades when QPI links are involved for a multi-socket scenario.  Oracle have now accepted that not supporting XADD is a bug and will hopefully fix it soon for the JVM. 

As to the performance I’ve observed with Sandybridge, it would be great if others could confirm my findings so we can all feedback to Intel and have this addressed.  I've not been able to get my hands on a server class system with Sandybridge.

 

From http://mechanical-sympathy.blogspot.com/2011/09/adventures-with-atomiclong.html

Published at DZone with permission of Martin Thompson, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Slava Imeshev replied on Mon, 2011/09/19 - 9:08pm

Martin,

Nice catch!

Though, there is question common for all microbechmarks: what percenage of execution time is a normally loaded thread going to spend in incrementAndGet() ?  

Regards,

Slava Imeshev

Cacheonix | Clustered Java Cache

Martin Thompson replied on Tue, 2011/09/20 - 1:58am in response to: Slava Imeshev

Hi Slava,

It was enough for me to notice ;-)

For most applications this will not be a major issue.  There are some domains such as finance, games and military where performance is being stretched.  On a financial exchange, latency of response to high frequency traders is very important.  I've seen hundreds of transactions arrive in the same microsecond under burst conditions, and if a queuing effect kicks in because of lock instructions, then the last to arrive will see a relatively large difference in latency compared to the first.

If you are off to the database on every request you handle from users then don't lose any sleep over this :-) If your application does a lot of inter thread/process messaging, or uses locks, then you may see a real difference.  These instructions are the basic building blocks for locks.  This was a real surprise to me because everything else about Sandybridge has been so impressive.

Martin...

Wojciech Kudla replied on Wed, 2011/09/21 - 2:11am

Martin, this is indeed a very valuable finding; thanks for sharing this with us. I've been surprised to see this sort of performance degradation introduced with brand new line of Intel CPUs.
Did you have any opportunity to compare against CMPXCHG and XADD for AMD platforms?

Martin Thompson replied on Fri, 2011/09/23 - 1:56pm in response to: Wojciech Kudla

Wojciech,

I've only been able to check on a Shanghai processor (AMD Opteron Processor 2376) and they are not even close to Intel on performance.  XADD is 50% worse even than Sandybridge.  CAS is 2.5X worse than Sandybridge.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.