Peter is a DZone MVB and is not an employee of DZone and has posted 162 posts at DZone. You can read more from them at their website. View Full User Profile

Hardware Transactional Memory in Java, or Why synchronized Will be Cool Again.

02.03.2014
| 4121 views |
  • submit to reddit

Hardware Transaction Memory has the potential to allow multiple threads to speculatively access the same data structure at the same time and let the cache coherence protocol determine if a conflict occurred.  HTM aims to give you the scalability of fine grain locking, the simplicity of course grain locking, and performance close to no locking at all.  If supported by the JVM, your program or library was written with course grain lock it could mean your application can scale to many more cores with little change.

While adding support for this into C and C++ is non trivial, adding support to native code generated by the JVM might be done without having to change the byte code.

In short, this could allow many threads to speculatively execute synchronized blocks for a lock concurrently, even concurrent writes, and the processor work out whether this was a problem and repeats the block until it is not.

What is Hardware Transaction Memory and how much will it cost? 

Hardware Transactional Memory has been around for some time but until recently it hasn't been main stream.  With Intel introducing support into their 4th Generation i3/i5/i7 processors (Haswell) and their E5-2600 v2 family processors, HTM is widely available to new Intel based machines.  Already, some vendors have more models of systems using the new processors than the previous generations.  AFAIK, AMD plan to add this functionality soon.

The hardware you will buy, and perhaps have already, will do this. Many new models of laptop have Haswell processors as they have significantly improved power consumption.

How might it work? 

synchronized blocks are often used in Java on a just-in-case basis.  To simplify the code, these locks are often much more coarse than is optimal e.g. Hashtable locks the whole object/Map for any operation vs ConcurrentHashMap which has fine grain locking.  Writing fine grain locking is much harder to get right and thus more error prone.  The goal of Hardware Transaction Memory is to support course grained locking but get the benefit of fine grain locking.  This is particularly useful for code where re-optimising the code is not practical.

Example
private final Map map = new HashMap<>(); 
public synchronized PooledObject acquireObject(String key) {
    PooledObjectobject = map.get(key);
    if (object == null)
        map.put(key, object = new PooledObject());
    return map;
}


You might expect the common case

  • only reads the map
  • updates the map, but in different places e.g. different keys.
  • rarely attempts to update the same key in two threads at once.



What you would like is

  • concurrent execution between threads.
  • very small over head compared to code without locking.
  • the CPU or JVM to do all the work to optimise this i.e you don't have to change the code.



Without HTM, the synchronized block needs to obtain a lock and enforce serialized access even though the majority case is a read operation.

With HTM the byte code could be turned into pseudo code like this

public PooledObject acquireObject(String key) {
    int code;
    do {
        xbegin();
        PooledObjectobject = map.get(key);
        if (object == null)
            map.put(key, object = new PooledObject());
        return map;
    } while((code = xend()) == RETRYABLE);
    if (code != DONE) {
        // take corrective action such as
        // obtain a normal lock and repeat
    }
}


The XEND instruction in the CPU checks whether any cache line accessed was modified by another CPU/thread.  If not, the changes made are committed.  Otherwise any changes are dropped and the loop can be repeated.

Note: rolling back the transaction means reverting the changes and could even mean rolling back the object creation where it doesn't have significant side effects.  If it does have side effects there is an XABORT instruction which can trigger the transaction to abort and the fall back code would need to be run.

Compare And Swap is limited to 64-bits what is the limit of these transactions? 

The limit is the number of lines you can store in your L1 cache.  This is up to 32 KB.  If you have hyper threading this might be half i.e. 16 KB. Also, the L1 cache is 8 way associative so in the worst case 9 cache lines which hash to the same bucket could cause the transaction to fail. (less with hyper threading)  Never the less, it is much higher and far more flexible than CAS 64-bit or 2CAS which is 128-bit.

Writing this transaction locking structure with fall back add boilerplate and duplicate code in a language like C.

Conclusion 

The beauty of this pattern is it can be applied to Java code which has already been compiled and available as open source libraries.  Unlike C code which would need significant rework to exploit this functionality, Java programs could take advantage of HTM without re-compilation.  What we would need is a change in the JVM.

ReferencesTransactional Synchronization Extensions on Wikipedia

Benchmarks : Haswell's TSX and Memory Transaction Throughput (HLE and RTM) from SiSoftware
Fun with Intel® Transactional Synchronization Extensions from Intel
Transactional memory support: the speculative_spin_mutex from Intel
Making Sense of the Intel Haswell Transactional Synchronization eXtensions by Johan De Gelas

Published at DZone with permission of Peter Lawrey, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)