Peter is a DZone MVB and is not an employee of DZone and has posted 161 posts at DZone.

C++ or Java, which is faster for high frequency trading?

There are conflicting views on the best solution for high frequency trading. Part of the problem is that what counts as high frequency trading varies more than you might expect; another part is what is meant by faster.

My View

If you have a typical Java programmer and a typical C++ programmer, each with a few years' experience writing a typical object-oriented program, and you give them the same amount of time, the Java programmer is likely to have a working program earlier and will have more time to tweak the application. In this situation it is likely the Java application will be faster. IMHO.

In my experience, Java is better than C++ at detecting code which doesn't need to be run, especially in micro-benchmarks which don't do anything useful. ;) If you tune Java and C++ as far as they can go, given any amount of expertise and time, the C++ program will be faster. However, given limited resources and a changing environment, i.e. in real-world applications, a dynamic language will outperform.
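
To illustrate the point about micro-benchmarks, here is a minimal sketch (the class and method names are hypothetical, not from the original article). The first loop computes a value nobody reads, so a JIT compiler is free to remove it and the timing may end up measuring almost nothing; the second returns its result, which keeps the work observable. Actual timings depend on the JVM and hardware, so none are shown.

```java
final class DeadCodeDemo {
    // The sum is never used, so the JIT is allowed to eliminate the whole loop.
    static void discardResult(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
    }

    // Returning the sum keeps the result live for the caller.
    static long keepResult(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    // Average wall-clock time per run, in nanoseconds.
    static long timeNanos(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) task.run();
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        final int n = 100_000;
        long[] sink = new long[1]; // consumes results so they are not dead
        long discarded = timeNanos(() -> discardResult(n), 10_000);
        long kept = timeNanos(() -> sink[0] += keepResult(n), 10_000);
        System.out.println("discarded result: " + discarded + " ns/run");
        System.out.println("kept result:      " + kept + " ns/run (sink=" + sink[0] + ")");
    }
}
```

If the first figure comes out far smaller than the second, the benchmark is most likely timing code that was optimised away rather than the loop itself.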

In the equities space, you need latencies below 10 us to be seriously high frequency. Java, and even standard OOP C++, on commodity hardware is not an option. You need C or a cut-down version of C++, and specialist hardware such as FPGAs or GPUs.

In FX, high frequency means latencies below 100 us. In this space, C++ or a cut-down Java (low GC) with a kernel bypass network adapter is an option, and either language will have pluses and minuses. Personally, I think Java gives more flexibility as the exchanges are constantly changing, assuming you believe you can use IT for competitive advantage.
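
As a rough sketch of what "low GC" Java can mean in practice, the example below reads messages into a single reusable object instead of allocating one per message, so the hot loop creates no garbage. The message layout (two longs and an int) and all class, field and method names are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-layout quote: instrument id, price in ticks, quantity.
final class ReusableQuote {
    long instrumentId;
    long priceTicks;
    int quantity;

    // Overwrites this object's fields from the buffer; nothing is allocated.
    void readFrom(ByteBuffer in) {
        instrumentId = in.getLong();
        priceTicks = in.getLong();
        quantity = in.getInt();
    }
}

final class LowGcDemo {
    // Sums price * quantity over 'count' quotes using one preallocated object.
    static long sumNotional(ByteBuffer in, int count) {
        ReusableQuote quote = new ReusableQuote(); // allocated once, reused
        long total = 0;
        for (int i = 0; i < count; i++) {
            quote.readFrom(in);
            total += quote.priceTicks * quote.quantity;
        }
        return total;
    }
}
```

The trade-off is that the reused object is mutable and must not be retained past the next read, which is part of why this style is described as a cut-down Java.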

In many cases, when people (especially banks) talk about high frequency, they mean sub-1 ms or single-digit ms. In this space, I would say the flexibility/dynamic programming of Java, Scala, C# etc. gives you time-to-market, maintainability and reliability advantages over C/C++ or FPGAs.

The problem Java faces

The problem is not in the language as such, but a lack of control over caches, context switches and interrupts. If you repeatedly copy a block of memory, something which is done by a native method, and vary the delay between runs, the copy gets slower depending on what has happened between copies.

The problem is not GC or Java, as neither plays much of a part. The problem is that part of the cache has been evicted, so the copy itself takes longer. The same is true of any operation which accesses memory, e.g. accessing plain objects will also be slower.

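// Requires: import java.util.Arrays;
private void doTest(Pauser delay) throws InterruptedException {
    int[] times = new int[1000 * 1000];
    byte[] bytes = new byte[32 * 1024];
    byte[] bytes2 = new byte[32 * 1024];
    // Stop after roughly 5 seconds, even if the times array is not yet full.
    long end = System.nanoTime() + (long) 5e9;
    int i;
    for (i = 0; i < times.length; i++) {
        // Time a single 32 KB copy.
        long start = System.nanoTime();
        System.arraycopy(bytes, 0, bytes2, 0, bytes.length);
        long time = System.nanoTime() - start;
        times[i] = (int) time;
        // Pause between copies; assumed, as the posted listing omits this call.
        delay.pause();
        if (start > end) break;
    }
    // Sort the first i samples so percentiles can be read off by index.
    Arrays.sort(times, 0, i);
    System.out.printf(delay + ": Copy memory latency 1/50/99%%tile %.1f/%.1f/%.1f us%n",
            times[i / 100] / 1e3,
            times[i / 2] / 1e3,
            times[i - i / 100 - 1] / 1e3
    );
}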

The test does the same thing many times, with a different delay between repetitions on each run. The test spends most of its time in native methods, and no objects are created or discarded during the test.

YIELD: Copy memory latency 1/50/99%tile 1.6/1.6/2.3 us
NO_WAIT: Copy memory latency 1/50/99%tile 1.6/1.6/1.6 us
BUSY_WAIT_10: Copy memory latency 1/50/99%tile 3.1/3.5/4.4 us
BUSY_WAIT_3: Copy memory latency 1/50/99%tile 2.7/3.0/4.0 us
BUSY_WAIT_1: Copy memory latency 1/50/99%tile 1.6/1.6/2.6 us
SLEEP_10: Copy memory latency 1/50/99%tile 2.3/3.7/5.2 us
SLEEP_3: Copy memory latency 1/50/99%tile 2.7/4.4/4.8 us
SLEEP_1: Copy memory latency 1/50/99%tile 2.8/4.6/5.0 us

The typical time (the middle value) to perform the memory copy varies between 1.6 and 4.6 us, depending on whether there was a busy wait or a sleep of 1 to 10 ms between copies. This is a ratio of about 3x which has nothing to do with Java; it comes from something Java has no real control over. Even the best times vary by about 2x.
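
The Pauser type passed to doTest is not shown in the listing. A minimal sketch of what it might look like follows, assuming an enum whose constant names match the labels in the output and a pause() method as seen in the compilation log in the comments. The bodies, and the reading of the numeric suffix as milliseconds, are guesses based on the description above rather than the author's actual code.

```java
// Hypothetical reconstruction: one constant per delay strategy.
enum Pauser {
    NO_WAIT {
        @Override void pause() { /* no delay between copies */ }
    },
    YIELD {
        @Override void pause() { Thread.yield(); }
    },
    BUSY_WAIT_1 {
        @Override void pause() { busyWait(1_000_000L); } // spin for about 1 ms
    },
    BUSY_WAIT_10 {
        @Override void pause() { busyWait(10_000_000L); } // spin for about 10 ms
    },
    SLEEP_1 {
        @Override void pause() throws InterruptedException { Thread.sleep(1); }
    },
    SLEEP_10 {
        @Override void pause() throws InterruptedException { Thread.sleep(10); }
    };

    abstract void pause() throws InterruptedException;

    // Keeps the thread on the CPU until the deadline has passed.
    static void busyWait(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) {
            // spin
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (Pauser p : values()) {
            long start = System.nanoTime();
            p.pause();
            System.out.println(p + " paused for " + (System.nanoTime() - start) / 1000 + " us");
        }
    }
}
```

A busy wait keeps the thread running on its core, whereas yield and sleep hand control to the scheduler, so other work can run on that core and disturb its caches before the next copy.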

The code


In ultra-high frequency trading, the core engine will be more C, assembly and custom hardware than OOP C++ or Java. In markets where the latency requirements of the engine are less tight, C++ and low-GC Java become an option. As latency requirements become less tight still, Java and other dynamic languages can be more productive. In this situation, Java is faster to market, so you can take advantage of changes in the market or requirements.



Published at DZone with permission of Peter Lawrey, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)



Victor Tsoukanov replied on Thu, 2011/08/04 - 12:16am

Which is faster Assembler or C++? Sure, C++ would be a loser in every way. Your question is not correct. You are selecting particular language for development at first and if you need to have performance you need to have scalable design which does not depend on language at all.

Arnaud Des_vosges replied on Thu, 2011/08/04 - 4:44am in response to: Victor Tsoukanov

At runtime, there are not many differences between assembler and C/C++ (except virtual methods/RTTI maybe), while the JVM works on a memory/object model at a higher level (GC, range/null checks, ...) even if the code itself is JIT-ed.

What matters is managed code (Java/Scala/C#/...) vs non-managed code (Asm/C/C++).

A nice aspect of Java/JVM is that it's relatively "close to the metal" at runtime to have correct performance (compared to some recent dynamic languages for instance) and at the same time it still provides a much better level of abstraction than non-managed languages.

dieter von holten replied on Thu, 2011/08/04 - 6:30am

can you provide some technical background info on what else is going on in this context? how is stock-price distributed? with a broadcast to all interested parties? is it polled from somewhere? what happens when your app calculates something from the data and issues a buy-order to the stock-server? how does it arrive there? what happens there/then? first come/first served? is anything logged (presumably after the decision is made)? that would block the thread for a while before it can handle another request. are any firewalls involved? what is a typical hardware/software platform for that?

to me, investigating this sub-milliseconds time-wasters is technically interesting - but does it really matter when there so many 'imponderabilities' in the whole round-trip?


Mike P(Okidoky) replied on Thu, 2011/08/04 - 8:54am

" If you copy a block of memory, something which occurs in native memory, but using a different delay between runs, that copy gets slower depending on what has happened between copies. "

Could that be because you're calling doTest just once from your main in your test app/case? In that case, doTest might not have been fully machine-codyfied and optimized. Same mistake as that shoot-out website kept making: warm up time, not enough efforts to iron out any warm up time.

The solution in these micro-benchmarks was to call doTest multiple times (in your case). The second and third runs should show different results than the first run.

Michael Barker replied on Thu, 2011/08/04 - 11:06am

I ran the test with -XX:+PrintCompilation switched on. I think hotspot is kicking in on some of the runs which may explain some of the jitter that you are seeing in the busy wait cases which "should" be fairly stable. In the yielding and sleeping cases a syscall is required, which a C/C++ implementation would not be able to avoid so would likely exhibit similar behaviour.
Warmup - ---   n   java.lang.System::nanoTime (static)
---   n   java.lang.System::arraycopy (static)
  1%      ThreadLatencyTest::doTest @ 31 (182 bytes)
  1%  made not entrant  ThreadLatencyTest::doTest @ -2 (182 bytes)
  2%      java.util.Arrays::sort1 @ 223 (396 bytes)
  1       java.util.Arrays::swap (15 bytes)
  3%      java.util.Arrays::sort1 @ 181 (396 bytes)
  4%      java.util.Arrays::vecswap @ 3 (28 bytes)
  2       java.util.Arrays::sort1 (396 bytes)
  3       java.util.Arrays::vecswap (28 bytes)
NO_WAIT: Copy memory latency 1/50/99%tile 3.2/5.0/7.8 us
  4       ThreadLatencyTest::doTest (182 bytes)
  5%      ThreadLatencyTest::doTest @ 31 (182 bytes)
---   n   java.lang.Thread::yield (static)
YIELD: Copy memory latency 1/50/99%tile 3.1/4.8/6.7 us
NO_WAIT: Copy memory latency 1/50/99%tile 3.1/4.5/7.3 us
  6%      ThreadLatencyTest$Pauser$3::pause @ 4 (18 bytes)
  5       ThreadLatencyTest$Pauser$3::pause (18 bytes)
BUSY_WAIT_10: Copy memory latency 1/50/99%tile 5.2/8.0/14.5 us
  6       ThreadLatencyTest$Pauser$4::pause (18 bytes)
  7%      ThreadLatencyTest$Pauser$4::pause @ 4 (18 bytes)
BUSY_WAIT_3: Copy memory latency 1/50/99%tile 4.7/7.1/9.8 us
  7       ThreadLatencyTest$Pauser$5::pause (18 bytes)
  5%  made not entrant  ThreadLatencyTest::doTest @ -2 (182 bytes)
  8%      ThreadLatencyTest::doTest @ 31 (182 bytes)
BUSY_WAIT_1: Copy memory latency 1/50/99%tile 4.9/7.0/9.6 us
  4   made not entrant  ThreadLatencyTest::doTest (182 bytes)
SLEEP_10: Copy memory latency 1/50/99%tile 8.9/10.2/12.7 us
SLEEP_3: Copy memory latency 1/50/99%tile 8.0/10.2/19.6 us
SLEEP_1: Copy memory latency 1/50/99%tile 5.0/7.6/13.8 us

JD Evora replied on Thu, 2011/08/04 - 12:10pm



wasn't ment for avoid those variable values?



Mike P(Okidoky) replied on Thu, 2011/08/04 - 2:42pm

This means that all the observations and assumptions ("The problem Java faces") should be reconsidered.
