Experienced Java developer focused on performance problems, JVM tuning and algorithm efficiency optimization. Recently devoted to the low-level aspects of Java application performance and the newest trends in concurrency and non-blocking algorithms.

Abstraction slows you down

10.04.2011

Update

As there have been several comments about my microbenchmark, I feel obliged to give a common reply.

Thank you for pointing out the flaws in my test implementation. This time I used the snippet from Maciej Gorączka to illustrate the pitfalls of optimization and, at the same time, to show the real impact of interface calls. Of course it had to be modified, as it also suffered from those optimization pitfalls.

There are several things in the area of method-call optimization one has to keep in mind when measuring performance. Some of them are described in the performance techniques chapter of the HotSpot internals documentation.
With my original benchmark I did not take the following into account:

Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).

So there's a lot of truth in saying microbenchmarks can be really tricky.

As an example of how forgetting about the optimizer may mess up your tests, consider the code below:

import java.math.BigDecimal;
import java.util.Random;

public class OptimizationTest
{
    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 1;
    // Two different implementations behind the same interface, picked at random,
    // so the interface call site cannot be proven monomorphic.
    private static final SomeInterface[] IFACES = new SomeInterface[] {new Impl1(), new Impl2()};
    // Two instances of the same concrete class for the invokevirtual case.
    private static final Impl1[] IMPLS = new Impl1[] {new Impl1(), new Impl1()};
    private static final Random RANDOM = new Random();
    private static final int CLASS_COUNT = 2;

    public static void main(String[] args)
    {
        performTestIface(2); //warm up

        final long implTime = performTestImpl(RUNS);
        final BigDecimal runs = new BigDecimal(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(runs);
        System.out.println("Invokevirtual: " + implAvg + " ms");

        final long ifaceTime = performTestIface(RUNS);
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(runs);
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static long performTestIface(final long runs)
    {
        final SomeInterface iface = IFACES[RANDOM.nextInt(CLASS_COUNT)];
        long ifaceTime = 0;

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
        return ifaceTime;
    }

    private static long performTestImpl(final long runs)
    {
        final Impl1 impl = IMPLS[RANDOM.nextInt(CLASS_COUNT)];
        long implTime = 0;

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;
        }
        return implTime;
    }


    static interface SomeInterface
    {
        void doSomething(int i);
    }

    static class Impl1 implements SomeInterface
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i; // the assignment keeps the call from being optimized away
        }
    }

    static class Impl2 implements SomeInterface
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i;
        }
    }


}

Please make sure to run it 10+ times. I hope you'll have fun with the results.
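To reproduce the compilation logs discussed below, compile the class and run it with the HotSpot -XX:+PrintCompilation flag. A minimal sketch (the file and class names simply follow the listing above; the log excerpts below were captured from a differently named copy of the same test):

    javac OptimizationTest.java
    java -XX:+PrintCompilation OptimizationTest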

Now, the explanation.
When run with the -XX:+PrintCompilation flag, the test outputs either:

    142   9       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
    142   1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
   1701   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 745 ms
   2447  10       test.InvokeInterfaceTest2::performTestIface (77 bytes)
Invokeinterface: 722 ms

 or:

      65   3       test.InvokeInterfaceTest2$Impl2::doSomething (6 bytes)
     66   1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
   1523   4       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
   1523   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 732 ms
   2255   5       test.InvokeInterfaceTest2::performTestIface (77 bytes)
   2262   1%     made not entrant  test.InvokeInterfaceTest2::performTestIface @ -2 (77 bytes)
   2268   3%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
Invokeinterface: 1816 ms

Note that HotSpot reports the method as made not entrant, which means the previously compiled version of the method is no longer valid. This happens because we are using two different implementations of the same interface, just as in the faster example, but this time the class implementing the interface actually changed at the call site, so the target method had to be deoptimized.
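A minimal sketch of the same effect, separate from the benchmark (assume it is dropped into the OptimizationTest class above in place of its main method; callInLoop is a made-up name used only for illustration): a call site first profiled with one receiver type gets compiled optimistically, and the compiled code typically has to be thrown away (made not entrant) once a second implementation shows up at that site.

static void callInLoop(final SomeInterface target)
{
    for (int i = 0; i < 1000000; i++)
    {
        target.doSomething(i); // monomorphic as long as only Impl1 has been seen here
    }
}

public static void main(String[] args)
{
    callInLoop(new Impl1()); // profile and compile assuming the receiver is always Impl1
    callInLoop(new Impl2()); // a new receiver type invalidates that assumption -> deoptimization
}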

It becomes even clearer when we look at how the optimization has been implemented by the compiler. In the fast run we get the following native code for performTestIface:

126   B16: #    N1 <- B2  Freq: 1.01326e-06
126       movl    RSI, #-28    # int
12b       movq    RBP, [rsp + #0]    # spill
12f       movl    [rsp + #4], RAX    # spill
133       call,static  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
        # OptimizationTest::performTestIface @ bci:10  L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_ L[8]=_ STK[0]=rsp + #8 STK[1]=rsp + #4
        # OopMap{[8]=NarrowOop off=312}
138       int3    # ShouldNotReachHere

The code gets recompiled once the number of taken uncommon traps reaches PerMethodTrapLimit or PerBytecodeRecompilationCutoff. So for the slower run the code gets recompiled to:

14f   B16: #    N250 <- B9  Freq: 0.25578
14f       movl    RSI, #-58    # int
154       movq    [rsp + #16], RBX    # spill
159       movq    [rsp + #32], R14    # spill
15e       movq    [rsp + #40], R13    # spill
163       call,static  wrapper for: uncommon_trap(reason='bimorphic' action='maybe_recompile')
        # OptimizationTest::performTestIface @ bci:49  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #40 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP STK[0]=rsp + #40 STK[1]=RBP
        # OopMap{[40]=Oop off=360}
168       int3    # ShouldNotReachHere
168
16d   B17: #    B3 B18 <- B2  Freq: 0.16983
16d       decode_heap_oop_not_null RSI,R10
171       movq    rdi, [RSI + (sizeof(oopDesc) + Klass::secondary_supers_offset_in_bytes())]
    movl    rcx, [rdi + arrayOopDesc::length_offset_in_bytes()]    # length to scan
    addq    rdi, arrayOopDesc::base_offset_in_bytes(T_OBJECT)    # Skip to start of data; set NZ in case count is zero
    repne   scasq    # Scan *rdi++ for a match with rax while cx-- != 0
    jne,s   miss        # Missed: flags nz
    movq    [RSI + (sizeof(oopDesc) + Klass::secondary_super_cache_offset_in_bytes())], RAX    # Hit: update cache
    miss:   
2a7       je     B3  P=0.999999 C=-1.000000

B18: #    N250 <- B17  Freq: 1.6983e-07
2ad       movl    RSI, #-83    # int
2b2       movq    [rsp + #8], R13    # spill
2b7       movq    [rsp + #16], RBX    # spill
2bc       movq    [rsp + #32], R14    # spill
2c1       nop     # 2 bytes pad for loops and calls
2c3       call,static  wrapper for: uncommon_trap(reason='unreached' action='reinterpret')
        # OptimizationTest::performTestIface @ bci:36  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #8 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP
        # OopMap{[8]=Oop off=712}
2c8       int3    # ShouldNotReachHere

In high-performance applications, interfaces backing patterns like observer or listener are often called in a loop with the implementations changing frequently. That is why I believe this problem may have practical implications.
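As a rough illustration (the listener names below are made up, not taken from any real system), a dispatch loop like this has a single call site that sees several receiver types, so it cannot stay monomorphic once more than a couple of listener implementations are registered:

import java.util.List;

interface PriceListener
{
    void onPrice(int price);
}

class LoggingListener implements PriceListener
{
    public void onPrice(final int price) { /* log it */ }
}

class AlertingListener implements PriceListener
{
    public void onPrice(final int price) { /* raise an alert */ }
}

class Dispatcher
{
    // One call site, many implementations: the bi-/megamorphic situation described above.
    void dispatch(final List<PriceListener> listeners, final int price)
    {
        for (PriceListener listener : listeners)
        {
            listener.onPrice(price);
        }
    }
}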

Thank you, gents, for the remarks. I obviously failed to supply a flawless test proving my arguments the first time.
I hope you'll find this one solid and entertaining.

End of update

 

I think we all like to write nice code; we very often employ ideas such as composition, separation of concerns and so on. Above all, we quite correctly tend to introduce various levels of abstraction by using interfaces and abstract classes. This approach is perfectly valid for the vast majority of scenarios, but it may cause performance problems when we're dealing with low-latency requirements.

Every time we introduce another level of abstraction, we also add more complexity to the procedure by which the JVM resolves the target method to be called.
Methods in Java are not invoked directly, due to the polymorphic nature of the language. For instance:

final Set set = getImpl(); // returns different implementations  
set.add(new Object());

will have different implications than:

final HashSet set = getImpl(); // returns different implementations 
set.add(new Object());

The former would get translated into:

    ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEINTERFACE java/util/Set.add (Ljava/lang/Object;)Z
    ...

While the latter would look more like:

    ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEVIRTUAL java/util/HashSet.add (Ljava/lang/Object;)Z
    ...

The difference in the JVM's behaviour for these two cases may not be obvious even after a thorough reading of the instruction specifications. The procedure for looking up a particular method in the virtual method table is almost identical, but the open-ended semantics of INVOKEINTERFACE render some optimizations impossible.
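As a side note, bytecode listings like the two above can be reproduced with the javap disassembler shipped with the JDK (the class name below is just a placeholder for whatever class contains the calls):

    javap -c SomeClassUsingSet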

Let's consider the examples below. In the first case:

class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent {
    public void method2() { } // overridden method
    public void method3() { }
}

This setup will result in a virtual method table that looks something like this:

Parent
  1  Parent.method1
  2  Parent.method2

Child
  1  Parent.method1
  2  Child.method2
  3  Child.method3

Here we can observe how the virtual method table of the Child class preserves the method order of its parent's method table. It only overrides the references to the overridden methods, thus enabling monomorphic calls and optimizations that hard-link to the target method and completely eliminate the need for a method table lookup on every call. This in turn makes method inlining possible, which can lead to a nice performance boost.
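If you want to see which of these calls HotSpot actually inlines, you can ask the JIT to log its inlining decisions together with the compilation events. Treat the exact flags as an assumption; they are diagnostic options and are not available on every HotSpot build:

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation OptimizationTest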

Now, let's look at the case for interfaces:

interface SomeInterface {
    void someMethod();
}

class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent implements SomeInterface {
    public void method2() { } // overridden method from Parent
    public void someMethod() { }
}

class OtherClass implements SomeInterface {
    public void method3() { }
    public void someMethod() { }
}

The virtual method tables for the above would resemble this:

Parent
  1  Parent.method1
  2  Parent.method2

Child
  1  Parent.method1
  2  Child.method2
  3  SomeInterface.someMethod

OtherClass
  1  OtherClass.method3
  2  SomeInterface.someMethod

As we can see, no consistent index is preserved for the interface method someMethod(). For the Child class it is method #3, but for OtherClass the entry for this method is found under #2.

The consequence of introducing such megamorphic calls is the additional cost of a virtual method table lookup on every call. To illustrate the impact of megamorphic calls we can run a simple microbenchmark:

Edit: the below code is obviously flawed (see the comments). Use the one from the top of the article.

import java.math.BigDecimal;

public class InvokeInterfaceTest
{
    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 10;
    private static long implTime;
    private static long ifaceTime;

    public static void main(String[] args)
    {
        performTest(2); //warm up
        ifaceTime = implTime = 0;
        performTest(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
        System.out.println("Invokevirtual: " + implAvg + " ms");
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static void performTest(final long runs)
    {
        final FooImpl impl = new FooImpl();
        final Foo iface = new FooImpl();

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;

            System.gc();
            start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
    }

    private static interface Foo
    {
        void doSomething(int i);
    }

    private static class FooImpl implements Foo
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i; //avoid optimization
        }
    }
}

On my laptop with an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, using Java 1.6.0_26-b03 for 64-bit Linux, this gives the following results:

Invokevirtual: 735.4 ms
Invokeinterface: 1412.4 ms

In order to minimize the influence of other processes running on my system, I pinned the test thread to a single core, rendering that core unavailable to other processes and achieving more stable results (as described in my previous post about Java threads).
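One common way to achieve this on Linux is to pin the whole JVM process to a single core with taskset (this is only a sketch of the approach, not the exact command used for the measurements above):

    taskset -c 3 java InvokeInterfaceTest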

So, if you have to deal with a huge number of invocations and latency is a crucial factor, then you would probably have to consider eliminating megamorphic calls at the cost of a less abstract, dodgy design.
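As a rough sketch of what eliminating megamorphic calls can mean in practice (this is only one of several possible techniques, reusing the types from the updated benchmark at the top of the article), a hot call site can be peeled so that the common implementations are dispatched through monomorphic, inlinable call sites:

static void handle(final SomeInterface handler, final int value)
{
    // Peel the two hot implementations into their own (monomorphic) call sites;
    // everything else goes through the generic, potentially megamorphic path.
    if (handler instanceof Impl1)
    {
        ((Impl1) handler).doSomething(value);
    }
    else if (handler instanceof Impl2)
    {
        ((Impl2) handler).doSomething(value);
    }
    else
    {
        handler.doSomething(value);
    }
}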

Published at DZone with permission of its author, Wojciech Kudla.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Maciej Podolak replied on Tue, 2011/10/04 - 3:31pm

Cool benchmark.

When would you reduce the use of interfaces / abstract classes, and how would you judge the trade-off between performance and testability?

Maciej Gorączka replied on Tue, 2011/10/04 - 3:38pm

The assumption is OK. The test is wrong. Please run the test below. The results are far from what you would expect. Invokevirtual (a real, existing non-interface implementation) is just 1.3% faster, which is not that great when you take into consideration what you lose by not using interfaces.

My machine: AMD Phenom II X6 1090T overclocked to 4GHz

Results:

Invokevirtual: 545.8 ms
Invokeinterface: 552.7 ms

import java.math.BigDecimal;

public class InvokeInterfaceTest {
  private static final long ITERATIONS = 1000 * 1000 * 1000;
  private static final int RUNS = 10;

  public static void main(String[] args) {

    performTestIface(2); //warm up

    long implTime = performTestImpl(RUNS);
    final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
    System.out.println("Invokevirtual: " + implAvg + " ms");

    long ifaceTime = performTestIface(RUNS);
    final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
    System.out.println("Invokeinterface: " + ifaceAvg + " ms");
  }

  private static long performTestIface(final long runs) {

    final Foo iface = new FooImpl();
    long ifaceTime = 0;

    for (int run = 0; run < runs; run++) {

      System.gc();
      long start = System.currentTimeMillis();
      for (int i = 0; i < ITERATIONS; i++) {

        iface.doSomething(i);
      }

      ifaceTime += System.currentTimeMillis() - start;
    }

    return ifaceTime;
  }

  private static long performTestImpl(final long runs) {

    final FooImpl impl = new FooImpl();
    long implTime = 0;

    for (int run = 0; run < runs; run++) {

      System.gc();
      long start = System.currentTimeMillis();
      for (int i = 0; i < ITERATIONS; i++) {

        impl.doSomething(i);
      }

      implTime += System.currentTimeMillis() - start;
    }

    return implTime;
  }

  private static interface Foo {

    void doSomething(int i);
  }

  private static class FooImpl implements Foo {

    private int i;

    public void doSomething(final int i) {

      this.i = i; //avoid optimization
    }
  }
}

PS The above is still wrong. Please play with it and switch the invocation of performTestIface and performTestImpl methods. Next time interfaces are gonna be faster!

PPS And still, it would make for a great discussion why and what exactly happened here that made the results change so much.

PPPS Please visit THIS_SITE and THIS_SITE_2.  I will look into it tomorrow as soon as I get some good night sleep :D

 

Cheers Mate

Oskar Kapala replied on Wed, 2011/10/05 - 2:00am

Well, it's quite intuitive, but good to see evidence :) But abstraction speeds up the development process.

Jose Smith replied on Wed, 2011/10/05 - 2:15pm

As a previous poster said, the test is flawed. Actually, if you simply flip which one is executed first (interface vs. concrete), you will get the exact opposite results:
		for (int run = 0; run < runs; run++) {
			System.gc();
			long start = System.currentTimeMillis();
			for (int i = 0; i < ITERATIONS; i++) {
				iface.doSomething(i);
			}
			ifaceTime += System.currentTimeMillis() - start;
			
			System.gc();
			start = System.currentTimeMillis();
			for (int i = 0; i < ITERATIONS; i++) {
				impl.doSomething(i);
			}
			implTime += System.currentTimeMillis() - start;
		}


In this order of execution, I consistently get nearly opposite results:
Invokevirtual: 1362.9 ms
Invokeinterface: 671.4 ms

I suppose that's why they say to be wary of microbenchmarks.

Balázs Bessenyei replied on Wed, 2011/10/05 - 4:14pm

The JVM JIT analyzes the running code and makes the necessary optimizations. There is a default threshold, which can be overridden, that determines how many similar method invocations are needed for the optimization to kick in.

The original test way oversteps this limit, which forces the JVM to do the optimization. Then it executes the same code with the previous optimization in place, but with a different execution path. By the time the JVM discovers this and "corrects" itself, the test is already lost.

That is the reason why separating the two invocation types into separate methods equalizes the results, and why swapping the order of their execution yields a contradictory result.

It is extremely hard, or even impossible, to correctly microbenchmark a language with a strong JIT. You would get the same result with any language that has a decent JIT.

Also, according to Eclipse, the variable "i" is never read, so it can be optimized away, and the compiler will do exactly that.

Fabrizio Giudici replied on Wed, 2011/10/05 - 4:44pm

 

I'm definitely not an expert in this type of optimization, but clearly this is proof of how easy it is to misinterpret a microbenchmark. There are good explanations in the previous comments - I'd add that it would make more sense either to run the two tests in two different runs (I mean, restarting main()), or to start measuring time in the middle of the loop, once a good number of iterations has been executed and the optimization has "warmed up". But I can't stop myself from thinking that such microbenchmarks are pointless. In a real application a lot of complex interactions with the JIT optimizer occur. I think it only makes sense to benchmark the real thing.

Wojciech Kudla replied on Wed, 2011/10/05 - 9:01pm in response to: Fabrizio Giudici

Fabrizio, I hope I managed to benchmark the real thing with my updated test :)

Wojciech Kudla replied on Wed, 2011/10/05 - 9:04pm in response to: Balázs Bessenyei

The point of the assignment in the doSomething method is to avoid optimization. Try leaving the body empty, and you'll definitely spot the difference.

Wojciech Kudla replied on Wed, 2011/10/05 - 9:07pm in response to: Maciej Gorączka

Maciej, thank you for pointing out my flawed implementation. I believe the updated test should clear up any doubts. It still shows that interface calls are slower than virtual ones. I hope that's not disappointing :)

Wojciech Kudla replied on Thu, 2011/10/06 - 3:38am in response to: Maciej Podolak

Maciej, I think it's very hard to judge this kind of trade-off, because each case is individual and has its own specifics. If your main concern is latency and you deal with a huge number of calls, then this is probably the use case for such optimizations. In terms of testability, I think that now that we are able to mock concrete classes and build spy objects (PowerMock, Mockito, etc.) there is almost no additional cost involved. It's just your design that hurts :)

Jakub Bielecki replied on Thu, 2011/10/06 - 9:03am

Good to see some are still up to the hardcore-supergeek challenge. Flawed code shown - fuel for thought ;-) Thx!

Mladen Girazovski replied on Thu, 2011/10/06 - 3:21pm

Hi Wojciech,

I don't mean to offend, but I really disagree with the title and your proposed results.

 That is why I believe this problem may have practical implications.

So, If you have to deal with huge number of invocations and latency is crucial factor, then you would probably have to consider eliminating megamorphic calls at the cost of less abstract, dodgy design. 

 If your main concern is latency and you deal with huge number of calls, than this is probably the use case for such optimizations.

May, probably... sorry, all you have proven is that microbenchmarks are useless and most of the time even misleading.

It's always easier to improve the performance if the code is well designed.

I cannot think of a use case where the duration of method calls depends on whether they go through interfaces or are concrete methods on concrete classes; the processing inside the methods takes longer in real systems (ignoring remote calls and the like). I'm not saying that it isn't possible, I'm just saying that I've never come across such a scenario and cannot imagine one.

Let's recap some of the fundamentals of performance optimisation:

If you think about optimization before there is a need (i.e. before you notice a performance problem), it's premature optimization, and that is the root of all evil, as Donald Knuth said back in the '70s.

If you optimize, you need to prove that your optimization has actually improved performance, measurably! This is done best with a profiler. Remember the 80/20 rule: if you find the few bottlenecks (i.e. by using a profiler) and improve them, you do not have to worry about the other 80% of your code.

If you sacrifice design right from the start and just think about performance, you're not optimizing or writing performant code, you're just writing bad code ;)

I'm not saying that replacing polymorphic calls with direct ones shouldn't be considered; I'm saying that if you find out by measuring that your polymorphic calls are the bottleneck, change them - but I can hardly think of such a scenario in the real world.

So, the title "abstraction slows you down" is wrong imo, but catchy ;)

 

Wojciech Kudla replied on Sat, 2011/10/08 - 4:05am in response to: Mladen Girazovski

Mladen,

thank you for your comment. If words such as "may" or "probably" discourage you from familiarizing yourself with the contents of this article, then feel free to ignore them. I think it's a matter of writing style rather than anything else; I hope you'll agree.

  sorry, all you have proven is that microbenchmarks are useless and most of the time even misleading

I definitely failed to provide a flawless test in this article's original content, but I openly admitted that in the update. I also provided a fixed approach along with an exhaustive explanation of why this phenomenon occurred with the first microbenchmark.
Having said that, I can't resist the impression that you did not bother to read the updated text or at least run the provided test. If you had, you would either support my statement about slow interface invocation or provide some constructive feedback about any possible flaws in this test that could distort the final result.

I cannot think of a use case where the duration of calls to methods depend on the fact if they are  interfaces or concrete methods in conrete classes, the processing inside the methods takes longer in real systems (ignoreing remote calls and similar). I'm not saying that it isn't possible, i'm just saying that i've never came accross such a scenario and cannot imagine one

I managed to provide a real-life example (with listeners and observers). If you're looking for a broader scope of usages, that's HPC and HFT. As an example, take the quite recent announcement from the LMAX guys about releasing Disruptor 2.0: one of the improvements they introduced with v2.0 was reducing megamorphic calls because of their cost. To be completely clear - discouraging proper design with interfaces is the last thing I would suggest. All I intended to illustrate is the potential performance reduction in solutions that heavily rely on frequent interface calls.

I hope that clears up the subject. Thank you again for your input.

Bogdan Kwiatkowski replied on Mon, 2011/10/10 - 3:16am in response to: Wojciech Kudla

Impressive. Some know that interfaces affect performance, but hardly anybody knows how. A point to you for clarifying this.
