Abstraction slows you down
Update
As there have been several comments about my microbenchmark, I feel in a position to give a common reply.
Thank you for pointing out flaws in my test implementation. This time I used snippet from Maciej Gorączka to illustrate the pitfalls of optimization and at the same time to show real impact of interface calls. Of course it had to be modified as it also suffered from the optimization pitfalls.
There are several things in the area of method call optimizations one have to keep in mind when measuring performance. Some of them have have been described in performance techniques chapter of HotSpot internals.
With my original benchmark I did not take into account the below.
Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).
So there's a lot of truth in saying microbenchmarks can be reeealy tricky.
As an example of how forgeting about optimizer may mess up your tests, the below code:
import java.math.BigDecimal;
import java.util.Random;
public class OptimizationTest
{
private static final long ITERATIONS = 1000 * 1000 * 1000;
private static final int RUNS = 1;
private static final SomeInterface[] IFACES = new SomeInterface[] {new Impl1(), new Impl2()};
private static final Impl1[] IMPLS = new Impl1[] {new Impl1(), new Impl1()};
private static final Random RANDOM = new Random();
private static final int CLASS_COUNT = 2;
public static void main(String[] args)
{
performTestIface(2); //warm up
final long implTime = performTestImpl(RUNS);
final BigDecimal runs = new BigDecimal(RUNS);
final BigDecimal implAvg = new BigDecimal(implTime).divide(runs);
System.out.println("Invokevirtual: " + implAvg + " ms");
final long ifaceTime = performTestIface(RUNS);
final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(runs);
System.out.println("Invokeinterface: " + ifaceAvg + " ms");
}
private static long performTestIface(final long runs)
{
final SomeInterface iface = IFACES[RANDOM.nextInt(CLASS_COUNT)];
long ifaceTime = 0;
for (int run = 0; run < runs; run++)
{
System.gc();
long start = System.currentTimeMillis();
for (int i = 0; i < ITERATIONS; i++)
{
iface.doSomething(i);
}
ifaceTime += System.currentTimeMillis() - start;
}
return ifaceTime;
}
private static long performTestImpl(final long runs)
{
final Impl1 impl = IMPLS[RANDOM.nextInt(CLASS_COUNT)];
long implTime = 0;
for (int run = 0; run < runs; run++)
{
System.gc();
long start = System.currentTimeMillis();
for (int i = 0; i < ITERATIONS; i++)
{
impl.doSomething(i);
}
implTime += System.currentTimeMillis() - start;
}
return implTime;
}
static interface SomeInterface
{
void doSomething(int i);
}
static class Impl1 implements SomeInterface
{
private int i;
public void doSomething(final int i)
{
this.i = i;
}
}
static class Impl2 implements SomeInterface
{
private int i;
public void doSomething(final int i)
{
this.i = i;
}
}
}Please make sure to run it 10+ times. I hope you'll have fun with the results.
Now, the explanation.
When run with -XX:+PrintCompilation flag, the test either outputs:
142 9 test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
142 1% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
1701 2% test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 745 ms
2447 10 test.InvokeInterfaceTest2::performTestIface (77 bytes)
Invokeinterface: 722 ms
or:
65 3 test.InvokeInterfaceTest2$Impl2::doSomething (6 bytes)
66 1% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
1523 4 test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
1523 2% test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 732 ms
2255 5 test.InvokeInterfaceTest2::performTestIface (77 bytes)
2262 1% made not entrant test.InvokeInterfaceTest2::performTestIface @ -2 (77 bytes)
2268 3% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
Invokeinterface: 1816 ms
Note that HotSpot complains about method made not entrant, which means that previously compiled method is not valid anymore. This happens because we are using 2 different implementations of the same interface like in the faster example, but this time class implementing the interface changed, so target method had to be deoptimized.
It becomes even more clear when we look at how the optimization have been implemented by the compiler. In the fast run we get native code for performTestIface:
126 B16: # N1 <- B2 Freq: 1.01326e-06
126 movl RSI, #-28 # int
12b movq RBP, [rsp + #0] # spill
12f movl [rsp + #4], RAX # spill
133 call,static wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
# OptimizationTest::performTestIface @ bci:10 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_ L[8]=_ STK[0]=rsp + #8 STK[1]=rsp + #4
# OopMap{[8]=NarrowOop off=312}
138 int3 # ShouldNotReachHere
The code gets recompiled after some number of subsequent monomorphic calls reach PerMethodTrapLimit or PerBytecodeRecompilationCutoff. So for the slower run the code gets recompiled to:
14f B16: # N250 <- B9 Freq: 0.25578
14f movl RSI, #-58 # int
154 movq [rsp + #16], RBX # spill
159 movq [rsp + #32], R14 # spill
15e movq [rsp + #40], R13 # spill
163 call,static wrapper for: uncommon_trap(reason='bimorphic' action='maybe_recompile')
# OptimizationTest::performTestIface @ bci:49 L[0]=rsp + #0 L[1]=_ L[2]=rsp + #40 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP STK[0]=rsp + #40 STK[1]=RBP
# OopMap{[40]=Oop off=360}
168 int3 # ShouldNotReachHere
168
16d B17: # B3 B18 <- B2 Freq: 0.16983
16d decode_heap_oop_not_null RSI,R10
171 movq rdi, [RSI + (sizeof(oopDesc) + Klass::secondary_supers_offset_in_bytes())]
movl rcx, [rdi + arrayOopDesc::length_offset_in_bytes()] # length to scan
addq rdi, arrayOopDex::base_offset_in_bytes(T_OBJECT) # Skip to start of data; set NZ in case count is zero
repne scasq # Scan *rdi++ for a match with rax while cx-- != 0
jne,s miss # Missed: flags nz
movq [RSI + (sizeof(oopDesc) + Klass::secondary_super_cache_offset_in_bytes())], RAX # Hit: update cache
miss:
2a7 je B3 P=0.999999 C=-1.000000B18: # N250 <- B17 Freq: 1.6983e-07
2ad movl RSI, #-83 # int
2b2 movq [rsp + #8], R13 # spill
2b7 movq [rsp + #16], RBX # spill
2bc movq [rsp + #32], R14 # spill
2c1 nop # 2 bytes pad for loops and calls
2c3 call,static wrapper for: uncommon_trap(reason='unreached' action='reinterpret')
# OptimizationTest::performTestIface @ bci:36 L[0]=rsp + #0 L[1]=_ L[2]=rsp + #8 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP
# OopMap{[8]=Oop off=712}
2c8 int3 # ShouldNotReachHere
Usually in high performance applications interfaces reflecting patterns like observer or listener are called in a loop with implementations changing frequently. That is why I believe this problem may have practical implications.
Thank you gents for the remarks. I obviously failed to supply flawless test proving my arguments the first time.
Hope you'll find this one solid and entertaining.
End of update
I think we all like to write nice code; we very often employ ideas such as composition, separation of concerns and so on. Above all we quite correctly tend to introduce various levels of abstraction by using interfaces and abstract classes. This approach is perfectly valid for outstanding number of scenarios, but may lead to performance deficiency when we're dealing with low latency requirements.
Every time we introduce another level of abstraction, at the same time we add more complexity to the procedure by which JVM resolves target method to be called.
Methods in Java are not invoked directly due to the polymorphic nature of the language. For instance:
final Set set = getImpl(); // returns different implementations set.add(new Object());
will have different implications than:
final HashSet set = getImpl(); // returns different implementations set.add(new Object());
The former would get translated into:
...
NEW java/lang/Object
DUP
INVOKESPECIAL java/lang/Object.<init> ()V
INVOKEINTERFACE java/util/Set.add (Ljava/lang/Object;)Z
...While the latter would look more like:
...
NEW java/lang/Object
DUP
INVOKESPECIAL java/lang/Object.<init> ()V
INVOKEVIRTUAL java/util/HashSet.add (Ljava/lang/Object;)Z
...The difference in JVM's behaviour for these two cases may not be so obvious even after thorough reading of the instructions specification. The procedure for looking up particular method in virtual method table is almost identical, but the open character of INVOKEINTERFACE semantics renders some optimizations impossible.
Let's consider the below examples. In the first case:
class Parent {
public void method1() { }
public void method2() { }
}
class Child extends Parent {
public void method2() { } // overriden method
public void method3() { }
}This setup will result in virtual method table that goes something like:
| Parent | |
| 1 | Parent.method1 |
| 2 | Parent.method2 |
| Child | |
| 1 | Parent.method1 |
| 2 | Child.method2 |
| 3 | Child.method3 |
Here we can observe how the virtual method table for Child class preserves the order of methods for its parent's class method table. It only overrides the reference/link to the overriden methods thus enabling monomorphic calls and optimizations that employ hard-linking to target methods and completely eliminate the need for method table lookup with each call. This in turn provides means for inlining methods, which can lead to a nice performance boost.
Now, let's look at the case for interfaces:
interface SomeInterface {
void someMethod();
}
class Parent {
public void method1() { }
public void method2() { }
}
class Child extends Parent implements SomeInterface {
public void method2() { } // overridden method from Parent
public void someMethod() { }
}
class OtherClass implements SomeInterface {
public void method3() { }
public void someMethod() { }
}Virtual method table for the above would resemble this:
| Parent | |
| 1 | Parent.method1 |
| 2 | Parent.method2 |
| Child | |
| 1 | Parent.method1 |
| 2 | Child.method2 |
| 3 | SomeInterface.someMethod |
| OtherClass | |
| 1 | OtherClass.method3 |
| 2 | SomeInterface.someMethod |
As we can see, there is no order preserved in terms of index of the interface method someMethod(). For Child class, this will be method #3, but for OtherClass entry for this method would be found under #2.
The consequence of introducing these megamorphic calls is the additional cost of virtual method table lookup with each method call. To illustrate the impact of megamorphic calls we can run a simple microbenchmark:
Edit: the below code is obviously flawed (see the comments). Use the one from the top of the article.
import java.math.BigDecimal;
public class InvokeInterfaceTest
{
private static final long ITERATIONS = 1000 * 1000 * 1000;
private static final int RUNS = 10;
private static long implTime;
private static long ifaceTime;
public static void main(String[] args)
{
performTest(2); //warm up
ifaceTime = implTime = 0;
performTest(RUNS);
final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
System.out.println("Invokevirtual: " + implAvg + " ms");
System.out.println("Invokeinterface: " + ifaceAvg + " ms");
}
private static void performTest(final long runs)
{
final FooImpl impl = new FooImpl();
final Foo iface = new FooImpl();
for (int run = 0; run < runs; run++)
{
System.gc();
long start = System.currentTimeMillis();
for (int i = 0; i < ITERATIONS; i++)
{
impl.doSomething(i);
}
implTime += System.currentTimeMillis() - start;
System.gc();
start = System.currentTimeMillis();
for (int i = 0; i < ITERATIONS; i++)
{
iface.doSomething(i);
}
ifaceTime += System.currentTimeMillis() - start;
}
}
private static interface Foo
{
void doSomething(int i);
}
private static class FooImpl implements Foo
{
private int i;
public void doSomething(final int i)
{
this.i = i; //avoid optimization
}
}
}
Which on my laptop with Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz using Java 1.6.0_26-b03 for 64-bit Linux gives the below results:
Invokevirtual: 735.4 ms
Invokeinterface: 1412.4 ms
In order to minimize the influence of other processeses running in my system, I pinned the test thread to a single core rendering it unavailable to other processes and achiving more stable results (as described in my previous post about java threads).
So, If you have to deal with huge number of invocations and latency is crucial factor, then you would probably have to consider eliminating megamorphic calls at the cost of less abstract, dodgy design.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





Comments
Maciej Podolak replied on Tue, 2011/10/04 - 3:31pm
Cool benchmark.
When would you reduce interfaces / abstract classes - how would you judge the trade-offs performance / testability?
Maciej Gorączka replied on Tue, 2011/10/04 - 3:38pm
Asumption is ok. The test is wrong. Please fire the test below. The results are far from what you would expect. Invoke virtual (real, existing non interface implementation) is just 1.3% faster. Which is not that great when you will take into consideration what you lose when not using interfaces.
My machine: AMD Phenom II X6 1090T overclocked to 4GHz
Results:
Invokevirtual: 545.8 ms
Invokeinterface: 552.7 ms
import java.math.BigDecimal; public class InvokeInterfaceTest { private static final long ITERATIONS = 1000 * 1000 * 1000; private static final int RUNS = 10; public static void main(String[] args) { performTestIface(2); //warm up long implTime = performTestImpl(RUNS); final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS)); System.out.println("Invokevirtual: " + implAvg + " ms"); long ifaceTime = performTestIface(RUNS); final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS)); System.out.println("Invokeinterface: " + ifaceAvg + " ms"); } private static long performTestIface(final long runs) { final Foo iface = new FooImpl(); long ifaceTime = 0; for (int run = 0; run < runs; run++) { System.gc(); long start = System.currentTimeMillis(); for (int i = 0; i < ITERATIONS; i++) { iface.doSomething(i); } ifaceTime += System.currentTimeMillis() - start; } return ifaceTime; } private static long performTestImpl(final long runs) { final FooImpl impl = new FooImpl(); long implTime = 0; for (int run = 0; run < runs; run++) { System.gc(); long start = System.currentTimeMillis(); for (int i = 0; i < ITERATIONS; i++) { impl.doSomething(i); } implTime += System.currentTimeMillis() - start; } return implTime; } private static interface Foo { void doSomething(int i); } private static class FooImpl implements Foo { private int i; public void doSomething(final int i) { this.i = i; //avoid optimization } } }PS The above is still wrong. Please play with it and switch the invocation of performTestIface and performTestImpl methods. Next time interfaces are gonna be faster!
PPS And still. It would be a great discussion why and what exactly happened here that results changed so much.
PPPS Please visit THIS_SITE and THIS_SITE_2. I will look into it tomorrow as soon as I get some good night sleep :D
Cheers Mate
Oskar Kapala replied on Wed, 2011/10/05 - 2:00am
Jose Smith replied on Wed, 2011/10/05 - 2:15pm
for (int run = 0; run < runs; run++) { System.gc(); long start = System.currentTimeMillis(); for (int i = 0; i < ITERATIONS; i++) { iface.doSomething(i); } ifaceTime += System.currentTimeMillis() - start; System.gc(); start = System.currentTimeMillis(); for (int i = 0; i < ITERATIONS; i++) { impl.doSomething(i); } implTime += System.currentTimeMillis() - start; }In this order of execution, I consistently get nearly opposite results:
I suppose that's why they say to be weary of microbenchmarks.
Balázs Bessenyei replied on Wed, 2011/10/05 - 4:14pm
The JVM JIT analyzes the running code and makes the necessary optimizations. There is a default value, which can be override, which determines how many similar method invocation is need for the optimization to kick in.
The original test, way oversteps this limit, which forces the JVM to do the optimization. Then executes the same code with a previous optimization, but with a different execution path. By the time the JVM discovers this and "corrects" itself then the test is already lost.
That is the reason that separating the 2 invocation types into separate methods equalizes it. Or swapping the order of their execution yields contradicting result.
It is extremely hard or even impossible to micro benchmark correctly a language with a strong JIT. You would get the same result with any language with a decent JIT.
Also according to Eclipse the the variable "i" is not used so it can be optimized away and the compiler will do that.
Fabrizio Giudici replied on Wed, 2011/10/05 - 4:44pm
I'm definitely not an expert of this type of optimizations, but clearly this is a proof of how easy is to misinterpret a microbenchmark. There are good explanations in the previous comments - I'd add that it would make more sense either to run the two tests in two different runs (I mean, restarting the main()), or start measuring time in the middle of the loop, when a good number of iterations has been executed, thus the optimization "warmed up". But I can't prevent myself from thinking that such microbenchmarks are pointless. In a real application a lot of complex interactions with the JIT optimizer occur. I think that it makes only sense to benchmark the real thing.
Wojciech Kudla replied on Wed, 2011/10/05 - 9:01pm
in response to:
Fabrizio Giudici
Wojciech Kudla replied on Wed, 2011/10/05 - 9:04pm
in response to:
Balázs Bessenyei
Wojciech Kudla replied on Wed, 2011/10/05 - 9:07pm
in response to:
Maciej Gorączka
Wojciech Kudla replied on Thu, 2011/10/06 - 3:38am
in response to:
Maciej Podolak
Jakub Bielecki replied on Thu, 2011/10/06 - 9:03am
Mladen Girazovski replied on Thu, 2011/10/06 - 3:21pm
Hi Wojciech,
i don't mean to offend, but i really disagree with the title and your proposed results.
May, probably... sorry, all you have proven is that microbenchmarks are useless and most of the time even misleading.
It's always easier to improve the performance if the code is well designed.I cannot think of a use case where the duration of calls to methods depend on the fact if they are interfaces or concrete methods in conrete classes, the processing inside the methods takes longer in real systems (ignoreing remote calls and similar). I'm not saying that it isn't possible, i'm just saying that i've never came accross such a scenario and cannot imagine one.
Lets recap some of the fundamentals about performance optimisation:
If you think about optimization before there is the need (i.e. before you notice a performance problem) it's premature optimization, and that s the root of all evil as Donald Knuth said it in the 70's.
If you optimize, you need to prove that your optimization has actually improved performance, measurably! This is done best with a profiler. Remember the 80/20 rule, if you find the few bottle necks (ie by using a profiler) and improve them, you do not have to worry about the other 80% of your code.
If you sacrifice design right from the start and just think about performance, you're not optimizing or writing performant code, you're just writing bad code ;)
I'm not saying that it shouldn't be considered to replace polymorphic calls with direct ones, i'm saying if you find out by measuring that your polymorphic calls are the bottleneck, change them, but i can hardly can think of such a scenario in the real world.
So, the title "abstractions slows you down" is worng imo, but catchy ;)
Wojciech Kudla replied on Sat, 2011/10/08 - 4:05am
in response to:
Mladen Girazovski
Mladen,
thank you for your comment. If words such as "may" or "probably" discourage your familiarizing with contents of this article, than feel free to ignore them. I think it's a matter of writing style rather than anything else, I hope you'll agree.
I definitely failed to provide flawless test in this article's original content, but openly admitted that in the update. I also provided fixed approach along with exhaustive explanation of why this phenomenon occured with the first microbenchmark.
Having said that, I can't resist an impression you did not bother to read the updated text or at least run the provided test. If you did, you would either support my statement about slow interface invocation or provide some constructive feedback about any possible flaws in this test that could distort the final result.
I managed to provide a real-life example (with listeners and observers). If you're looking for broader scope of usages, that's HPC and HFT. As an example - quite recent annoucement from LMAX guys about releasing Disruptor 2.0. One of the improvements they introduced with v 2.0 was reducing megamorphic calls due to their cost. To be completely clear - discouraging proper design with interfaces is the last thing I would suggest. All that I intented to illustrate is the potential performance reduction in solutions that havily employ frequent interface calls.
I hope that clears out the subject. Thank you again for your input.
Bogdan Kwiatkowski replied on Mon, 2011/10/10 - 3:16am
in response to:
Wojciech Kudla