
HashMap Performance Improvements in Java 8

04.23.2014

HashMap<K, V> is a fast, versatile and ubiquitous data structure in every Java program. First, some basics. As you probably know, it uses the hashCode() and equals() methods of keys to split values between buckets. The number of buckets (bins) should be slightly higher than the number of entries in a map, so that each bucket holds only few (preferably one) value. When looking up by key, we very quickly determine the bucket (using hashCode() modulo number_of_buckets) and our item is available in constant time.
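
As a rough sketch of that idea (simplified for illustration only; the method and parameter names are mine, and the real HashMap additionally spreads the hash bits and masks against a power-of-two table size rather than using a plain modulo):

static int bucketIndex(Object key, int numberOfBuckets) {
    // simplified illustration, not the actual JDK implementation
    int h = key.hashCode();
    // floorMod keeps the index non-negative even for negative hash codes
    return Math.floorMod(h, numberOfBuckets);
}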

This should already be familiar to you. You probably also know that hash collisions have a disastrous impact on HashMap performance. When multiple hashCode() values end up in the same bucket, values are placed in an ad-hoc linked list. In the worst case, when all keys map to the same bucket, the hash map degenerates into a linked list and lookup goes from O(1) to O(n). Let's first benchmark how HashMap behaves under normal circumstances in Java 7 (1.7.0_40) and Java 8 (1.8.0-b132). To have full control over hashCode() behaviour we define our custom Key class:

class Key implements Comparable<Key> {

    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}

The Key class is well-behaved: it overrides equals() and provides a decent hashCode(). To avoid excessive GC I cache immutable Key instances rather than creating them from scratch over and over:

public class Keys {

    public static final int MAX_KEY = 10_000_000;
    private static final Key[] KEYS_CACHE = new Key[MAX_KEY];

    static {
        for (int i = 0; i < MAX_KEY; ++i) {
            KEYS_CACHE[i] = new Key(i);
        }
    }

    public static Key of(int value) {
        return KEYS_CACHE[value];
    }
}

Now we are ready to experiment a little bit. Our benchmark will simply create HashMaps of different sizes (powers of 10, from 1 to 1 million) using a contiguous key space. In the benchmark itself we will look up values by key and measure how long that takes, depending on the HashMap size:

import java.util.HashMap;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class MapBenchmark extends SimpleBenchmark {

    private HashMap<Key, Integer> map;

    @Param
    private int mapSize;

    @Override
    protected void setUp() throws Exception {
        map = new HashMap<>(mapSize);
        for (int i = 0; i < mapSize; ++i) {
            map.put(Keys.of(i), i);
        }
    }

    public void timeMapGet(int reps) {
        for (int i = 0; i < reps; i++) {
            map.get(Keys.of(i % mapSize));
        }
    }
}
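
For reference, a benchmark like this is typically launched through the legacy Caliper 0.5 Runner that is already imported above; the launcher class below and the exact -DmapSize values mentioned in the comment are my own assumption, not taken from the original article:

import com.google.caliper.Runner;

public class MapBenchmarkRunner {
    public static void main(String[] args) throws Exception {
        // Delegates to the legacy Caliper 0.5 runner; parameter values such as
        // -DmapSize=1,10,100,1000,10000,100000,1000000 are assumed to be passed
        // on the command line.
        Runner.main(MapBenchmark.class, args);
    }
}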

The results confirm that HashMap.get() is indeed O(1):

Interestingly, Java 8 is on average 20% faster than Java 7 in a simple HashMap.get(). The overall performance is equally interesting: even with one million entries in a HashMap, a single lookup takes less than 10 nanoseconds, which means around 20 CPU cycles on my machine*. Pretty impressive! But that's not what we were about to benchmark.

Suppose that we have a very poor map key whose hashCode() always returns the same value. This is the worst case scenario, and it defeats the purpose of using a HashMap altogether:

class Key implements Comparable<Key> {

    //...

    @Override
    public int hashCode() {
        return 0;
    }
}

I used the exact same benchmark to see how it behaves for various map sizes (notice it's a log-log scale):

Results for Java 7 are to be expected. The cost of HashMap.get() grows proportionally to the size of the HashMap itself. Since all entries sit in the same bucket, in one huge linked list, looking one up requires traversing half of that list (of size n) on average. Hence the O(n) complexity visualized on the graph.

But Java 8 performs so much better! It's a log scale, so we are actually talking about several orders of magnitude. The same benchmark executed on JDK 8 yields O(log n) worst-case performance in case of catastrophic hash collisions, which is easier to see when JDK 8 is plotted alone on a log-linear scale:

What is the reason behind this tremendous performance improvement, even in terms of big-O notation? Well, the optimization is described in JEP 180. Basically, when a bucket becomes too big (currently: TREEIFY_THRESHOLD = 8), HashMap dynamically replaces it with an ad-hoc implementation of a tree map. This way, rather than the pessimistic O(n), we get a much better O(log n). How does it work? Well, previously entries with conflicting keys were simply appended to a linked list, which later had to be traversed. Now HashMap promotes the list into a binary tree, using the hash code as a branching variable. If two hashes are different but end up in the same bucket, one is considered bigger and goes to the right. If hashes are equal (as in our case), HashMap hopes that the keys are Comparable, so that it can establish some order. This is not a requirement for HashMap keys, but apparently a good practice. If keys are not comparable, don't expect any performance improvements in case of heavy hash collisions.
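
To see this effect in isolation, here is a minimal, self-contained sketch (the class names are mine, not from the article, and this is a crude timing, not a rigorous benchmark): both key types always return 0 from hashCode(), but only one of them implements Comparable. On JDK 8 the Comparable variant should be dramatically faster, because its collision bucket can be turned into a tree ordered by compareTo():

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {

    static final class ComparableKey implements Comparable<ComparableKey> {
        final int value;
        ComparableKey(int value) { this.value = value; }
        @Override public int hashCode() { return 0; }   // always collides
        @Override public boolean equals(Object o) {
            return o instanceof ComparableKey && ((ComparableKey) o).value == value;
        }
        @Override public int compareTo(ComparableKey o) {
            return Integer.compare(value, o.value);
        }
    }

    static final class PlainKey {                        // same collisions, no Comparable
        final int value;
        PlainKey(int value) { this.value = value; }
        @Override public int hashCode() { return 0; }
        @Override public boolean equals(Object o) {
            return o instanceof PlainKey && ((PlainKey) o).value == value;
        }
    }

    public static void main(String[] args) {
        final int size = 20_000;

        Map<ComparableKey, Integer> treeified = new HashMap<>();
        Map<PlainKey, Integer> listLike = new HashMap<>();
        for (int i = 0; i < size; i++) {
            treeified.put(new ComparableKey(i), i);
            listLike.put(new PlainKey(i), i);
        }

        long t1 = System.nanoTime();
        for (int i = 0; i < size; i++) treeified.get(new ComparableKey(i));
        long t2 = System.nanoTime();
        for (int i = 0; i < size; i++) listLike.get(new PlainKey(i));
        long t3 = System.nanoTime();

        // On JDK 8, expect the Comparable keys to be orders of magnitude faster here.
        System.out.printf("Comparable keys:     %d ms%n", (t2 - t1) / 1_000_000);
        System.out.printf("Non-comparable keys: %d ms%n", (t3 - t2) / 1_000_000);
    }
}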

Why is all of this so important? Malicious software, aware of the hashing algorithm we use, might craft a couple of thousand requests that result in massive hash collisions. Repeatedly accessing such keys will significantly degrade server performance, effectively resulting in a denial-of-service attack. In JDK 8, the amazing jump from O(n) to O(log n) prevents this attack vector and also makes performance a little more predictable. I hope this will finally convince your boss to upgrade.

*Benchmarks executed on an Intel Core i7-3635QM @ 2.4 GHz with 8 GiB of RAM and an SSD drive, running 64-bit Windows 8.1 with default JVM settings.

Published at DZone with permission of Tomasz Nurkiewicz, author and DZone MVB. (source)


Comments

Matthew Jaggard replied on Wed, 2014/04/30 - 8:53am

This doesn't make any sense. You return zero from your hashCode() so the tree structure would also be just a list, not a proper tree and you'd still have to do equals() on half of the objects on average.

So, given that the results don't follow your reasoning, what is the reason or did you just make up the results?!!

Arthur Walker replied on Wed, 2014/04/30 - 10:08am in response to: Matthew Jaggard

For the collision test, hashCode() is made to always return 0, but the values of the keys remain different, and that's what Comparable checks when ordering the Keys in the collision tree. Therefore the reasoning still stands: if the keys are comparable and distinct, the tree used for hashCode() collisions can deliver better performance.

Matthew Jaggard replied on Wed, 2014/04/30 - 10:46am in response to: Arthur Walker

Ah, I see. So HashMap now requires any key implementing Comparable to be compatible with hashCode and equals. That seems like something that should be documented quite explicitly somewhere prominent!!

Tomasz Nurkiewicz replied on Wed, 2014/04/30 - 3:11pm in response to: Matthew Jaggard

No, HashMap does not require keys to be Comparable. Otherwise it would have been a major backward incompatibility.

If keys aren't comparable and we get significant hash collisions, it will just behave as badly as it used to prior to Java 8 (or even worse). But if they are Comparable, we get the benefit of using trees rather than linked lists. So it's a good practice, not a requirement.

Narendra Shah replied on Sun, 2014/05/04 - 12:29am

Very nice comparison of the JDK 7 and JDK 8 HashMap.

Pierre-yves Saumont replied on Mon, 2014/05/05 - 3:26am

You write that "The number of buckets (bins) should be slightly higher than the number of entries in a map, so that each bucket holds only few (preferably one) value." There are two problems with this statement.

First, buckets do not hold values, but entries, which are key/value pairs (in the form of Map.Entry). If they stored only values, there would be no way to handle get() when it leads to a bucket containing several values.

Second, there is no relation between the number of buckets and the fact that only one "value" is stored in each bucket. There will be only one key/value pair in each bucket if all keys have different hash codes. The initial number of buckets is determined at creation time and it will increase if a given percentage of them holds at least one key/value pair. This percentage is the load factor.

One thing that is worth mentioning is that the collection views of the map have no specific order, but all versions of Java up to Java 7 give the same order for a non-rehashed map. Due to changes in the implementation, Java 8 gives a different order. So some programs that appeared to work in Java 7 while relying on map order will break in Java 8.


Matthew Jaggard replied on Mon, 2014/05/05 - 11:31am in response to: Tomasz Nurkiewicz

Tomasz, I didn't say that they had to be comparable. I said that the implementation of Comparable has to be compatible with equals/hashCode. So in previous code I've had entities where equals and hashCode are based on the id but Comparable is based on the value of an invoice (for example) - previously I could check whether the entity was a key in a HashMap without populating anything except the id. Now I'd have to load the whole entity.

Tomasz Nurkiewicz replied on Mon, 2014/05/05 - 12:33pm in response to: Pierre-yves Saumont

Pierre, everything you say is obviously right. It was a mental shortcut on my side, thanks for the clarification. With regard to key/value pair distribution - it's even worse: keys with different hash codes may still land in the same bucket, even if otherwise they could easily be distributed.


Also I agree with your statement about the order of entries in a map. But the unspecified order of HashMap is so strongly emphasized that I don't believe it will impact reasonably written software. On the other hand, I saw unit tests breaking after a migration from Java 6 to 7 because the order of entries in a HashMap changed - so it's not really the first time.

Tomasz Nurkiewicz replied on Mon, 2014/05/05 - 12:39pm in response to: Matthew Jaggard

Hi Matthew, good point, thanks for clearing this up. However, note that it is (and always was (?)) strongly recommended that Comparable be consistent with equals:

The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C

From: http://docs.oracle.com/javase/7/docs/api/java/lang/Comparable.html


The Comparator interface is less strict in that regard.
