Alex Staveley is a software professional passionate about software engineering and technical architecture. He blogs about architectural approaches, Java topics, web solutions and various technical bits and pieces.

Consistent Hashing

12.27.2011

Consistent Hashing is a clever algorithm that is used in high-volume caching architectures where scaling and availability are important. It is used in many high-end web architectures, for example Amazon's Dynamo. Let me try and explain it!


Firstly let's consider the problem.

Let's say your website sells books (sorry Amazon, but you're a brilliant example). Every book has an author, a price, details such as the number of pages, and an ISBN which acts as a primary key uniquely identifying each book. To improve the performance of your system you decide to cache the books. You split the cache over four servers. You have to decide which book to put on which server. You want to do this using a deterministic function so you can be sure where things are. You also want to do this at low computational cost (otherwise what's the point of caching?). So you hash the book's ISBN and then mod the result by the number of servers, which in our case is 4. Let's call this number the book's hash key.
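Just to make that concrete, here's a rough sketch of this naive scheme in Java. The Book and CacheServer types (and their get/put methods) are hypothetical stand-ins, not any particular caching library's API:

```java
import java.util.List;

// A minimal sketch of the naive "hash the ISBN, mod by the number of
// servers" scheme. Book and CacheServer are hypothetical types.
public class NaiveBookCache {

    private final List<CacheServer> servers; // e.g. our four cache servers

    public NaiveBookCache(List<CacheServer> servers) {
        this.servers = servers;
    }

    // Deterministic and cheap: hash the ISBN, strip the sign bit and mod
    // by the number of servers to get the book's hash key.
    private CacheServer serverFor(String isbn) {
        int hashKey = (isbn.hashCode() & 0x7fffffff) % servers.size();
        return servers.get(hashKey);
    }

    public Book get(String isbn) {
        return serverFor(isbn).get(isbn);
    }

    public void put(Book book) {
        serverFor(book.getIsbn()).put(book);
    }
}
```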

So let's say your books are:

  1. Toward the Light, A.C. Grayling (ISBN=0747592993)
  2. Aftershock, Philippe Legrain (ISBN=1408702231)
  3. The Outsider, Albert Camus (ISBN=9780141182506)
  4. The History of Western Philosophy, Bertrand Russell (ISBN=0415325056)
  5. The Life You Can Save, Peter Singer (ISBN=0330454587)
... etc

After hashing the ISBN and modding the result by 4, let's say the resulting hash keys are:
  1. Hash(Toward the Light) % 4 = 2. Hashkey 2 means this book will be cached by Server 2.
  2. Hash(Aftershock) % 4 = 1. Hashkey 1 means this book will be cached by Server 1.
  3. Hash(The Outsider) % 4 = 0. Hashkey 0 means this book will be cached by Server 4 (modding by 4 gives a result between 0 and 3, so we'll treat hash key 0 as belonging to the fourth server).
  4. Hash(The History of Western Philosophy) % 4 = 1. Hashkey 1 means this book will be cached by Server 1.
  5. Hash(The Life You Can Save) % 4 = 3. Hashkey 3 means this book will be cached by Server 3.
Oh wow, doesn't everything look so great? Any time we have a book's ISBN we can work out its hash key and know what server it's on! Isn't that just so amazing! Well, no. Your website has become so cool, more and more people are using it. Reading has become so cool, there are more books you need to cache. The only thing that hasn't become so cool is your system. Things are slowing down and you need to scale. Vertical scaling will only get you so far; you need to scale horizontally.

Ok, so you go out and you buy another 2 servers thinking this will solve your problem. You now have six servers. This is where you think the pain will end, but alas it won't. Because you now have 6 servers your algorithm changes. Instead of modding by 4 you mod by 6. What does this mean? Initially, when you look for a book, because you're modding by 6 you'll end up with a different hash key for it and hence a different server to look for it on. It won't be there and you'll have incurred a database read to bring it back into the cache. It's not just one book; it will be the case for the majority of your books. Why? Because the only time a book will be on the correct server and not need to be re-read from the database is when hash(isbn) % 4 = hash(isbn) % 6. Mathematically this will be the minority of your books. So, your attempt at scaling has put a burden on the majority of your cache to restructure itself, resulting in massive database re-reads. This can bring your system down. Customers won't be happy with you, sunshine!
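If you want to convince yourself of that, here's a quick throwaway check (not part of any real system): for uniformly distributed hash values, only about a third of keys map to the same server under both mod 4 and mod 6.

```java
// Rough sanity check: how many hash values land on the same server under
// "% 4" and "% 6"? For uniformly distributed hashes it's only about a third,
// so roughly two thirds of the books would need a database re-read.
public class ReshardCheck {
    public static void main(String[] args) {
        int same = 0;
        int total = 1_000_000;
        for (int h = 0; h < total; h++) {
            if (h % 4 == h % 6) {
                same++;
            }
        }
        System.out.printf("%.1f%% of keys unaffected%n", 100.0 * same / total);
    }
}
```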

We need a solution!

The solution is to come up with a system where, when you add more servers, only a small minority of books will move to new servers, meaning a minimum number of database reads. Let's go for it!

Consistent Hashing explained

Consistent hashing is an approach where the books get the same hash key irrespective of the number of books and irrespective of the number of servers - unlike our previous algorithm, which modded by the number of servers. It doesn't matter if there is one server, 5 servers or 5 million servers, the books always, always get the same hash key.

So how exactly do we generate consistent hash values for the books?

Simple. We use a similar approach to our initial approach, except we stop modding by the number of servers. Instead we mod by something else that is constant and independent of the number of servers.

Ok, so let's hash the ISBN as before and then mod by 100. So if you have 1,000 books, you end up with a distribution of hash keys for the books between 0 and 100, irrespective of the number of servers. All good. All we need is a way to figure out, deterministically and at low computational cost, which books reside on which servers. Otherwise, again, what would be the point in caching?

So here's the ultra funky part... You take something unique and constant for each server (for example its IP address) and you pass that through the exact same algorithm. This means you also end up with a hash key (in this case a number between 0 and 100) for each server.

Let's say:
  1. Server 1 gets: 12
  2. Server 2 gets: 37
  3. Server 3 gets: 54
  4. Server 4 gets: 87
Now we assign each server to be responsible for caching the books with hash keys between its own hash key and that of the next neighbour (next in the upward direction).
This means:
  1. Server 1 stores all the books with hash key between 12 and 37
  2. Server 2 stores all the books with hash key between 37 and 54
  3. Server 3 stores all the books with hash key between 54 and 87
  4. Server 4 stores all the books with hash key between 87 and 100, and between 0 and 12 (wrapping back around).
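To make this a bit more concrete, here's a minimal sketch in Java of the ring described above, using a TreeMap to hold the server positions. It follows the convention used here (a book belongs to the server whose hash key sits at or just below the book's own, wrapping around past 0), and the hash function is purely illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

// A minimal sketch of the 0-100 circle: servers and books are hashed with
// the exact same function, and a book is owned by the server whose hash key
// is at or just below the book's hash key (wrapping around past 0).
public class ConsistentHashRing {

    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Hash anything (an ISBN, a server's IP address) onto the circle.
    private int hashKey(String key) {
        return (key.hashCode() & 0x7fffffff) % 100;
    }

    public void addServer(String serverIp) {
        ring.put(hashKey(serverIp), serverIp);
    }

    public void removeServer(String serverIp) {
        ring.remove(hashKey(serverIp));
    }

    // Assumes at least one server has been added to the ring.
    public String serverFor(String isbn) {
        int key = hashKey(isbn);
        Map.Entry<Integer, String> owner = ring.floorEntry(key);
        if (owner == null) {
            // No server at or below this key: wrap around to the server
            // with the highest hash key on the circle.
            owner = ring.lastEntry();
        }
        return owner.getValue();
    }
}
```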
If you are still with me... great. Because now we are going to scale. We are going to add two more servers. Let's say server 5 is added and gets the hash key 20. And server 6 is added and gets the hash key 70.
This means:
  1. Server 1 will now only store books with hash key between 12 and 20.
  2. Server 5 will store the books with hash key between 20 and 37.
  3. Server 3 will now only store books with hash key between 54 and 70.
  4. Server 6 will store books with hash key between 70 and 87.
Server 2 and Server 4 are completely unaffected.

Ok so this means:
  1. All books still get the same hash key. Their hash keys are consistent.
  2. Books with hash keys between 20 and 37 and between 70 and 87 are now sought from new servers.
  3. The first time they are sought they won't be there; they will be re-read from the database and then cached on the respective servers. This is ok as long as it's only for a small number of books.
  4. There is a small initial impact to the system, but it's manageable.
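Continuing the sketch from above, the scaling step might look like this. The IP addresses, and the hash keys they supposedly map to, are made up for illustration:

```java
ConsistentHashRing ring = new ConsistentHashRing();
ring.addServer("10.0.0.1");   // say this hashes to 12 (Server 1)
ring.addServer("10.0.0.2");   // 37 (Server 2)
ring.addServer("10.0.0.3");   // 54 (Server 3)
ring.addServer("10.0.0.4");   // 87 (Server 4)

String before = ring.serverFor("0747592993"); // Toward the Light

// Scale out: two new servers join the ring.
ring.addServer("10.0.0.5");   // say this hashes to 20 (Server 5)
ring.addServer("10.0.0.6");   // 70 (Server 6)

String after = ring.serverFor("0747592993");
// 'before' and 'after' differ only if the book's hash key falls between
// 20 and 37 or between 70 and 87; every other book stays where it was.
```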
Now, you're probably saying:
"I get all this but I'd like to see some better distribution. When you added two servers, only two servers got their load lessoned. Could you share the benefits please?"

Of course. To do that, we allocate each server a number of small ranges rather than just one large range. So, instead of server 2 getting one large range between 37 and 54, it gets a number of small ranges. For example, it could get: 5 - 8, 12 - 17, 24 - 30, 43 - 49, 58 - 61, 71 - 74, 88 - 91. Same for all servers. The small ranges are all randomly spread, meaning that one server won't just have one adjacent neighbour but a collection of different neighbours for each of its small ranges. When a new server is added it will also get a number of ranges, and a number of different neighbours, which means its benefits will be distributed more evenly.
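One common way to get those multiple small ranges is to give each physical server several "virtual" positions on the circle, for example by hashing the server's address with a replica suffix. Dropping methods like these into the ConsistentHashRing sketch above would do it; the replica count of 8 is arbitrary, and real systems often use many more:

```java
// Sketch: place each physical server at several "virtual" points on the
// circle instead of just one, so its ranges are scattered around the ring.
private static final int REPLICAS = 8; // arbitrary for illustration

public void addServer(String serverIp) {
    for (int i = 0; i < REPLICAS; i++) {
        // Hashing "ip#0", "ip#1", ... spreads the same server across
        // several points, each with different neighbours.
        ring.put(hashKey(serverIp + "#" + i), serverIp);
    }
}

public void removeServer(String serverIp) {
    for (int i = 0; i < REPLICAS; i++) {
        ring.remove(hashKey(serverIp + "#" + i));
    }
}
```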

Isn't that so cool!

Consistent hashing benefits aren't just limited to scaling. They are also brilliant for availability. Let's say server 2 goes offline. What happens is the complete opposite of what happens when a new server is added. Each one of Server 2's segments will become the responsibility of the server responsible for the preceding segment. Again, if servers are getting a fair distribution of the ranges they are responsible for, it means that when a server fails, the burden will be evenly distributed amongst the remaining servers.
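In terms of the sketch above, handling that failure is just a matter of taking the dead server's points off the ring; the floorEntry() lookup then naturally hands each of its ranges to the server holding the preceding point. (The IP and ISBN here are the made-up values from earlier.)

```java
// Server 2 goes offline, so remove its point(s) from the ring.
ring.removeServer("10.0.0.2");

// Books that used to live on Server 2 now resolve to whichever server owns
// the preceding segment; books on every other server are untouched.
String newOwner = ring.serverFor("1408702231"); // Aftershock
```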

Again the point has to be emphasised: the books never have to be rehashed. Their hashes are consistent.
References
  1. http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
  2. http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
  3. http://michaelnielsen.org/blog/consistent-hashing/

 

From http://dublintech.blogspot.com/2011/06/consistent-hashing.html

Published at DZone with permission of Alex Staveley, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Comments

Robert Craft replied on Thu, 2012/01/26 - 5:08am

Consistent hashing is a special kind of hashing. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. By using consistent hashing, only K / n keys need to be remapped on average, where K is the number of keys, and n is the number of slots. Consistent hashing could play an increasingly important role as internet use increases and as distributed systems grow more prevalent.


Muhammad Hakim replied on Wed, 2011/12/28 - 8:45pm

Great article, thank you, but I think there isn't any number that modulo 4 gives the result 4 :) i.e. xyz % 4 = 4, where xyz is a non-negative integer, isn't valid.
