Consistent Hashing explained
The idea of Consistent Hashing arises in many contexts in distributed systems, so much so that I feel it should be taught in some CS2XX course, but unfortunately we don’t have one. So here’s my attempt at explaining it, in the context of load balancing.
Load balancing
The job of load balancing is to distribute requests from users across n servers as evenly as possible, since any single server has limited CPU, memory, bandwidth, etc.
Load balancing with naive hash approach
A naive approach would be:
Take a unique identifier of the user’s request, r
Compute output = hash(r)
Assuming we have n servers, take s = output % n as the index of the server we’d like to forward the user’s request to.
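The three steps above can be sketched in a few lines of Python (the md5-based hash and the request id are just placeholders for illustration):

```python
import hashlib

def naive_pick_server(request_id: str, n: int) -> int:
    """Hash a request identifier and map it to one of n servers."""
    digest = hashlib.md5(request_id.encode()).hexdigest()
    return int(digest, 16) % n

# The same request id always lands on the same server (while n stays fixed).
server = naive_pick_server("user-42", 4)
```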
Visualizing the approach:
With a good hash function that maps inputs evenly across the output space, we can be confident that this approach balances load well. But it is not used in practice. What’s the problem?
What happens if we add/remove a server?
Let’s say we have n=4 servers and 4 user requests: r0, r1, r2, r3.
And we have:
hash(r0)=3, hash(r0)%4=3
hash(r1)=4, hash(r1)%4=0
hash(r2)=5, hash(r2)%4=1
hash(r3)=6, hash(r3)%4=2
Let’s see what happens when we have n=5 servers instead of 4:
hash(r0)=3, hash(r0)%5=3
hash(r1)=4, hash(r1)%5=4
hash(r2)=5, hash(r2)%5=0
hash(r3)=6, hash(r3)%5=1
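You can measure how disruptive this is empirically. The sketch below (hash choice and request names are arbitrary) counts how many of 10,000 simulated requests land on a different server when n goes from 4 to 5; with modulo mapping, roughly 80% of them move:

```python
import hashlib

def pick(request_id: str, n: int) -> int:
    # naive modulo mapping from the previous section
    return int(hashlib.md5(request_id.encode()).hexdigest(), 16) % n

requests = [f"req-{i}" for i in range(10_000)]
moved = sum(pick(r, 4) != pick(r, 5) for r in requests)
print(f"{moved / len(requests):.0%} of requests changed servers")
```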
The selected server for the same request changes! In the case of HTTP load balancing, if some servers keep user-related data in an in-memory cache, that data is effectively invalidated and the servers have to fetch it from the database again. With a huge number of servers, every addition or removal of a server becomes exorbitantly expensive.
Load balancing with Consistent Hashing
Consistent Hashing is here to save us from the huge cost of adding/removing servers, as well as to make load balancing more even.
Now if you think about it, in the naive hash approach the mapping of every request depends on the total number of servers, so once you change the number of servers, almost everything changes… Maybe that’s what incurs so much cost during addition/removal of servers. Let’s take a look at how Consistent Hashing addresses this problem:
Step 1: Mapping requests to a circular array
Now, instead of arranging request and server indices in a flat array, let’s imagine a circular array (a ring).
Step 2: Mapping servers to the circular array
Step 3: Assign user requests to servers
After mapping both servers and user requests to the same ring, we decide that a user request should be served by a server located nearest to the user request following the clockwise direction.
So user request 5 is served by server 17, user request 21 is served by server 32, and user request 43 is served by server 52.
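A minimal sketch of this clockwise lookup, using the same example positions (17, 32, 52 for servers; 5, 21, 43 for requests) with a sorted list and binary search:

```python
import bisect

class ConsistentHashRing:
    def __init__(self, server_positions):
        # positions of the servers on the ring, kept sorted
        self.positions = sorted(server_positions)

    def pick_server(self, request_position: int) -> int:
        # find the first server at or after the request, going clockwise
        i = bisect.bisect_left(self.positions, request_position)
        # the modulo wraps a request past the last server back to the first
        return self.positions[i % len(self.positions)]

ring = ConsistentHashRing([17, 32, 52])
print(ring.pick_server(5))   # 17
print(ring.pick_server(21))  # 32
print(ring.pick_server(43))  # 52
print(ring.pick_server(60))  # wraps around to 17
```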
More discussion
So, this is consistent hashing! Now back to the issue that makes the naive hash approach unusable: how does consistent hashing avoid incurring a huge cost when adding a new server? Actually, let’s add a new server:
When a new server 25 is added to the ring, user request 21 will be served by server 25 instead of server 32, while all other user requests are still served by the same servers!
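We can make the cheap addition concrete with a small self-contained sketch (the helper is illustrative, not a production API): re-running the clockwise lookup with server 25 added, only request 21 is reassigned.

```python
import bisect

def pick(server_positions, request_position):
    positions = sorted(server_positions)
    i = bisect.bisect_left(positions, request_position)
    return positions[i % len(positions)]  # wrap around the ring

requests = [5, 21, 43]
before = {r: pick([17, 32, 52], r) for r in requests}
after = {r: pick([17, 25, 32, 52], r) for r in requests}
print(before)  # {5: 17, 21: 32, 43: 52}
print(after)   # {5: 17, 21: 25, 43: 52}
```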
There’s still one more problem: what if a particular sector of the ring receives more requests (meaning the hash function is not so uniform in some places), or certain requests are more computationally intensive? The servers handling those requests will be overloaded.
To mitigate this problem while keeping the same number of servers, we place virtual server nodes at several points on the ring. For example, if we have a server whose index is s, we apply 10 hash functions to s and get hash1(s), hash2(s), …, hash10(s) as virtual server nodes.
Then we put these virtual server nodes on the ring; again, every user request is served by the nearest server node in the clockwise direction.
If we do this for all servers, with a large enough number of hash functions, each sector of the ring will be served by several servers, making the load balancing even more uniform.
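One way to sketch virtual nodes is to salt the server’s name with a counter before hashing, which stands in for applying 10 distinct hash functions (the ring size and naming scheme here are made up for illustration):

```python
import hashlib

RING_SIZE = 1024  # small ring for illustration; real systems often use 2**32

def vnode_positions(server_name: str, vnodes: int = 10) -> list:
    """Derive several ring positions for one physical server."""
    positions = []
    for k in range(vnodes):
        # salting the name emulates using vnodes distinct hash functions
        digest = hashlib.md5(f"{server_name}#{k}".encode()).hexdigest()
        positions.append(int(digest, 16) % RING_SIZE)
    return positions

# Each physical server now owns several scattered points on the ring.
ring = {}
for server in ["server-a", "server-b", "server-c"]:
    for pos in vnode_positions(server):
        ring[pos] = server
```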
Alright, that’s the basic idea of Consistent Hashing. I hope you enjoyed reading and get to apply it someday (most likely in an interview lol).