Consistent Hashing
Author: Lucian Oprea (@LucianDSA_)
Why do we need Hashing?
- Distribute data evenly among multiple nodes.
- Retrieve data efficiently without searching a whole cluster full of nodes.
The traditional way to solve these problems is through the modulus operator.
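To make this concrete, here is a minimal sketch of modulo placement in Python; the key format and the use of MD5 are illustrative choices, not a prescription.

```python
import hashlib

def server_index(key: str, num_servers: int) -> int:
    """Map a key to a server by hashing it and taking the modulus."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

# With 4 servers, every key deterministically lands on one of them.
print(server_index("user:42", 4))
```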
The Problem with Modulo Hashing 💣
This modulus hashing approach to distributing the data works well, but only as long as the number of servers remains fixed.
If we need to add or remove nodes, we run into problems.
If we apply the same modular operation, we'll get different server indexes because the number of nodes has changed.
This means we won't be able to retrieve existing data using the same hashing function.
A solution to this is to rehash all the existing keys and assign them to different servers.
This is expensive for two reasons:
- First, we need to recalculate the server_id for all the keys.
- Second, we need to move most of the keys to the other available nodes.
In the worst case, we could end up moving all the data in the system. This kind of operation is costly in terms of time and hardware resources.
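A rough way to see the scale of the problem: count how many sample keys change servers when the cluster grows from 4 to 5 nodes under modulo placement (the key names and counts here are illustrative).

```python
import hashlib

def server_index(key: str, num_servers: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_servers

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if server_index(k, 4) != server_index(k, 5))
print(f"{moved / len(keys):.0%} of keys moved")  # typically around 80%
```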
What is Consistent Hashing? 🛟
Consistent Hashing is an algorithm that solves these problems when we need to dynamically add or remove servers.
Below you can see the amount of data that needs to be redistributed, in the worst case, when we add or remove a node.
| Algorithm | Impacted data when adding a node | |
| --- | --- | --- |
| Modulo Hashing | All data | 👎 |
| Consistent Hashing | All data / No. of nodes | 👍 |
This is without considering Virtual Nodes, which can further reduce the number of impacted keys.
Let's say we store 100 million keys, distributed across 5 servers. With modulo distribution, when we add or remove a server, all the keys stored in the system are impacted, that is, all 100 million.
When using Consistent Hashing, only a subset of the keys is affected. In this case, the number of impacted keys is the total number of keys (100 million) divided by the number of servers (5), which is 20 million keys.
How does Consistent Hashing work?
First, it uses a conceptual Hash Ring.
We'll understand later why this algorithm uses a ring.
For now, let's consider a ring with a total number of hash values, N.
Then, the algorithm maps every server to a point on the circle.
Here, we can use the IP address of the servers to determine the location point.
We calculate the hash of the IP address, and the result is a point on the circle (angle).
For Node 1, a hash is calculated, and its value, s1, is added to the ring.
And we do the same for all the servers.
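As a small sketch, placing a server on the ring can look like this; the ring size of 2**32 and MD5 are illustrative assumptions, not requirements of the algorithm.

```python
import hashlib

RING_SIZE = 2 ** 32  # total number of hash values N on the ring (illustrative)

def ring_position(value: str) -> int:
    """Hash a server's IP address (or a key) onto a point of the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

print(ring_position("10.0.0.1"))  # the point s1 for Node 1
```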
You might observe that the node distribution is not really uniform across the ring.
We’ll see later how to distribute them more evenly.
Next, we do the same for the keys.
Every key is hashed, using the same hashing algorithm, and assigned a point on the circle.
Now, how do we determine which node is responsible for a certain key?
We take a key and move in a clockwise direction from its position until we find the nearest server.
Thus, k1 will be allocated to server 1; k2 will be stored on server 3; and k3 on server 1 as well.
So, for the given set of keys, we end up with an allocation of servers as above.
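Here is a compact sketch of that lookup rule: nodes and keys are hashed with the same function, and a key is served by the first node found clockwise. The class and method names (HashRing, add_node, get_node) are illustrative.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def ring_position(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    def __init__(self):
        self._positions = []   # sorted positions of nodes on the ring
        self._nodes = {}       # position -> node name

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def get_node(self, key: str) -> str:
        # Walk clockwise: the first node position >= the key's position,
        # wrapping around to the start of the ring if we pass the end.
        pos = ring_position(key)
        idx = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._nodes[self._positions[idx]]

ring = HashRing()
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    ring.add_node(ip)
print(ring.get_node("key1"))  # whichever server sits next clockwise
```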
Scaling & Adding a new server
Now, let's see what happens when we add a new node to the cluster.
The expectation is that, with this algorithm, only a fraction of the keys would need redistribution.
As described previously, we first calculate the hash of the new server’s IP, and find its location on the circle.
We see that it's located between s3 and s1 on the circle.
After server s4 is added, only key k3 needs to be re-allocated.
The rest of the keys, k1 and k2, remain on the same servers.
But why should only k3 be re-allocated?
If we go clockwise on the ring from k3's position to the nearest server, we now find server s4.
It's not server s1 anymore.
The other keys, following the same rule, still map to their original servers, so they are not redistributed.
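Continuing the HashRing sketch from above, we can add a fourth server and check which sample keys actually change owners; only the keys whose clockwise walk now stops at the new node move.

```python
keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("10.0.0.4")  # the new server s4
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")  # roughly 1/N on average
```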
Uniform Distribution Issue 🧨
While using the consistent hashing algorithm, we might still end up with an irregular distribution of data.
For instance, here, server s1 handles a larger chunk of data than the other servers.
This server can become a bottleneck in a distributed system and reduce the overall performance of the system.
There are actually 2 main issues that lead to this problem.
First, the space between adjacent servers is random; it can be very small or fairly large. This is especially true when servers are constantly added or removed.
Second, the keys are also distributed on the ring, in a non-uniform manner.
For instance, here, most of the keys are stored on server 1. However, server 2 has no keys.
Next, we'll see which technique can solve these problems.
Virtual Nodes
We'll make use of virtual nodes, an elegant solution for distributing the load more evenly across the nodes.
Basically, instead of applying a single hash function to a node, we'll apply multiple hash functions to the same node key.
For instance, we can consider 3 hash functions instead of one.
This means that each physical node will be represented by multiple virtual nodes on the ring.
Instead of using s1, we have s1_1, s1_2, and s1_3 to represent server 1 on the ring.
Similarly, for server 2 we have the following virtual nodes on the ring. (s2_1, s2_2, and s2_3)
And so on.
Now, for the keys, we continue to use only one hash function.
Wherever the key lands on the ring, it's processed by the next virtual node found while moving in the clockwise direction.
However, now each server has three positions on the ring, so the load of requests is more uniform.
This can be observed in the following table as well.
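Here is a sketch of the same ring with virtual nodes: each physical server is hashed several times ("10.0.0.1#0", "10.0.0.1#1", ...), so it occupies several positions. The replica count, the naming scheme, and the optional weight parameter are illustrative choices.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def ring_position(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class VirtualNodeRing:
    def __init__(self, vnodes_per_node: int = 3):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []
        self._nodes = {}

    def add_node(self, node: str, weight: int = 1) -> None:
        # A heavier node gets proportionally more virtual nodes (and thus more keys).
        for i in range(self.vnodes_per_node * weight):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._nodes[pos] = node

    def get_node(self, key: str) -> str:
        # Keys still use a single hash; the clockwise rule is unchanged.
        pos = ring_position(key)
        idx = bisect.bisect_left(self._positions, pos) % len(self._positions)
        return self._nodes[self._positions[idx]]

ring = VirtualNodeRing(vnodes_per_node=3)
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    ring.add_node(ip)  # s1_1..s1_3, s2_1..s2_3, s3_1..s3_3
print(ring.get_node("key1"))
```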
How many Virtual Nodes?
Moreover, if a certain node has more hardware capacity than the others, we can assign it more virtual nodes by using additional hash functions.
This way, it’ll have more positions in the ring, and serve more requests.
Of course, in real-world systems, the number of virtual nodes is much larger than 9.
As the number of virtual nodes increases, the distribution of keys becomes more balanced.
If the number of virtual nodes were equal to the number of keys, we would have a perfect distribution.
However, that is not really feasible, and we don’t actually need so many nodes for a good distribution.
For example, with 100 virtual nodes, we might see a difference of about 10% between the nodes in the number of stored keys.
With 200 virtual nodes, the deviation might drop to about 5%.
This deviation is also referred to as the standard deviation (sigma, or σ), and it shows how dispersed the data is in relation to the mean, or average.
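As a rough experiment for this claim, we can reuse the VirtualNodeRing sketch above and measure how the spread of keys per server shrinks as the number of virtual nodes grows; the exact percentages will vary with the hash function and sample size.

```python
import statistics

keys = [f"key-{i}" for i in range(100_000)]
servers = [f"10.0.0.{i}" for i in range(1, 6)]

for vnodes in (1, 100, 200):
    ring = VirtualNodeRing(vnodes_per_node=vnodes)
    for s in servers:
        ring.add_node(s)
    counts = {s: 0 for s in servers}
    for k in keys:
        counts[ring.get_node(k)] += 1
    mean = statistics.mean(counts.values())
    sigma = statistics.pstdev(counts.values())
    print(f"vnodes={vnodes:>3}: deviation ≈ {sigma / mean:.1%} of the mean")
```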
Conclusions
In this section, we had an in-depth discussion about consistent hashing, including why it is needed and how it works.
The benefits of consistent hashing include:
- Keys are redistributed with minimal impact when servers are added or removed.
- It is easy to scale horizontally because data is more evenly distributed.
- It mitigates the hotspot key problem, where excessive access to a specific shard could cause server overload.
In conclusion, when using Consistent Hashing we can easily add or remove servers without affecting the whole system; only a fraction of the keys needs to be rearranged.
This means that we can scale up or down more easily.
Furthermore, the technique of virtual nodes distributes the keys more evenly across the cluster.
This way we avoid the problem of overloading certain servers.