Consistent Hashing: How Distributed Systems Avoid the Great Reshuffle
You run a cache cluster of four nodes. A key lands on a node via hash(key) % 4. It works beautifully until the day you add a fifth node and switch to % 5. Now almost every key maps to a different node than before. Your cache hit rate collapses to near zero, every miss hammers the database, and the database falls over. You added capacity and caused an outage.
This is the problem consistent hashing solves. The idea is small, the implementation is a hundred lines, and the failure modes are subtle enough that people who "know" consistent hashing still get bitten. Let me walk through it properly.
Why modulo hashing betrays you
hash(key) % N distributes keys evenly across N nodes, which is the part everyone notices. The part that hurts is what happens when N changes. Changing the divisor changes the result for nearly every key, because the modulo of a large number is extremely sensitive to the divisor.
Concretely, moving from 4 to 5 nodes remaps roughly 80% of keys. The fraction that stays put is only about 1/max(old, new). For a cache, a remap means a miss. For a sharded database, a remap means data is now on the wrong shard and must be physically moved. Either way, a routine scaling event becomes a thundering-herd incident.
The goal we actually want: when the node count changes by one, only about 1/N of keys should move. That is what consistent hashing delivers.
The ring
Picture a circle of hash values, say 0 to 2^32 - 1, wrapping around at the top. You place both nodes and keys on this ring using the same hash function.
- Each node is hashed (e.g.,
hash("cache-node-3")) to a point on the ring. - Each key is hashed to a point on the ring.
- A key is owned by the first node found walking clockwise from the key's position.
node A (hash 10)
•
key→ • • node B (hash 90)
|
node D • • key
(270) \ /
• node C (180)
Now add a node. It lands at one point on the ring and takes over only the arc between itself and the previous node. Every other key keeps its owner. Removing a node hands its arc to the next node clockwise, and again nothing else moves. That is the whole trick: a node is responsible for a contiguous arc, and adding or removing a node disturbs exactly one arc.
The catch: uneven distribution
A pure ring with one point per node has a problem. With only a handful of nodes hashed to random positions, the arcs are wildly uneven — one node might own 40% of the ring and another 5%. Hashing is uniform in expectation, but with five samples the variance is brutal. You bought balanced scaling and got lopsided load.
The fix is virtual nodes (sometimes called vnodes or replicas). Instead of placing each physical node once, you place it many times — say 150 points per node, each derived from hash("cache-node-3#0"), hash("cache-node-3#1"), and so on.
import hashlib
import bisect
class HashRing:
def __init__(self, nodes=None, vnodes=150):
self.vnodes = vnodes
self.ring = {} # hash -> node
self.sorted_keys = [] # sorted hash positions
for node in nodes or []:
self.add(node)
def _hash(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add(self, node: str):
for i in range(self.vnodes):
h = self._hash(f"{node}#{i}")
self.ring[h] = node
bisect.insort(self.sorted_keys, h)
def remove(self, node: str):
for i in range(self.vnodes):
h = self._hash(f"{node}#{i}")
del self.ring[h]
self.sorted_keys.remove(h)
def get(self, key: str) -> str:
if not self.ring:
return None
h = self._hash(key)
idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
return self.ring[self.sorted_keys[idx]]
With 150 vnodes per physical node, the standard deviation of load drops dramatically — each node ends up owning many small arcs scattered around the ring, and the law of large numbers smooths the distribution. The lookup is a binary search over the sorted positions, so it stays O(log V) where V is the total number of vnodes.
The number of vnodes is a real tuning knob. Too few and load is uneven; too many and your ring data structure bloats and add/remove get slow. Somewhere between 100 and 200 per node is a sane default for most systems; very large clusters sometimes go lower per node because the node count itself provides averaging.
Bounded loads: when "mostly even" is not even enough
Virtual nodes give you statistical balance, but a hot key or a slightly unlucky placement can still overload one node. If a single popular key maps to one node, vnodes do not help — that key has one home.
Google's consistent hashing with bounded loads addresses the overload case: define a capacity c = (1 + ε) · average_load, and when a key's natural owner is at capacity, walk clockwise to the next node with spare room. This caps any node at (1 + ε) times the mean while still moving only a small number of keys when the cluster changes. It is worth knowing about the day someone shows you a load graph with one node pinned at 100% and the rest idle.
For genuinely hot individual keys, the answer is upstream of the ring: replicate the hot key to multiple nodes, or add a small local cache in front. No partitioning scheme fixes a key that is simply too popular for one machine.
Where the ring quietly breaks
A few things bite teams in production:
- Different hash functions on different clients. Every client that routes to the ring must use the exact same hash function and the exact same vnode-naming scheme. One service using MD5 and another using a 32-bit truncation will route the same key to different nodes, and you will chase a phantom consistency bug for a week.
- Non-deterministic node identity. If node names include a restart timestamp or an ephemeral pod IP, every restart reshuffles the ring. Use stable identities.
- Replication and the ring. Storing
Nreplicas of a key usually means taking the nextNdistinct physical nodes clockwise — note distinct physical, not the nextNvnodes, which might all belong to the same machine. Forgetting the "distinct physical" rule silently kills your redundancy. - Membership disagreement. During a deploy, some clients see the old ring and some see the new one. For a cache this is a brief, self-healing miss storm. For a datastore it can mean reads and writes routing to different nodes — you need a membership protocol (gossip, a coordinator, or a config service) so the ring converges quickly.
When you actually need it
Consistent hashing earns its keep when you have a stateful, partitioned system where node membership changes and remapping is expensive: distributed caches (Memcached clients, the original use case), sharded datastores (DynamoDB, Cassandra, Riak all use a ring or a close cousin), and load balancers that want session affinity without sticky cookies.
If your data lives behind a single database and the cache is small enough to lose entirely on a deploy, you do not need this — % N and a cold cache might be fine. The moment a remap means moving terabytes or melting your origin, reach for the ring. It is one of those algorithms that looks like a curiosity until the day it is the only thing standing between you and a very long night.
Keep reading
Backpressure: What Happens When Your System Can't Keep Up
A fast producer and a slow consumer is a recipe for an out-of-memory crash. Backpressure is the discipline of letting the slow part tell the fast part to wait. Here is how to design it in.
Database Connection Pooling: The Bottleneck You Forgot to Tune
More connections is not more throughput. Past a point, adding connections makes your database slower. Here is how pools actually work and how to size one without guessing.
Write-Ahead Logging: The Unsung Hero of Database Durability
How does a database survive a power cut mid-write without corrupting your data? The answer is a deceptively simple rule: log the change before you apply it. Here is why WAL is everywhere.
Newsletter
New posts, straight to your inbox
One email per post. No spam, no tracking pixels, unsubscribe anytime.
Comments
- No comments yet. Be the first.