Bloom Filters: Saving Gigabytes When 'Probably Not' Is Good Enough
Your database stores a billion records across many SSD-backed files. A query comes in for a key that doesn't exist. To prove it's absent, the database might have to read several files from disk — slow, wasteful work to confirm a negative. What if you could ask a tiny in-memory structure "is this key definitely not here?" and skip the disk reads entirely for most misses?
That structure is a Bloom filter, and it's quietly doing exactly this inside Cassandra, RocksDB, HBase, and most LSM-tree databases right now. It's also one of the most elegant data structures in computer science: a few kilobytes of bits that lets you avoid gigabytes of pointless work, in exchange for occasionally being wrong in one specific, controllable direction.
The one trade you must understand
A Bloom filter answers set-membership with an asymmetry that is the entire point:
- If it says "not present," that is always true. No false negatives, ever.
- If it says "present," that is probably true — but might be a false positive.
So the useful question to ask a Bloom filter is the negative one. "Is this key definitely absent?" gets you a trustworthy yes, which lets you skip expensive work. "Is this key present?" gets you a maybe, which you then confirm with the real (expensive) lookup. You never get a wrong "no," and a wrong "yes" only costs you a wasted check, never a correctness bug.
Internalize that asymmetry and everything else follows.
How it works
A Bloom filter is a bit array of m bits (all zero initially) and k independent hash functions.
To add an element: hash it with all k functions, each producing an index into the array, and set those k bits to 1.
To query an element: hash it the same way and check those k bits. If any is 0, the element was definitely never added (you'd have set that bit). If all are 1, it's probably present — but those bits might all have been set by other elements that happened to collide. That coincidence is the false positive.
import hashlib
class BloomFilter:
def __init__(self, m_bits: int, k_hashes: int):
self.m = m_bits
self.k = k_hashes
self.bits = bytearray(m_bits // 8 + 1)
def _indexes(self, item: str):
h = hashlib.sha256(item.encode()).digest()
# derive k indexes from two halves of the hash (Kirsch–Mitzenmacher)
h1 = int.from_bytes(h[:16], "big")
h2 = int.from_bytes(h[16:], "big")
for i in range(self.k):
yield (h1 + i * h2) % self.m
def add(self, item: str):
for idx in self._indexes(item):
self.bits[idx // 8] |= 1 << (idx % 8)
def __contains__(self, item: str) -> bool:
return all(
self.bits[idx // 8] & (1 << (idx % 8))
for idx in self._indexes(item)
)
Notice there's no remove. You can't safely clear bits, because each bit may be shared by many elements — zeroing it would create false negatives for those others. (A counting Bloom filter, which stores small counters instead of single bits, supports deletion at the cost of more space.)
Sizing it: the math you actually use
Two parameters control everything: the number of bits m and the number of hash functions k. Get them right and you hit your target false-positive rate; get them wrong and you've either wasted memory or built a filter that says "maybe" to everything.
For n expected elements and a desired false-positive probability p:
m = -(n · ln p) / (ln 2)² # bits needed
k = (m / n) · ln 2 # optimal number of hashes
The numbers are genuinely striking. To hold one million items at a 1% false-positive rate, you need about 1.2 MB and 7 hash functions — roughly 9.6 bits per element, regardless of how big each element actually is. Storing those million items as real strings could be tens or hundreds of megabytes. That's the magic: the filter's size depends on the count of elements and your tolerance for error, not on the size of each element.
Two things people get wrong here:
- Overshooting
khurts. More hash functions set more bits per element, filling the array faster. Past the optimalk, your false-positive rate goes up, not down. Use the formula. - Overfilling silently degrades. A filter sized for
nitems behaves much worse at5n. As it saturates toward all-ones, it starts answering "maybe" to everything and stops saving you anything. If your element count can grow unboundedly, you need a scalable variant or periodic rebuilds.
Where it earns its keep
The pattern is always the same: a cheap probabilistic filter in front of an expensive definitive check.
- LSM-tree databases (Cassandra, RocksDB, HBase) keep a Bloom filter per on-disk segment. Before reading a segment to look for a key, they ask its filter "could this key be here?" A "no" skips the disk read. This single optimization is why these databases stay fast on read-heavy workloads with lots of misses.
- Caching layers use them to dodge the "cache penetration" problem — floods of requests for keys that exist in neither cache nor database. The filter rejects known-absent keys before they ever hit the backing store.
- CDNs and proxies use "have I seen this URL before?" filters to decide whether something is worth caching on first sight.
- Distributed systems ship Bloom filters across the network to summarize "the set of keys I have" compactly, so peers can avoid asking for data the other side definitely lacks.
When not to reach for one
A Bloom filter is the wrong tool if:
- You can't tolerate any false positive. If a wrong "maybe present" leads to a correctness bug rather than a harmless extra check, don't use one — or use it only as a fast-path filter with a definitive confirmation behind it.
- The set is small. For a few thousand items, a plain hash set is simpler, exact, and the memory savings are irrelevant. Bloom filters pay off at scale.
- You need to enumerate or delete. A Bloom filter can't list its members or remove them. If you need either, it's the wrong structure (or you want a counting/cuckoo variant).
The mental model worth keeping: a Bloom filter trades a small, bounded, one-directional amount of wrongness for a large, often order-of-magnitude reduction in memory and I/O. When the expensive operation is "go check the real source of truth" and most queries are misses, that trade is one of the best deals in systems engineering — and it's hiding inside far more of your infrastructure than you probably realized.
Keep reading
Backpressure: What Happens When Your System Can't Keep Up
A fast producer and a slow consumer is a recipe for an out-of-memory crash. Backpressure is the discipline of letting the slow part tell the fast part to wait. Here is how to design it in.
Database Connection Pooling: The Bottleneck You Forgot to Tune
More connections is not more throughput. Past a point, adding connections makes your database slower. Here is how pools actually work and how to size one without guessing.
Write-Ahead Logging: The Unsung Hero of Database Durability
How does a database survive a power cut mid-write without corrupting your data? The answer is a deceptively simple rule: log the change before you apply it. Here is why WAL is everywhere.
Newsletter
New posts, straight to your inbox
One email per post. No spam, no tracking pixels, unsubscribe anytime.
Comments
- No comments yet. Be the first.