How Bloom Filters Work: The Probabilistic Set
Can you check if a username is taken without querying the database? Yes. Bloom Filters are a space-efficient probabilistic data structure.
Abstract AlgorithmsTLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in $O(1)$ constant space. It can return false positives (say "yes" when the answer is "no") but never false negatives (never says "no" when the answer is "yes"). Tuning array size and hash count controls the false positive rate.
๐ The Nightclub Guestlist with No Memory
Imagine a nightclub with a strict guestlist. The bouncer has a system with ten switches (0 or 1). When Alice registers, her name is hashed by three different functions: switch 2, 5, and 8 are turned ON. When Bob registers, switch 1, 5, and 9 are turned ON.
At the door: "Is Charlie here?"
- Hash Charlie โ switches 2, 7, 4.
- Switch 7 is OFF. Definitely not registered. (No false negative.)
"Is Alice here?"
- Hash Alice โ switches 2, 5, 8.
- All three are ON. Probably registered. (Might be a false positive โ someone else turned on those switches.)
This is the Bloom filter. It uses space proportional to the bit array, not to the number of items.
๐ข How Three Hash Functions Set Bits in a Bit Array
Inserting "Alice" into a 10-bit Bloom filter with 3 hash functions:
Position: 0 1 2 3 4 5 6 7 8 9
Initial: 0 0 0 0 0 0 0 0 0 0
hash1("Alice") = 2 โ set bit 2
hash2("Alice") = 5 โ set bit 5
hash3("Alice") = 8 โ set bit 8
After Alice: 0 0 1 0 0 1 0 0 1 0
Querying "Charlie":
hash1("Charlie") = 2 โ bit 2 is 1 โ
hash2("Charlie") = 7 โ bit 7 is 0 โ โ DEFINITELY NOT IN SET
Querying "Dave" (false positive case):
hash1("Dave") = 2 โ bit 2 is 1 โ (set by Alice)
hash2("Dave") = 5 โ bit 5 is 1 โ (set by Alice)
hash3("Dave") = 8 โ bit 8 is 1 โ (set by Alice)
โ All bits set โ "Probably in set" โ FALSE POSITIVE
โ๏ธ False Positives Are Guaranteed; False Negatives Are Not
Why can you never have a false negative?
When you insert an element, you set specific bits. When you query the same element, you check the exact same bits. Those bits can only be 1 (we set them). They cannot have been unset โ there's no delete operation.
Why can you have a false positive?
Multiple elements share the same bit positions. If Alice set bits 2, 5, 8 and Dave's query also maps to 2, 5, 8 โ Dave gets a false positive even though he was never inserted.
The false positive probability formula:
$$P_{\text{fp}} \approx \left(1 - e^{-kn/m}\right)^k$$
- $k$: number of hash functions
- $n$: number of elements inserted
- $m$: size of the bit array
| Action | Effect on $P_{\text{fp}}$ |
| Larger array $m$ | Lower (fewer collisions) |
| More elements $n$ | Higher (more bits set) |
| More hash functions $k$ | Non-monotonic โ there's an optimal $k = (m/n) \ln 2$ |
๐ง Sizing a Bloom Filter: Choosing $m$ and $k$
Given $n$ expected elements and desired false positive rate $p$:
$$m = -\frac{n \ln p}{(\ln 2)^2} \qquad k = \frac{m}{n} \ln 2$$
Example: 1 million elements, 1% false positives
- $m \approx 9.585 \times 10^6$ bits โ 1.2 MB
- $k \approx 7$ hash functions
Compare: storing 1 million 100-byte strings = 100 MB. The Bloom filter is 80ร smaller.
๐ Where Bloom Filters Appear in Real Systems
| System | Use case |
| Cassandra | Before a disk read, check the Bloom filter to skip SSTable files that definitely don't contain the key |
| Google Chrome | Safe Browsing โ check URLs against a Bloom filter of malicious sites before full network lookup |
| Bitcoin | SPV wallets use Bloom filters to ask nodes for transactions matching their addresses without revealing their full address list |
| HBase / LevelDB / RocksDB | Per-SSTable Bloom filters to reduce disk I/O for key lookups |
| CDN / web cache | "Have we cached this URL before?" check without expensive database lookup |
โ๏ธ Limitations: No Delete, No Count, No Exact Answer
| Limitation | Explanation |
| No deletion | Unsetting a bit might unset it for another element. (Counting Bloom filters extend this.) |
| No exact membership | Only "probably yes" or "definitely no" |
| No retrieval | Can't get the stored value โ it's not stored, only bits |
| Size is fixed | Standard Bloom filters can't resize. Scalable Bloom filters (SBF) exist as a workaround. |
๐ Key Takeaways
- A Bloom filter uses a bit array + $k$ hash functions to test set membership in $O(1)$ constant space.
- It has no false negatives and controllable false positives.
- Optimal hash count: $k = (m/n) \ln 2$. Optimal array size: $m = -n \ln p / (\ln 2)^2$.
- 1% FP rate for 1M elements costs only ~1.2 MB โ 80ร smaller than storing the elements.
- Used in Cassandra, Chrome Safe Browsing, Bitcoin SPV, and RocksDB.
๐งฉ Test Your Understanding
- A Bloom filter returns "no" for element X. Is X in the set?
- You double the bit array size $m$ while keeping $n$ constant. Does $P_{\text{fp}}$ increase or decrease?
- Why can't you delete an element from a standard Bloom filter?
- Cassandra uses a Bloom filter before reading from disk. What skip cost does this avoid?
๐ Related Posts

Written by
Abstract Algorithms
@abstractalgorithms
More Posts
SFT for LLMs: A Practical Guide to Supervised Fine-Tuning
TLDR: Supervised fine-tuning (SFT) is the stage where a pretrained model learns task-specific response behavior from curated input-output examples. It is usually the first alignment step after pretraining and often the foundation for later RLHF. Good...
RLHF in Practice: From Human Preferences to Better LLM Policies
TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model, then optimize a policy (often with KL...
PEFT, LoRA, and QLoRA: A Practical Guide to Efficient LLM Fine-Tuning
TLDR: Full fine-tuning updates every model weight, which is expensive in memory, compute, and storage. PEFT methods update only a small trainable slice. LoRA learns low-rank adapters on top of frozen base weights. QLoRA pushes efficiency further by q...
LLM Model Naming Conventions: How to Read Names and Why They Matter
TLDR: LLM names encode practical decisions: model family, size, training stage, context window, format, and quantization level. If you can decode naming conventions, you can avoid costly deployment mistakes and choose the right checkpoint faster. ๏ฟฝ...
