Probabilistic Data Structures Explained: Bloom Filters, HyperLogLog, and Count-Min Sketch
How Google validates usernames in milliseconds without querying a database — and the family of approximate data structures that make it possible.
TL;DR: Probabilistic data structures — Bloom Filters, Count-Min Sketch, HyperLogLog, and Cuckoo Filters — trade a small, bounded probability of being wrong for orders-of-magnitude better memory efficiency and O(1) speed. Bloom filters answer "definitely not in this set" in constant time; HyperLogLog counts 2^64 unique elements in 12KB; Count-Min Sketch estimates event frequencies with sub-1% error in a fixed-size array. Google uses a Bloom filter to pre-screen username availability before touching a database at all.
📖 Why Exact Data Structures Break at Scale
When you type a username into Google's sign-up form, a response arrives within 50 milliseconds telling you whether that username is already taken. Google has roughly three billion active accounts. A round-trip database query — even against a perfectly indexed table — cannot reliably complete in 50ms at global scale when contending with network latency, connection pooling, cache misses on cold rows, and sheer query volume from millions of sign-up attempts every hour.
So Google does not query the database first.
Before the database ever sees the request, a Bloom filter (a compact, in-memory bit array) checks whether the username is in the taken-names set. If the filter says "definitely not taken," the username is returned as available instantly — zero database queries, zero disk reads, sub-millisecond response. Only when the filter says "possibly taken" does the server bother querying the database to confirm. That single pre-filter eliminates more than 99% of database round-trips for username availability checks.
This is not a trick unique to Google. The same class of data structure — called probabilistic data structures — powers some of the most performance-critical systems in the industry:
- Apache Cassandra uses a Bloom filter per SSTable to skip disk reads. Before scanning a file on disk, Cassandra checks the filter. "Definitely not in this file" → the disk read never happens.
- Chrome Safe Browsing stores hundreds of millions of known-malicious URLs in a Bloom filter embedded in the browser itself. Your browser checks locally before sending any outbound query to Google.
- Redis ships HyperLogLog — a structure that counts how many unique elements you have seen in a stream using 12KB of memory, regardless of how many billions of distinct values have been observed.
- Twitter uses Count-Min Sketch to track hashtag frequencies across millions of tweets per minute in a fixed-size in-memory counter grid.
What makes all of these possible is the same insight: most real-world questions don't require an exact answer.
- "Is this username taken?" only needs: "definitely no" (safe to skip the DB) or "probably yes" (check the DB to be sure).
- "How often did this hashtag appear in the last five minutes?" needs a close estimate — within 1% is fine for showing trending topics.
- "How many unique visitors did this page get today?" needs an approximate count — within 1% is fine for analytics.
The trade is always the same: you accept a small, bounded false positive rate (the probability of the structure incorrectly saying "yes" when the answer is "no") in exchange for orders-of-magnitude better memory use and constant-time operations. False negatives (incorrectly saying "no" when the answer is "yes") are mathematically impossible in Bloom filters and Count-Min Sketch, which makes their error asymmetry safe to exploit. We will return to each of these terms with precise definitions as we encounter each structure.
This post covers the full probabilistic data structure family. For a focused deep-dive on Bloom filter internals alone — bit array sizing formulas, optimal hash function count, and FPR tuning — see How Bloom Filters Work.
🔍 How Bloom Filters Answer "Definitely Not" in O(1)
A Bloom filter is a bit array of size m, initialized entirely to zero, combined with k independent hash functions. Every element inserted into the filter is run through all k hash functions, and the resulting k bit positions are set to 1. To check membership, an element is hashed through the same k functions; if every one of those bit positions is already 1, the filter says "probably present." If any of those positions is 0, the filter says "definitely not present."
The asymmetry here is the entire point. A false negative (saying "definitely not present" when the element was actually inserted) is impossible — because inserting an element sets those bits to 1 permanently, and bits can only go from 0 to 1, never back. A false positive (saying "probably present" when the element was never inserted) is possible when other elements' hash collisions happen to cover the same bit positions.
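These mechanics fit in a few lines of code. Below is a minimal, illustrative Python sketch (not any particular production implementation): it derives the k positions from two halves of a single SHA-256 digest via double hashing, where real deployments would use faster non-cryptographic hashes such as MurmurHash.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m-bit array packed into bytes

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two base hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # bits only ever go 0 -> 1

    def might_contain(self, item: str) -> bool:
        # Any unset bit means "definitely not present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(m=10_000, k=7)
bf.add("john_smith_99")
print(bf.might_contain("john_smith_99"))  # True: no false negatives possible
print(bf.might_contain("jane_doe_1984"))  # False, barring an (astronomically unlikely) collision
```

Note there is no `remove` method: as the deletion discussion later in this post explains, clearing bits is unsafe because they may be shared with other elements.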
The Google Username Availability Check — Step by Step
Here is how the username validation works end-to-end, without a database query for the happy path:
- User types `john_smith_99` in the sign-up form.
- Browser sends an async HTTP request to the availability-check endpoint.
- Server hashes `john_smith_99` through k hash functions, producing bit positions: `[47, 203, 891, 1042]`.
- Server checks those four bit positions in the in-memory Bloom filter bit array.
- If any bit is 0: the username is definitely not taken. Response: "available" — returned in under 1ms. Zero database queries.
- If all bits are 1: the username might be taken. The server issues a database query to confirm.
- The database query either confirms the username is taken (true positive) or reveals it was a false positive (bits coincidentally set by other usernames).
- Response: total latency 5–50ms for the "might be taken" path, versus 200–500ms if every check went to the database directly.
The diagram below shows this flow. Read it top to bottom: the filter acts as a gate, deflecting the vast majority of requests before they reach the database at all.
```mermaid
flowchart TD
    A([User types username in sign-up form]) --> B[Server hashes username through k hash functions]
    B --> C{Check k bit positions in Bloom filter bit array}
    C -->|Any bit is 0| D[Definitely NOT taken\nReturn 'available' immediately\nNo DB query]
    C -->|All bits are 1| E[Possibly taken\nIssue database lookup]
    E --> F{Database confirms?}
    F -->|Yes — username exists| G[Username taken\nReturn 'unavailable']
    F -->|No — false positive| H[Username available\nReturn 'available']
    D --> Z([Response delivered in < 1ms])
    G --> Z
    H --> Z
```
The Bloom filter here acts as a first-line pre-screener. Roughly 99%+ of new username attempts are genuinely novel — they hash to at least one zero bit, are rejected by the filter, and never reach the database. Only the small fraction of attempts that are either genuinely taken or unlucky enough to collide with existing entries require a DB query. This is how a system can be globally responsive at scale without over-provisioning database capacity.
False positive rate. The approximate probability of a false positive is (1 − e^(−kn/m))^k, where n is the number of items inserted, m is the bit array size in bits, and k is the number of hash functions. In practice:
| Target FPR | Bits per element | Memory for 1B elements |
| --- | --- | --- |
| 1% | ~10 bits | ~1.2 GB |
| 0.1% | ~15 bits | ~1.8 GB |
| 0.01% | ~20 bits | ~2.4 GB |
Compare this to storing 1 billion strings in a HashSet: with 64-bit hashes and pointer overhead, you are looking at 8–16 bytes per element, or roughly 8–16 GB. A Bloom filter achieves the same membership check in 1.2 GB with 1% FPR — a 6–13× memory saving.
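The table's figures fall straight out of the formula. The snippet below is a quick numerical check (not production code): it sizes a filter at ~10 bits per element and uses the standard optimal hash count k = (m/n)·ln 2.

```python
import math

def bloom_fpr(n: int, m: int, k: int) -> float:
    """Approximate false positive rate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

n = 1_000_000_000               # one billion inserted elements
m = 10 * n                      # ~10 bits per element
k = round(m / n * math.log(2))  # optimal k = (m/n) ln 2, here 7

print(f"k = {k}")                          # 7
print(f"FPR ≈ {bloom_fpr(n, m, k):.3%}")   # just under 1%
print(f"memory = {m / 8 / 10**9:.2f} GB")  # 1.25 GB
```

Undersizing m is the dangerous direction: rerunning this with n ten times larger than the filter was sized for shows why the FPR blows up, as discussed in the failure-modes section below.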
What Bloom filters cannot do. They cannot delete elements: setting bits back to 0 might clear bits shared by other inserted elements, creating false negatives. They cannot tell you what is in the set — only whether a specific element is likely present. They cannot count how many times an element was inserted. When deletion is required, the Cuckoo Filter (covered in the Deep Dive section) is the correct replacement.
Production deployments of Bloom filters.
- Apache Cassandra and HBase: Each SSTable file on disk has an associated Bloom filter. On a read, Cassandra checks the filter before opening the file. "Definitely not in this SSTable" → the disk I/O never happens. In read-heavy workloads this dramatically reduces I/O amplification.
- Chrome Safe Browsing: Google distributes a Bloom filter containing ~600 million known-malicious URLs to every Chrome browser. The browser performs a local filter check before sending any network request to Google's Safe Browsing API — keeping most browsing private and latency near zero.
- Apache HBase, Elasticsearch, ScyllaDB: All apply Bloom filters to storage file indexes for the same SSTable-skip optimization.
- Ethereum: Every block header contains a Bloom filter of all log entries and addresses touched in that block, allowing light clients to quickly check whether a specific event occurred without downloading the full block data.
⚙️ Count-Min Sketch: Tracking Hashtag Frequency at Twitter Scale
Consider Twitter's trending topics feature. At peak traffic, tens of millions of tweets arrive per minute. For each tweet, you want to increment a counter for every hashtag in it and keep a running top-K list of the most frequent hashtags in the last five minutes. An exact approach — a HashMap of hashtag → count — sounds straightforward until you realize there are millions of distinct hashtags active at any moment. Keeping exact counts for all of them requires unbounded memory that grows with every new hashtag ever tweeted, and updating millions of HashMap entries per second creates contention hotspots.
A Count-Min Sketch (CMS) solves this in a fixed amount of memory with bounded error.
Structure. A CMS is a 2D counter array with d rows and w columns, alongside d independent hash functions (one per row). The entire array is initialized to zero. Its size is fixed at creation time and never grows.
Insert (increment). For each row i from 1 to d, hash the element to a column position j = h_i(element) mod w, then increment counter[i][j] by 1 (or by the event weight). Each element increments exactly d counters — one per row.
Query (estimate frequency). For each row i, hash the element to j = h_i(element) mod w and read counter[i][j]. Return the minimum value across all d rows.
The minimum is the core insight. Counters can only be over-estimated — multiple elements that collide in the same column all increment the same counter, inflating it. But a counter can never be under-estimated because we only increment, never decrement. Taking the minimum across all rows exploits the fact that, while any individual row might have collisions inflating a counter, it is unlikely that every row's counter for a given element is inflated at the same time. The minimum gives the tightest available estimate.
The diagram below shows a 3-row × 6-column sketch. Inserting the element #kubernetes hashes to column 2 in row 1, column 5 in row 2, and column 1 in row 3 — three counters are incremented. Querying reads those same three cells and returns the minimum.
```mermaid
flowchart LR
    subgraph Insert_kubernetes ["Insert #kubernetes"]
    direction TB
    H1["h1(kubernetes) → col 2"] --> R1["Row 1, Col 2: +1"]
    H2["h2(kubernetes) → col 5"] --> R2["Row 2, Col 5: +1"]
    H3["h3(kubernetes) → col 1"] --> R3["Row 3, Col 1: +1"]
    end
    subgraph Query_kubernetes ["Query #kubernetes"]
    direction TB
    Q1["Row 1, Col 2: read value"] --> MIN
    Q2["Row 2, Col 5: read value"] --> MIN
    Q3["Row 3, Col 1: read value"] --> MIN
    MIN["min(v1, v2, v3) = estimated count"]
    end
    Insert_kubernetes --> Query_kubernetes
```
Each row hashes the element to a different column, distributing collision risk independently across rows. The minimum across all rows is the estimated frequency — always an over-estimate of the true count, never an under-estimate.
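The increment-and-take-the-minimum logic is compact enough to sketch directly. The toy Python version below slices one SHA-256 digest into per-row column indices; this is an illustrative stand-in for the d independent hash functions a production sketch would use.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2000, depth: int = 7):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]  # d rows × w columns, all zero

    def _columns(self, item: str):
        # One column index per row, carved from slices of a single digest.
        digest = hashlib.sha256(item.encode()).digest()  # 32 bytes: enough for 7 rows × 4 bytes
        for row in range(self.depth):
            chunk = int.from_bytes(digest[row * 4:(row + 1) * 4], "big")
            yield chunk % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count  # exactly d counters touched per insert

    def estimate(self, item: str) -> int:
        # Minimum across rows: the tightest available (over-)estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(width=2000, depth=7)
for _ in range(3):
    cms.add("#kubernetes")
cms.add("#docker", 5)
print(cms.estimate("#kubernetes"))  # >= 3; exact unless every row has a collision
print(cms.estimate("#docker"))      # >= 5
```

The estimates can only exceed the true counts, never fall below them, which is exactly the over-estimation guarantee the minimum-across-rows argument above establishes.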
Accuracy guarantees. With d = log(1/δ) rows and w = e/ε columns, the estimated count is within ε × n of the true count with probability at least 1 − δ, where n is the total number of insertions. In practice: a sketch with w = 2000 columns and d = 7 rows uses roughly 112KB of memory (2000 × 7 × 8 bytes) and delivers less than 1% error for hashtag frequency estimates across a stream of tens of millions of events. A HashMap tracking exact counts for five million active hashtags would require several hundred MB to gigabytes of memory with significantly higher write contention.
Where Count-Min Sketch is deployed.
- Twitter/X trending topics: CMS tracks hashtag occurrence frequency in sliding five-minute windows. The top-K hashtags come from approximate counts, not exact tallies. Exact counting at this scale would require hundreds of GB of memory across the analytics tier.
- Akamai CDN: edge servers identify "heavy hitter" URLs — the pages generating disproportionate traffic — using Count-Min Sketch in megabyte-scale memory per edge node. Heavy hitters are then pre-cached aggressively.
- Apache Flink and Kafka Streams: both offer built-in sketching operators (`TopK` via approximate frequency) for streaming analytics pipelines.
- Network intrusion detection: high-throughput packet-processing systems use CMS to identify source IP addresses sending anomalous volumes of packets — a canonical "heavy hitter" problem — without storing a counter for every distinct IP seen.
🧠 Deep Dive: HyperLogLog and Cuckoo Filter Internals
HyperLogLog Internals: Counting Unique Visitors in 12KB
The cardinality problem. How many unique users visited this page today? How many distinct source IP addresses connected to this service in the last hour? How many unique product IDs appeared in this order stream? These are all cardinality estimation questions — counting the number of distinct elements in a multiset.
An exact answer requires storing every element seen: a HashSet of all visitor IDs. For 100 million daily unique visitors with 64-bit IDs, that is 800MB of memory minimum, not counting HashSet overhead. For a system running this query across thousands of pages simultaneously, exact cardinality tracking becomes prohibitively expensive.
The leading-zeros intuition. HyperLogLog exploits a statistical property of hash functions: if you hash each element to a uniformly random bit string, the probability that the hash starts with at least k leading zeros is 1/2^k. Intuitively, if the longest run of leading zeros you have ever observed is k = 20, you have probably seen around 2^20 ≈ one million distinct elements — because you had to see roughly that many before one of them produced a hash starting with twenty zeros.
In practice, a single leading-zeros counter has enormous variance (one lucky hash can wildly overestimate). HyperLogLog's contribution over earlier LogLog algorithms — hence "Hyper" — is to use multiple independent sub-streams (called registers) by splitting the hash into a routing prefix and a value. The cardinality estimate is computed as a harmonic mean across all registers, which dramatically reduces variance.
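A toy version of the register scheme makes the idea concrete. This sketch is illustrative only: it omits HyperLogLog's small- and large-range bias corrections and uses SHA-256 where real implementations use faster hashes. It routes each hash's top 10 bits to one of 1,024 registers and keeps, per register, the maximum leading-zero rank observed.

```python
import hashlib

M = 1024  # number of registers; standard error ≈ 1.04 / sqrt(M) ≈ 3.3%

def hll_estimate(items) -> float:
    registers = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> 54                       # top 10 bits: routing prefix -> register
        rest = h & ((1 << 54) - 1)          # remaining 54 bits: the value
        rank = 54 - rest.bit_length() + 1   # position of the first 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / M)        # bias-correction constant
    # Harmonic mean of 2^register across all registers, scaled by alpha * M^2.
    return alpha * M * M / sum(2.0 ** -r for r in registers)

est = hll_estimate(f"user_{i}" for i in range(100_000))
print(f"estimated ≈ {est:,.0f} (true: 100,000)")
```

Running this over 100,000 distinct strings typically lands within a few percent of the true count, consistent with the ~1.04/√M standard error; note that the loop stores only the 1,024 register values, never the elements themselves.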
Space efficiency. Redis ships HyperLogLog as a first-class data type using at most 12KB of memory (16,384 registers × 6 bits per register) for any cardinality up to 2^64. The standard error rate is ±0.81%. Compare this to exact cardinality:
| Method | Memory for 100M unique IDs | Error |
| --- | --- | --- |
| HashSet (exact) | ~800 MB | 0% |
| HyperLogLog | 12 KB | ±0.81% |
| Space ratio | 66,000× smaller | Bounded |
For analytics — page views, unique visitors, A/B experiment reach — a ±0.81% error is wholly acceptable. The 66,000× memory saving is not.
HyperLogLog also supports merging: you can compute separate HyperLogLog sketches for different time windows or data partitions and merge them into a combined estimate. This is what makes distributed cardinality estimation practical: shards compute local HyperLogLogs, and the coordinator merges them for a global estimate with the same ±0.81% error.
Cuckoo Filter: Fingerprint-Based Membership with Deletion Support
The deletion problem with Bloom filters. Every bit position in a Bloom filter may be shared by multiple inserted elements. If you set a bit back to 0 to "delete" an element, you might unset a bit that another element relies on — creating a false negative. This means Bloom filters are append-only. You can only add elements; you can never remove them.
For use cases where elements expire — a rate limiter tracking recent requests, a URL cache with TTL, a recently-seen deduplication window — you need deletion support without sacrificing the "definitely not present" guarantee.
A Cuckoo Filter solves this by storing short fingerprints (truncated hashes of elements, typically 8–16 bits) in a hash table rather than individual bits in a bit array. Each element has exactly two candidate bucket positions: h1(x) and h2(x). The fingerprint must reside in one of these two buckets.
Insert. Compute the fingerprint f = fingerprint(x) and the two candidate buckets h1(x) and h2(x). Place f in h1(x) if it has an empty slot. If h1(x) is full, displace the existing occupant (the "cuckoo" eviction, named after the cuckoo bird's nest-takeover behavior) and try to place it in its other candidate bucket. This can cascade, but converges with high probability.
Delete. Look up the fingerprint in h1(x) and h2(x) and remove it. Since fingerprints are stored explicitly (not merged into a shared bit array), deletion is safe — removing a fingerprint does not affect any other element's membership record.
Lookup. Check both candidate buckets for the fingerprint. If found, return "probably present." If neither bucket contains it, return "definitely not present."
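A minimal partial-key cuckoo filter can be sketched as follows. This is an illustration, not any specific library's implementation: bucket count, fingerprint width, and hash choices are arbitrary assumptions. The alternate bucket is computed as bucket XOR hash(fingerprint), which requires a power-of-two bucket count so that the same formula maps either bucket to its partner.

```python
import hashlib
import random

class CuckooFilter:
    def __init__(self, num_buckets: int = 1024, slots: int = 4, max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "bucket count must be a power of two"
        self.n, self.slots, self.max_kicks = num_buckets, slots, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        # 12-bit fingerprint in 1..4095 (never 0, so 0 can't mean "present").
        return int.from_bytes(hashlib.sha256(item.encode()).digest()[:2], "big") % 4095 + 1

    def _h1(self, item: str) -> int:
        return int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big") & (self.n - 1)

    def _alt(self, bucket: int, fp: int) -> int:
        # Partial-key cuckoo: alt(alt(b, fp), fp) == b when n is a power of two.
        h = int.from_bytes(hashlib.md5(str(fp).encode()).digest()[:4], "big")
        return (bucket ^ h) & (self.n - 1)

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        b1 = self._h1(item)
        b2 = self._alt(b1, fp)
        for b in (b1, b2):
            if len(self.buckets[b]) < self.slots:
                self.buckets[b].append(fp)
                return True
        # Both buckets full: evict an occupant and relocate it (the cuckoo kick).
        b = random.choice((b1, b2))
        for _ in range(self.max_kicks):
            i = random.randrange(len(self.buckets[b]))
            fp, self.buckets[b][i] = self.buckets[b][i], fp
            b = self._alt(b, fp)
            if len(self.buckets[b]) < self.slots:
                self.buckets[b].append(fp)
                return True
        return False  # table too full; caller must resize or fall back

    def lookup(self, item: str) -> bool:
        fp, b1 = self._fingerprint(item), self._h1(item)
        return fp in self.buckets[b1] or fp in self.buckets[self._alt(b1, fp)]

    def delete(self, item: str) -> bool:
        # Safe only for items known to have been inserted.
        fp, b1 = self._fingerprint(item), self._h1(item)
        for b in (b1, self._alt(b1, fp)):
            if fp in self.buckets[b]:
                self.buckets[b].remove(fp)
                return True
        return False

cf = CuckooFilter(num_buckets=1024)
cf.insert("https://example.com/page-42")
print(cf.lookup("https://example.com/page-42"))  # True
cf.delete("https://example.com/page-42")
print(cf.lookup("https://example.com/page-42"))  # False: deletion leaves no residue
```

The XOR trick is what makes eviction possible at all: given only a bucket index and the fingerprint stored there, the filter can compute the occupant's other candidate bucket without knowing the original element.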
Performance Analysis: Space-Error Trade-offs Across the Full Family
The table below summarizes the four structures across the key operational dimensions:
| Structure | Query type | False negatives | False positives | Deletion | Space at 1% FPR / 1% error | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| Bloom Filter | Membership | Never | Yes (tunable) | ❌ No | ~10 bits/element | Set membership, no deletes |
| Cuckoo Filter | Membership | Never | Yes (tunable) | ✅ Yes | ~12 bits/element | Set membership with expiry |
| Count-Min Sketch | Frequency count | Never (over-estimates) | N/A | ❌ No | Fixed: rows × cols × 8B | Frequency estimation in streams |
| HyperLogLog | Cardinality | N/A | N/A | ❌ No | 12KB fixed (Redis) | Unique element counting |
For target false positive rates below roughly 3%, Cuckoo filters are slightly more space-efficient than Bloom filters: their fingerprint-based encoding carries lower per-element overhead than a Bloom filter's bit array at tight error tolerances. Above ~3% FPR the advantage flips, and a well-sized Bloom filter uses less space.
📊 The Google Username Validation Flow: From Keystroke to Response
The following diagram traces the complete end-to-end flow from the user typing a username to the response being displayed — showing exactly where the Bloom filter intercepts the request, what happens on the happy path (no DB query), and what happens when the filter produces a "possibly taken" signal. The sequence also illustrates the Cassandra analogy: every node in a distributed read system runs the same filter-first logic at the storage layer.
```mermaid
flowchart TD
    A([Keystroke: user types 'john_smith_99']) --> B[Async HTTP request to /check-username]
    B --> C[Validation Service]
    C --> D{Bloom Filter pre-check\nin-memory bit array}
    D -->|Any of k bits = 0\nDefinitely NOT taken| E[Return 'available'\nLatency < 1ms\nNo DB query issued]
    D -->|All k bits = 1\nPossibly taken| F[Issue DB query\nSELECT COUNT FROM users\nWHERE username = ?]
    F --> G{Row found in DB?}
    G -->|Yes — username exists| H[Return 'unavailable'\nLatency 50–200ms]
    G -->|No — false positive| I[Return 'available'\nLatency 50–200ms\nBloom false positive resolved]
    style E fill:#22c55e,color:#fff
    style H fill:#ef4444,color:#fff
    style I fill:#22c55e,color:#fff
    style D fill:#3b82f6,color:#fff
```
The critical takeaway from this diagram: the Bloom filter diverts the vast majority of traffic at the top of the funnel — the left branch (< 1ms, no DB) handles perhaps 98–99% of new username attempts. The database only sees genuinely ambiguous cases, which means it handles a tiny fraction of the total query volume. This is the structural reason Google can serve username availability checks in tens of milliseconds at global scale without over-provisioning database capacity proportional to sign-up traffic.
🌍 Real-World Applications Across the Industry
Probabilistic structures appear in every layer of production infrastructure — from browser security to distributed storage to streaming analytics. The table below catalogs confirmed deployments by company and use case.
| Company / System | Use Case | Structure | Scale |
| --- | --- | --- | --- |
| Google Sign-Up | Username availability pre-check before DB query | Bloom Filter | ~3B usernames in the set |
| Apache Cassandra | SSTable skip before disk read on every node | Bloom Filter | One filter per SSTable, per node |
| Google Chrome | Malicious URL check on-device (Safe Browsing) | Bloom Filter | ~600M URLs distributed to each browser |
| Apache HBase | Region skip before HFile scan | Bloom Filter | Per HFile per region |
| Ethereum | Log event presence check in block headers | Bloom Filter | Per block in the chain |
| Redis (built-in) | Unique visitor / cardinality counting | HyperLogLog | Any cardinality up to 2^64, 12KB |
| Google Analytics, Mixpanel | Unique user/event counting per page | HyperLogLog | Billions of events, approximate counts |
| Apache Spark, Flink | approxCountDistinct() operator | HyperLogLog | Distributed, merges across partitions |
| Twitter/X | Trending hashtag frequency in 5-min windows | Count-Min Sketch | Millions of hashtags per hour |
| Akamai CDN | Heavy-hitter URL identification on edge nodes | Count-Min Sketch | Megabyte-scale per edge node |
| Kafka Streams | Approximate top-K frequency streaming analytics | Count-Min Sketch | Continuous streams |
| URL caches with TTL | Membership with expiry support | Cuckoo Filter | Any membership set requiring deletes |
Across this list, the common thread is scale combined with acceptable approximation. Every one of these systems has verified that the bounded error rate of the probabilistic structure is well within the tolerance of its use case — and that the memory and latency savings are orders of magnitude beyond what an exact structure could provide.
⚖️ Trade-offs and Failure Modes of Probabilistic Structures
Understanding when these structures fail is as important as knowing when to use them.
False positives are tunable but non-zero. Every Bloom and Cuckoo filter has a false positive rate determined by its size relative to the number of elements inserted. If you insert significantly more elements than the filter was designed for, the FPR rises sharply. A Bloom filter configured for 1 million elements at 1% FPR that receives 10 million insertions may rise to 70%+ FPR — at which point it is worse than useless. Mitigation: measure insertion volume and re-provision or rotate filters accordingly.
Count-Min Sketch only over-estimates. The minimum-across-rows guarantee means frequency estimates are always ≥ the true count. For tail items — elements that appear very rarely in a stream — a small sketch may assign them inflated counts due to collisions with high-frequency items. Mitigation: use a wider sketch (more columns) to reduce collision probability for tail items, or pair with an exact counter for items above a certain frequency threshold.
HyperLogLog cannot be reduced in cardinality. Once you have added elements to a HyperLogLog, you cannot remove them. If you need a sliding-window unique visitor count (e.g., unique users in the last 24 hours), you must maintain separate HyperLogLogs per time bucket and merge them — not modify a single shared one. Mitigation: bucket per time period (hourly or daily HyperLogLogs), merge for aggregates, discard expired buckets.
Bloom filters are write-once. No deletion, no updates. If you need to rotate which elements are "in the set" (e.g., a recent-events deduplication window), you must either use a Cuckoo filter or maintain two Bloom filters (an active and a staged one, swapping periodically — the double-buffering pattern). Mitigation: use Cuckoo filters when deletion is required; use double-buffering for Bloom filters with TTL semantics.
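The double-buffering pattern is simple enough to sketch. In this illustrative Python version, plain sets stand in for the two Bloom filters (the swap logic is identical; only the membership structure would change), and items effectively expire after one to two window lengths.

```python
import time

class RotatingSeenFilter:
    """Approximate TTL semantics from two append-only 'filters':
    write to both, query the active one, swap every `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.active: set[str] = set()   # queried; holds the last 1-2 windows
        self.staged: set[str] = set()   # becomes active at the next swap
        self.last_swap = time.monotonic()

    def _maybe_swap(self) -> None:
        if time.monotonic() - self.last_swap >= self.window:
            # Staged becomes active; the old active filter (and everything
            # older than ~2 windows) is discarded wholesale. No per-element
            # deletion is ever needed.
            self.active, self.staged = self.staged, set()
            self.last_swap = time.monotonic()

    def add(self, item: str) -> None:
        self._maybe_swap()
        self.active.add(item)
        self.staged.add(item)  # write to both so a swap loses nothing recent

    def seen(self, item: str) -> bool:
        self._maybe_swap()
        return item in self.active

dedup = RotatingSeenFilter(window_seconds=300)  # 5-minute rotation window
dedup.add("event:42")
print(dedup.seen("event:42"))  # True
print(dedup.seen("event:99"))  # False
```

Discarding a whole filter sidesteps the shared-bit problem entirely, at the cost of coarse expiry: an item added just before a swap survives almost two full windows.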
Cuckoo insert can fail. In rare cases, the cuckoo eviction chain cycles without finding an empty slot (fill factor exceeds ~95%). The insert fails and must fall back to an exact structure or trigger a resize. Mitigation: keep fill factor below 90% and monitor insertion failure rates.
| Failure Mode | Structure | Symptoms | Mitigation |
| --- | --- | --- | --- |
| FPR blowout from over-insertion | Bloom / Cuckoo | False positives spike; too many DB fallbacks | Size for peak + headroom; rotate filters |
| Tail item count inflation | Count-Min Sketch | Low-frequency items over-counted | Wider sketch (more columns) |
| No sliding-window support | HyperLogLog | Stale cardinality counts | Per-bucket sketches; merge for aggregates |
| Deletion not possible | Bloom Filter | Cannot expire set membership | Switch to Cuckoo Filter or double-buffering |
| Insert failure at high load | Cuckoo Filter | Occasional insert rejection | Keep fill < 90%; monitor failure rate |
🧭 Decision Guide: Choosing the Right Probabilistic Structure
The flowchart below provides a decision path based on the type of query you need to answer. Walk through it from the top: start with what kind of answer you need, then factor in whether deletion matters, and you arrive at the correct structure.
```mermaid
flowchart TD
    A([What do you need to answer?]) --> B{Is this a membership query?\n'Is X in this set?'}
    B -->|No| C{Do you need to count\nhow often X occurs?}
    B -->|Yes| D{Do elements ever\nneed to be deleted?}
    C -->|Yes — frequency estimation| E[Count-Min Sketch\nFixed 2D counter array\nMin-across-rows estimate]
    C -->|No — unique element count| F[HyperLogLog\n12KB fixed memory\n±0.81% error at any scale]
    D -->|No — append-only set| G{Is a tight FPR\nbelow ~3% required?}
    D -->|Yes — elements expire or\nare removed| H[Cuckoo Filter\nFingerprint table\nDeletion supported]
    G -->|No — looser FPR acceptable| I[Bloom Filter\nSimple, widely deployed\nSpace-efficient at looser FPR]
    G -->|Yes| J[Cuckoo Filter\nMore space-efficient\nat FPR below ~3%]
    style E fill:#8b5cf6,color:#fff
    style F fill:#0ea5e9,color:#fff
    style H fill:#f59e0b,color:#fff
    style I fill:#22c55e,color:#fff
    style J fill:#f59e0b,color:#fff
```
Walk through this flowchart by asking one question at a time. If you need to answer "is X in this set?" at all, you are in the left branch — then the only remaining question is whether elements ever leave the set (choose Cuckoo Filter) or whether the set is append-only (choose Bloom or Cuckoo based on FPR target). If you are not doing membership queries at all, you need either frequency counting (Count-Min Sketch) or cardinality counting (HyperLogLog).
A consolidated decision table:
| Question to answer | Use this structure | Key reason |
| --- | --- | --- |
| "Is X in this set?" — append-only, looser FPR (above ~3%) acceptable | Bloom Filter | Simple and space-efficient at looser FPR targets |
| "Is X in this set?" — elements can be deleted | Cuckoo Filter | Only set-membership structure with safe deletion |
| "Is X in this set?" — tight FPR (below ~3%) required | Cuckoo Filter | Better space efficiency at tight error tolerances |
| "How often has X occurred in this stream?" | Count-Min Sketch | Bounded over-estimation, fixed memory |
| "How many unique elements have I seen?" | HyperLogLog | 12KB for any cardinality, ±0.81% error |
🧪 The Cuckoo Filter in Practice: URL Deduplication with Expiry
To make the Cuckoo Filter concrete, consider a web crawler that needs to avoid revisiting URLs it has already processed in the last 24 hours. A naive HashSet of visited URL strings is unbounded — a large crawl may process hundreds of millions of URLs, and keeping them all in memory is impractical. A Bloom filter would work for the "have I seen this URL?" check, but it cannot expire old URLs: once a URL is inserted, its bits stay set permanently, causing the filter to grow stale and over-report "already visited" for URLs that have aged out of the 24-hour window.
The Cuckoo Filter fits perfectly here:
- Insert a URL when first crawled: `cuckoo.insert("https://example.com/page-42")` — stores a 12-bit fingerprint in one of two candidate buckets.
- Check before recrawling: `cuckoo.lookup("https://example.com/page-42")` — returns true (probably visited) or false (definitely not visited).
- Expire a URL after 24 hours: `cuckoo.delete("https://example.com/page-42")` — finds and removes the fingerprint. The slot is now free; the filter no longer "knows" about this URL.
- After deletion, the URL is treated as unvisited: the next `lookup` returns false, and the crawler re-queues it correctly.
This "insert on visit, delete after TTL, check before crawl" pattern is impossible with a Bloom filter. It works cleanly with a Cuckoo filter because deletion leaves no residue — no shared bits to corrupt, no false negatives introduced.
The same pattern applies to rate limiters tracking recent request keys, session deduplication windows, and recently-seen message deduplication in event pipelines. Any use case that phrases as "is this item in the active window?" — where items enter and exit the window — is a Cuckoo Filter use case.
🛠️ Redis: Probabilistic Structures as First-Class Data Types
Redis ships three probabilistic structures as native data types, making it the easiest path to production-grade approximate data without implementing these algorithms from scratch. Two are available through the RedisBloom module (included in Redis Stack); one is built into Redis core.
Bloom Filter via RedisBloom:
```
BF.RESERVE usernames 0.001 1000000
BF.ADD usernames "alice@example.com"
BF.EXISTS usernames "alice@example.com"     → 1 (probably exists)
BF.EXISTS usernames "newuser@example.com"   → 0 (definitely not exists)
```
BF.RESERVE creates a Bloom filter with the specified false positive rate (0.001 = 0.1%) and expected element capacity (1,000,000). BF.ADD inserts an element. BF.EXISTS performs the membership check — returning 1 for "probably present" or 0 for "definitely not present."
HyperLogLog (built into Redis core):
```
PFADD page:views:2024-01-15 user_123 user_456 user_789
PFCOUNT page:views:2024-01-15
→ (integer) 3
PFMERGE users:january users:2024-01-01 users:2024-01-02
PFCOUNT users:january
→ (integer) 4729183
```
PFADD registers elements into a HyperLogLog (the PF prefix honours Philippe Flajolet, the algorithm's inventor). PFCOUNT returns the approximate unique count with ±0.81% error. PFMERGE combines multiple HyperLogLogs into one — enabling per-day sketches to be merged into a monthly unique visitor count without re-scanning the raw data.
Count-Min Sketch via RedisBloom:
```
CMS.INITBYDIM trending 2000 7
CMS.INCRBY trending "kubernetes" 1
CMS.INCRBY trending "docker" 5
CMS.QUERY trending "kubernetes"
→ (integer) 1
CMS.QUERY trending "docker"
→ (integer) 5
```
CMS.INITBYDIM creates a Count-Min Sketch with explicit dimensions (width=2000 counters, depth=7 rows). CMS.INCRBY increments the frequency count for an element. CMS.QUERY returns the estimated frequency (always ≥ true count due to the over-estimation property).
For a full deep-dive on Redis as a caching and data structure platform, see the companion Redis Sorted Sets Explained post.
📚 Lessons Learned
Exact data structures don't scale to membership, frequency, or cardinality queries at billion-element scale. A HashSet storing every taken username needs tens of gigabytes; a Bloom filter storing the same set needs about a gigabyte. At some point, the exact approach requires hardware that the approximate approach avoids entirely.
The "definitely not" guarantee is worth everything. Bloom filters cannot produce false negatives — once an element is inserted, it will always be found. This asymmetric error profile is what makes them safe to use as pre-screeners: the consequence of a false positive is one unnecessary database query; the consequence of a false negative would be a correctness violation.
False positive rate is tunable and should be sized for the use case. A 1% FPR means roughly one in a hundred membership queries triggers an unnecessary DB lookup. For Google's sign-up flow, that is an acceptable cost. For Cassandra's SSTable check, where a false positive means an unnecessary disk read, a lower FPR (and slightly more memory) may be worth the trade.
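Sizing for a target FPR is a closed-form calculation, using the standard Bloom filter formulas m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A quick worked example for the BF.RESERVE parameters used earlier (1 million elements, 0.1% FPR):

```python
import math

def bloom_params(n: int, p: float):
    """Optimal bit-array size m and hash-function count k for n elements at FPR p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_params(n=1_000_000, p=0.001)
print(m, k)           # ~14.4 million bits (~1.8 MB) and 10 hash functions
print(m / 1_000_000)  # ~14.4 bits per element
```

The rule of thumb that falls out: each 10x reduction in FPR costs only about 4.8 additional bits per element, which is why very low FPRs are affordable when a false positive translates into a disk read.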
Count-Min Sketch always over-estimates, never under-estimates. This is safe for use cases like trending topics (slightly inflated counts are fine) but may be inappropriate where under-counting would be preferable (e.g., billing — you would rather over-charge than under-charge).
HyperLogLog's 12KB-for-any-cardinality is not magic — it is a bounded error guarantee. The ±0.81% error rate is a statistical property, not a guarantee on any individual count. For very small cardinalities (fewer than ~1000 elements), the estimator's relative error can be higher; Redis sidesteps this with a sparse representation that keeps small-set counts accurate, switching to the dense 12KB representation only as the set grows.
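Both headline numbers are derivable rather than marketing: the standard error of a HyperLogLog with m registers is 1.04/√m, and Redis uses m = 2^14 registers of 6 bits each:

```python
import math

m = 16_384                      # Redis HyperLogLog register count (2^14)
std_error = 1.04 / math.sqrt(m)
print(f"{std_error:.4%}")       # 0.8125% -- the advertised ±0.81%
print(m * 6 / 8 / 1024)         # 12.0 -- kilobytes at 6 bits per register
```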
Cuckoo filters are the right default when deletion will be needed. The implementation complexity over a Bloom filter is modest, and the deletion support avoids an entire class of operational problems (stale set membership, unbounded filter growth) that append-only Bloom filters require workarounds for.
All of these structures can be deployed without custom implementations. Redis, Apache Cassandra, Apache Spark, and Flink ship them as standard building blocks. Use the production-tested implementations; the subtle correctness properties (especially cuckoo eviction termination, HyperLogLog harmonic mean bias correction, and CMS hash independence) are non-trivial to get right in a handwritten implementation.
📌 Summary & Key Takeaways
Probabilistic data structures are the answer to a class of problems that exact data structures handle poorly at scale: membership queries over billions of elements, frequency estimation over unbounded streams, and cardinality counting without storing every element seen. The trade is always a small, mathematically bounded error rate in exchange for dramatic reductions in memory use and constant-time operations.
| Structure | Core guarantee | Cannot do | Sweet spot |
| --- | --- | --- | --- |
| Bloom Filter | "Definitely not in this set" — zero false negatives | Delete, count, retrieve | Large append-only sets |
| Cuckoo Filter | "Definitely not" + deletion support | Count, retrieve | Sets where items expire |
| Count-Min Sketch | Frequency estimate ≥ true count | Exact counts, membership | Streaming frequency analytics |
| HyperLogLog | Cardinality ± 0.81% in 12KB | Exact counts, membership, list items | Unique visitor / event counting |
Google's username validation flow illustrates the decisive value: the Bloom filter eliminates 99%+ of unnecessary database queries by providing a "definitely not taken" pre-screen in under one millisecond. The database is consulted only when the filter returns "possibly taken," and even that signal is wrong some small percentage of the time (false positives) — which is fine, because the consequence is just one extra DB query, not a correctness failure.
Apply these structures when you recognize the following patterns:
- Set membership on a large corpus → Bloom Filter (append-only) or Cuckoo Filter (deletions needed).
- Frequency counting in a high-throughput stream → Count-Min Sketch.
- Unique element counting across large or distributed datasets → HyperLogLog.
📝 Practice Quiz
Test your understanding of probabilistic data structures. Questions 1–4 have a correct answer; question 5 is open-ended.
Q1. Why can a Bloom filter never produce a false negative?
- A) Because it uses multiple hash functions
- B) Because bits in the array can only be set to 1, never back to 0
- C) Because it stores the actual elements in sorted order
- D) Because it always queries the database as a fallback
✅ Correct Answer: B — Inserting an element sets k bit positions to 1. Because bits never revert to 0, those positions will always be 1 on a subsequent lookup — making it impossible for the filter to return "definitely not present" for an item that was actually inserted.
Q2. You're building a trending topics feature that processes 1 million tweets per minute. Which structure do you use to count hashtag frequency?
- A) HashSet — exact counts are necessary for trending
- B) HyperLogLog — it counts unique occurrences
- C) Bloom Filter — it checks if a hashtag has been seen
- D) Count-Min Sketch — bounded over-estimation of frequency in a fixed-size array
✅ Correct Answer: D — Count-Min Sketch is designed for frequency estimation in high-throughput streams. It uses a fixed 2D counter array regardless of how many distinct hashtags appear. HyperLogLog counts unique elements (cardinality), not frequency per element.
Q3. Chrome's Safe Browsing Bloom filter stores 600 million URLs at 0.1% FPR. Using ~15 bits/element, approximately how much memory does it require?
600,000,000 × 15 bits / 8 ≈ 1.1 GB
✅ Correct Answer: ~1.1 GB — Storing all 600M URLs as exact 64-bit hashes would need ~4.8 GB. The Bloom filter delivers the same membership check at ~1.1 GB, distributed to every Chrome browser and checked locally without any network round-trip.
Q4. What happens if you try to delete an element from a Bloom filter?
- A) The element's bits are set back to 0, successfully removing it
- B) A false negative is created — the element still appears to be present
- C) Setting bits back to 0 may clear bits shared by other elements, creating false negatives for those elements
- D) Nothing — Bloom filters support deletion natively
✅ Correct Answer: C — Bits are shared: multiple inserted elements may set the same bit position. Clearing bits for one element could clear a bit that is also part of another element's fingerprint, causing a false negative for that other element. This is why Bloom filters are append-only. Use a Cuckoo Filter when deletion is required.
Q5. (Open-ended) Google's username validation returns "available" in under 5ms and "probably taken" in ~200ms. Explain the full architecture that produces this latency difference, and describe two failure modes that could degrade performance even with a Bloom filter in place.
No single correct answer — challenge yourself to reason through the architecture. Consider: What causes the two latency profiles? What happens to FPR as more usernames are added without resizing the filter? What happens if the Bloom filter is evicted from memory and must be reloaded from disk? How do you update the filter when usernames are deleted (accounts closed)?
🔗 Related Posts
- How Bloom Filters Work — Deep dive into Bloom filter internals: bit array sizing, optimal hash function count, and FPR tuning formulas
- How Kafka Works — Kafka's log compaction and segment indexing: another system where probabilistic structures reduce I/O
- Consistent Hashing Explained — The probabilistic technique for distributing data across nodes without catastrophic remapping
- Hash Tables Explained — The exact-structure foundation underlying all of the probabilistic structures in this post

Written by
Abstract Algorithms
@abstractalgorithms
