ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond
How to choose the right uniqueness strategy — from 6-char short URLs to distributed int64 order IDs
TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy causes B-tree fragmentation at 50M+ rows, production collisions, or IDs that leak your business metrics to competitors.
📖 The Uniqueness Problem Hiding Inside Every Production System
You have 500 million shortened URLs. Every ID collision is a production incident — the wrong page opens for someone clicking a shared link. How do you guarantee uniqueness at scale while keeping IDs short enough for humans to share?
This is not a niche URL shortener problem. The same decision shows up everywhere:
- User IDs in a global SaaS product must be unique across data centers with no central coordinator.
- Order IDs in a payments ledger must be time-sortable so audit queries stay fast.
- Session tokens must be unpredictable so attackers cannot guess or enumerate them.
- Distributed trace IDs must be generated at microsecond throughput with zero coordination overhead.
- Content cache keys must be deterministic — same content, same key — so deduplication is automatic.
Each use case has a different constraint set, and each constraint set has a best-fit strategy. Defaulting to "UUID" everywhere, or keeping auto-increment because it is simple, eventually causes a real problem at scale.
The five decision axes that drive every ID strategy choice are:
| Axis | What it controls | Concrete constraint example |
|------|------------------|-----------------------------|
| Length | How many characters can the ID use? | URL shortener: ≤ 8 chars |
| Charset | Which characters are safe in the context? | URL-safe, case-insensitive, no ambiguous chars |
| Sortability | Must IDs sort by creation time? | Order IDs: yes. Session tokens: never |
| Generation model | Can a central coordinator exist? | Global distributed systems: coordinator is a bottleneck |
| Collision guarantee | Probabilistic or structural uniqueness? | Payment IDs: must be structurally collision-free |
Here is a quick decision matrix across the strategies this post covers:
| Strategy | Short IDs | Sortable | Fully Distributed | Collision-Free | URL-Safe |
|----------|-----------|----------|-------------------|----------------|----------|
| Auto-increment + Base62 | ✅ | ✅ | ❌ (needs ticket server) | ✅ | ✅ |
| UUID v4 | ❌ (36 chars) | ❌ | ✅ | ✅ (probabilistic) | ⚠️ |
| UUID v7 | ❌ (36 chars) | ✅ | ✅ | ✅ (probabilistic) | ⚠️ |
| Snowflake | ✅ (int64) | ✅ | ✅ | ✅ (structural) | ✅ |
| ULID | ⚠️ (26 chars) | ✅ | ✅ | ✅ (probabilistic) | ✅ |
| Base62 hash + Bloom filter | ✅ | ❌ | ✅ | ✅ (with check) | ✅ |
| NanoID | ⚠️ (21 chars) | ❌ | ✅ | ✅ (probabilistic) | ✅ |
🔍 What "Base" Actually Means — Encoding Numbers Into Shorter Strings
"Base encoding" sounds cryptographic, but the concept is simple: you have a number, and you want to represent it using a given set of characters. The larger your character set, the fewer characters you need for the same numeric value.
Think of it like counting in different number systems. Decimal (base-10) needs more digits than hexadecimal (base-16) to represent large numbers. Base62 (62 characters) needs even fewer digits because each position carries more information.
Base16 (Hex) — the verbose universal baseline. The alphabet is 0–9, a–f — 16 characters. One hex digit represents 4 bits. A 64-bit number needs 16 hex characters. UUIDs are 128 bits → 32 hex chars (with four hyphens → 36). Verbose, but supported everywhere.
Base32 (RFC 4648) — designed for checksums and DNS. Uses A–Z, 2–7 — 32 characters. Deliberately avoids 0 and 1 (too similar to O and I). Used in TOTP secrets (Google Authenticator), AWS resource IDs, and ULID's Crockford variant. Safe for case-folding environments.
Base36 (0–9, a–z) — fully case-insensitive. 36 characters. Safe across all systems that normalize to uppercase or lowercase. 6 chars → 36^6 ≈ 2.2 billion combinations. The right choice when IDs might be spoken aloud, typed manually, or passed through case-insensitive routing layers.
Base58 — Base62 minus the four visually ambiguous characters. Removes 0 (zero), O (capital-oh), I (capital-i), and l (lowercase-L). 58 characters. Used by Bitcoin addresses and Flickr short URLs. Produces slightly longer IDs than Base62 for the same namespace, but is the correct choice for any code a human will manually type — gift cards, referral codes, license keys.
Base62 (0–9, a–z, A–Z) — the standard URL shortener alphabet. 62 characters, case-sensitive. 6 chars → 62^6 ≈ 56 billion combinations. bit.ly, TinyURL, and YouTube video IDs all use Base62 or a close variant. The tradeoff is case-sensitivity — it breaks in systems that fold case, and in some email clients that linkify URLs differently from what is intended.
| Encoding | Alphabet size | Ambiguous chars | Chars for 2^32 | Chars for 2^64 | Primary use case |
|----------|---------------|-----------------|----------------|----------------|------------------|
| Base16 (Hex) | 16 | None | 8 | 16 | UUIDs, SHA hashes |
| Base32 | 32 | None (by design) | 7 | 13 | TOTP secrets, AWS IDs |
| Base36 | 36 | None | 7 | 13 | Human-readable codes |
| Base58 | 58 | None (removed) | 6 | 11 | Bitcoin addresses, referral codes |
| Base62 | 62 | 0/O, I/l | 6 | 11 | URL shorteners, video IDs |
The practical takeaway: if humans will type the code, use Base58. If the ID is machine-copied (clicked links, API calls), use Base62 for the denser namespace. If any layer in your stack is case-insensitive, use Base36.
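The positional encoding all of these bases share can be sketched in a few lines of Java. The alphabet ordering below (digits, then lowercase, then uppercase) is one common Base62 convention — other alphabets produce different strings for the same number, so the exact outputs are specific to this choice:

```java
// Minimal positional base encoder/decoder. Swap the alphabet string for
// Base36 or Base58 behavior; the algorithm is identical.
public class BaseCodec {
    static final String BASE62 =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String encode(long value, String alphabet) {
        int base = alphabet.length();
        if (value == 0) return String.valueOf(alphabet.charAt(0));
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(alphabet.charAt((int) (value % base)));  // least significant digit first
            value /= base;
        }
        return sb.reverse().toString();                        // flip to most-significant-first
    }

    static long decode(String s, String alphabet) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = value * alphabet.length() + alphabet.indexOf(c);
        }
        return value;
    }

    public static void main(String[] args) {
        long id = 12_345_678L;
        String shortId = encode(id, BASE62);
        System.out.println(id + " -> " + shortId + " -> " + decode(shortId, BASE62));
    }
}
```

With this alphabet, 12,345,678 encodes to a 4-character string — the density the table above predicts for values under 2^32.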
⚙️ How Each ID Strategy Generates Uniqueness at Scale
Auto-Increment + Base Encoding
The database sequence generates a monotonically increasing integer. The application encodes it in Base62 (or Base58) to produce a short, human-friendly string. Used by bit.ly and early YouTube. 6 chars of Base62 gives 56 billion unique IDs from a single counter.
The fundamental weakness is the single point of failure: if the counter database goes down, ID generation stops. Sequential IDs also leak business scale — a competitor can estimate your total URL count by probing bit.ly/1 through bit.ly/xyz123. The production fix is the Flickr Ticket Server pattern: a dedicated database pre-allocates ID ranges (e.g., app-server-1 gets 1–100,000; app-server-2 gets 100,001–200,000), allowing parallel ID generation without per-request DB coordination.
UUID v4 — Random, Globally Unique, Coordination-Free
A UUID v4 is a 128-bit value with 122 random bits, formatted as xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx. The collision probability among 1 billion randomly generated UUIDs is on the order of 10⁻¹⁹ — effectively zero, because the birthday bound on a 2^122 namespace only bites around 2^61 (≈2.3 quintillion) IDs. UUID v4 is the right choice when global uniqueness is paramount, IDs must be completely opaque, and no coordination between generators is possible.
The severe tradeoff is B-tree fragmentation at scale. UUID v4 is random, which means every insert targets a random position in the database index. At 500 million rows, almost every insert page-faults (the target leaf node is not in the buffer pool). This makes UUID v4 a serious performance hazard as a primary key for high-volume tables.
UUID v7 — Time-Ordered UUID Without the Performance Penalty
UUID v7 places a 48-bit Unix millisecond timestamp in the leading bits, followed by 74 bits of randomness. This makes UUID v7 lexicographically sortable — new records always insert near the rightmost leaf of the B-tree, preserving sequential locality. The collision probability remains negligible. RFC 9562 (2024) standardized UUID v7 and it is now the recommended UUID version for most database primary key use cases where you would otherwise use UUID v4.
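As a sketch of the RFC 9562 layout, here is roughly how a v7 UUID can be assembled in Java, which has no built-in v7 factory as of JDK 21. The field boundaries follow the RFC (48-bit ms timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits); this is an illustrative construction, not a vetted library:

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hand-assembled UUID v7 per the RFC 9562 bit layout.
public class UuidV7 {
    static final SecureRandom RNG = new SecureRandom();

    static UUID next(long unixMillis) {
        long msb = (unixMillis << 16)              // 48-bit timestamp in the top bits
                 | 0x7000L                         // version nibble = 7
                 | (RNG.nextLong() & 0x0FFFL);     // 12 random bits (rand_a)
        long lsb = (RNG.nextLong() & 0x3FFFFFFFFFFFFFFFL)
                 | 0x8000000000000000L;            // variant bits = 10, then 62 random bits (rand_b)
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = next(System.currentTimeMillis());
        System.out.println(id + "  version=" + id.version());
    }
}
```

Because the timestamp occupies the most significant bits, two v7 UUIDs generated a millisecond apart sort correctly as plain byte strings — which is exactly what the B-tree needs.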
Snowflake ID — 64-Bit, Time-Ordered, No Coordinator Needed
Twitter invented Snowflake in 2010 to generate IDs at millions per second across thousands of servers with zero inter-node coordination. A Snowflake ID packs three independent fields into a single 64-bit integer:
- 41 bits — Unix timestamp in milliseconds since a custom epoch (~69 years of ID space)
- 10 bits — machine/node ID (1,024 nodes supported)
- 12 bits — per-millisecond sequence counter per node (4,096 IDs per ms per node)
Across 1,024 nodes, that is roughly 4 million IDs per millisecond globally, with no shared state. IDs are naturally time-sorted, fit in a 64-bit integer (8 bytes — compact, fast to index), and are used by Twitter, Discord, Instagram, and LinkedIn in modified forms.
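The three-field packing is plain bit arithmetic. A minimal Java sketch, using the classic 41/10/12 widths and Twitter's original custom epoch (any fixed epoch works — the value below is an illustrative choice):

```java
// Packs the three Snowflake fields into one int64 and unpacks them again.
public class SnowflakeLayout {
    static final long EPOCH = 1_288_834_974_657L;  // Twitter's original custom epoch (ms)

    static long pack(long timestampMs, long nodeId, long sequence) {
        return ((timestampMs - EPOCH) << 22)   // 41 bits of relative ms: bits 62..22
             | (nodeId << 12)                  // 10 bits: bits 21..12
             | sequence;                       // 12 bits: bits 11..0
    }

    static long[] unpack(long id) {
        long ts   = (id >>> 22) + EPOCH;       // recover the absolute timestamp
        long node = (id >>> 12) & 0x3FF;       // mask the 10 node bits
        long seq  = id & 0xFFF;                // mask the 12 sequence bits
        return new long[] { ts, node, seq };
    }

    public static void main(String[] args) {
        long id = pack(1_700_000_000_000L, 42, 7);
        long[] f = unpack(id);
        System.out.println("ts=" + f[0] + " node=" + f[1] + " seq=" + f[2]);
    }
}
```

Note that the timestamp sits in the highest bits, so numeric ordering of the int64 IDs is creation-time ordering for free.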
ULID — UUID Replacement With Built-In Sortability
ULID (Universally Unique Lexicographically Sortable Identifier) combines 48 bits of millisecond timestamp with 80 bits of randomness, encoded in Crockford Base32 to produce a 26-character string. It is URL-safe, case-insensitive, and lexicographically sortable. Designed as a drop-in UUID replacement for systems that need sortability in a human-readable string format without switching to an integer type.
NanoID — Compact, Cryptographically Random, URL-Safe
NanoID generates cryptographically secure random IDs using a configurable alphabet and length. The default output is 21 URL-safe characters from a 64-symbol alphabet (~126 bits of entropy — slightly more than UUID v4's 122 random bits, in a much shorter string). The tradeoff: NanoID is not sortable, requires an external library, and is not natively supported by database ID generation functions.
Hash-Based IDs — Deterministic Content-Addressed Keys
Take content (a URL, a file, a payload), compute SHA-256, truncate to N bytes, encode in Base62. This is deterministic: the same input always produces the same ID. Free deduplication — if you store the same URL twice, you compute the same ID and find the existing record without a lookup. Used in Git object IDs and CDN cache keys. Not suitable as a primary key without a unique constraint check — truncated hashes eventually collide as the namespace fills.
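A minimal Java sketch of the idea — SHA-256, truncate to 8 bytes, Base62-encode. The truncation width and alphabet here are illustrative choices, not a standard:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Deterministic content-addressed ID: same input always yields the same key,
// so duplicate content deduplicates automatically.
public class ContentId {
    static final String BASE62 =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String idFor(String content) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                                   .digest(content.getBytes(StandardCharsets.UTF_8));
        long n = 0;
        for (int i = 0; i < 8; i++) {          // keep only the first 8 bytes of the digest
            n = (n << 8) | (hash[i] & 0xFF);
        }
        n &= Long.MAX_VALUE;                    // drop the sign bit so the division below is clean
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(BASE62.charAt((int) (n % 62)));
            n /= 62;
        } while (n > 0);
        return sb.reverse().toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(idFor("https://example.com/some/long/url"));
    }
}
```

Because the ID is 63 bits after truncation, the birthday bound applies: expect collisions to become a real possibility in the billions of records — hence the unique-constraint caveat above.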
Sonyflake and KSUID — Snowflake Variants
Sonyflake is Sony's 63-bit Snowflake variant with a longer timestamp range and a simpler bit layout targeting single-datacenter deployments. KSUID (K-Sortable Unique ID) uses a 32-bit Unix timestamp in seconds (coarser than milliseconds) plus 128 bits of random payload, encoded in Base62 to produce 27 characters. KSUID trades timestamp resolution for a larger random component, making it appropriate for audit logs and event streams where second-level sorting is sufficient.
🧠 Inside Snowflake: Why a 64-Bit Integer Achieves Global Uniqueness Without a Central Database
The Internals: Bit Layout, Node Assignment, and Clock Skew
Snowflake's uniqueness guarantee comes from combining three orthogonal sources of differentiation into a single int64. The bit layout:
```
 bit 63 | bits 62–22            | bits 21–12       | bits 11–0
 sign=0 | 41-bit timestamp (ms) | 10-bit node ID   | 12-bit sequence
        |                       | (0–1023)         | (0–4095)
```
Within each millisecond, the sequence counter increments per ID generated on a given node. When the sequence counter overflows 4,095 within one millisecond, the generator waits for the clock to tick forward. Node IDs are assigned at service startup — typically via ZooKeeper, Kubernetes pod ordinals, or a startup-time registration with a coordination service.
The critical vulnerability is clock skew. If a server's clock moves backward (NTP correction, leap second, VM migration), and the Snowflake generator does not detect it, two different IDs generated at the "same" timestamp on the same node could collide if the sequence counter has reset. Production Snowflake implementations detect clock regression and refuse to issue IDs until the clock catches up — they block rather than risk a duplicate.
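A sketch of that generator loop in Java, with the sequence-overflow wait and the clock-regression guard. Throwing on regression keeps the sketch short — production implementations often block until the clock catches up instead. The injectable clock is an assumption made here so the regression path is easy to exercise:

```java
import java.util.function.LongSupplier;

// Per-node Snowflake-style generator: 12-bit per-ms sequence, spin-wait on
// overflow, refuse to issue IDs while the clock reads earlier than the last
// issued timestamp.
public class SnowflakeGenerator {
    private final long nodeId;
    private final LongSupplier clock;          // e.g. System::currentTimeMillis
    private long lastTs = -1L;
    private long sequence = 0L;

    SnowflakeGenerator(long nodeId, LongSupplier clock) {
        this.nodeId = nodeId;
        this.clock = clock;
    }

    synchronized long nextId() {
        long ts = clock.getAsLong();
        if (ts < lastTs) {                      // clock moved backward: never risk a duplicate
            throw new IllegalStateException("clock regression: " + (lastTs - ts) + " ms");
        }
        if (ts == lastTs) {
            sequence = (sequence + 1) & 0xFFF;  // 12-bit counter wraps at 4096
            if (sequence == 0) {                // 4096 IDs already issued this ms: wait for next ms
                while ((ts = clock.getAsLong()) <= lastTs) { /* busy-wait */ }
            }
        } else {
            sequence = 0;
        }
        lastTs = ts;
        return (ts << 22) | (nodeId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeGenerator gen = new SnowflakeGenerator(1, System::currentTimeMillis);
        System.out.println(gen.nextId() + ", " + gen.nextId());
    }
}
```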
Performance Analysis: Why Snowflake and UUID v7 Outperform UUID v4 at 500M Rows
The B-tree index underpins primary key lookups in every major relational database. A B-tree is a sorted structure. Inserting a random UUID v4 means the database must find the correct leaf node for that value — which could be anywhere in the tree. At scale, nearly every insert page-faults because the target leaf node is cold in the buffer pool.
Snowflake IDs and UUID v7 are monotonically increasing within millisecond granularity. Every new record inserts at the rightmost leaf of the B-tree. That page is hot. Inserts become sequential writes — far more efficient on both spinning disks and NVMe SSDs.
| Metric | UUID v4 | Snowflake / UUID v7 |
|--------|---------|----------------------|
| Insert pattern | Random B-tree page (anywhere in tree) | Append to rightmost leaf |
| Buffer pool locality | Low — most inserts are cold misses | High — rightmost leaf always hot |
| Index fragmentation after 500M rows | Heavy — many half-filled pages | Low — compact sequential structure |
| Typical insert throughput (500M rows, PostgreSQL) | ~20–30K/s | ~80–100K/s |
| ID storage size | 16 bytes (UUID) | 8 bytes (int64 Snowflake) |
This is not a theoretical concern. Percona, Uber Engineering, and PlanetScale have all published post-mortems on UUID v4 index fragmentation causing production degradation. It was one of the primary motivations behind UUID v7's standardization.
📊 Visualizing the URL Shortener Write Path with Bloom Filter Guard
When a user submits a long URL, the system must generate a short ID, verify it has not been used before, and atomically store the mapping. A Bloom filter acts as a probabilistic pre-check that avoids a database roundtrip for the vast majority of new IDs — if the filter says "definitely not seen before," skip the DB check. If it says "possibly seen," fall back to the database.
The diagram below shows the write path for a Base62 counter-based URL shortener. The Bloom filter provides an O(1) probabilistic "seen before" check. The database unique constraint is the authoritative safety net that catches any genuine collision the Bloom filter missed.
```mermaid
flowchart TD
    A[User submits long URL] --> B[App Server]
    B --> C{Bloom Filter:<br/>ID seen before?}
    C -->|Definitely not seen| D[Encode counter → Base62 short ID]
    C -->|Possibly seen — collision risk| E[Increment counter, get next ID]
    D --> F{DB unique constraint:<br/>ID already exists?}
    F -->|ID is unique| G[Write mapping to DB]
    F -->|Collision confirmed| E
    G --> H[Add new ID to Bloom Filter]
    G --> I[Return short URL to user]
    E --> D
```
The Bloom filter eliminates database roundtrips for approximately 99.9% of new ID checks. The unique constraint in the database eliminates the rare genuine collision. Together they provide both throughput efficiency and correctness. Note: after a cold Bloom filter restart (e.g., new deployment), all IDs look "new" — the DB constraint remains the ground truth in all cases.
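For intuition, a toy Bloom filter fits in a few lines of Java. The double-hashing trick below (deriving k indices from two base hashes) is the standard Kirsch–Mitzenmacher construction; the bit-array size and hash count are illustrative, and a production filter would size them from the expected item count and target false-positive rate:

```java
import java.util.BitSet;

// Minimal Bloom filter: m bits, k hash functions via double hashing.
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);      // cheap second hash derived from the first
        return Math.floorMod(h1 + i * h2, m);   // h_i(x) = h1 + i*h2
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    // false -> definitely never added; true -> possibly added (fall back to the DB check)
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter seen = new BloomFilter(1 << 20, 5);
        seen.add("W7eT");
        System.out.println(seen.mightContain("W7eT") + " " + seen.mightContain("abc123"));
    }
}
```

The asymmetry in `mightContain` is the whole design: a `false` is authoritative, a `true` is only a hint — which is why the DB unique constraint stays in the write path.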
🌍 The URL Shortener Deep Dive — Making 500M Short Links Work Without a Single Collision
URL shorteners are the canonical ID strategy interview problem because every constraint is concrete and measurable. Let us walk through the decision in order.
Requirement baseline: 500M URLs total. 100:1 read:write ratio. IDs ≤ 8 characters. URL-safe. Zero collisions acceptable.
Step 1 — Namespace sizing. At 6-char Base62: 62^6 ≈ 56.8 billion possible IDs. With 500M records, namespace saturation is only 0.9% — comfortable for a counter, which is structurally collision-free. But if IDs were generated randomly (hash-based), the birthday paradox dominates: with P(collision) ≈ n² / (2 × namespace), 500M random IDs in a 6-char namespace yield an expected n²/2N ≈ 2 million colliding pairs — collisions are a certainty, not a risk. Choosing 7-char Base62 (3.5 trillion IDs) drops saturation to ~0.014% and buys decades of headroom, but even then a random generator at this scale still needs a uniqueness check (Bloom filter plus DB constraint) — namespace sizing alone cannot make random short IDs collision-free.
Step 2 — Counter vs hash. A counter produces predictable IDs, which is fine for URLs (they are meant to be public). A hash produces random IDs, making the system harder to enumerate but requiring a collision check on every write. For a URL shortener without sensitive ID requirements, a counter with Bloom filter is simpler and safer.
Step 3 — Multi-master counter coordination. A single database counter is a single point of failure. The Flickr Ticket Server pattern solves this: each app server pre-allocates a range of counter values (app-server-1 holds 1–100,000 in memory). When it exhausts its range, it requests a new block atomically. This eliminates per-request DB coordination while maintaining global uniqueness.
Step 4 — Base choice. URL shortener IDs are machine-copied — users click links, they do not type them. Base62 is correct over Base58: 62^7 = 3.5 trillion vs 58^7 = 2.2 trillion. More IDs per character. Case-sensitivity is acceptable because HTTP clients preserve URL case.
Step 5 — Read path caching. At 100:1 read:write ratio, read performance dominates. Short-ID-to-URL mappings live in a Redis cluster. 500M URLs × 200 bytes per entry ≈ 100GB — fits in a Redis cluster with hot IDs served from L1 memory. Cache miss triggers a DB lookup and a cache warm.
| Decision | Option A | Option B | Chosen |
|----------|----------|----------|--------|
| ID length | 6 chars (56.8B namespace) | 7 chars (3.5T namespace) | 7 chars |
| Generation | Hash-based (random) | Counter + ticket server | Counter + ticket server |
| Base | Base58 (human-safe) | Base62 (denser namespace) | Base62 |
| Collision check | DB only (slower) | Bloom filter + DB | Bloom filter + DB |
⚖️ Random vs Time-Ordered: The Tradeoff That Breaks B-Tree Indexes in Production
The most common production ID mistake is using UUID v4 as a primary key and discovering the performance cliff at 50–100M rows — by which point a schema migration is painful.
Why random UUIDs hurt B-tree indexes. A B-tree is a sorted structure. Every UUID v4 insert targets a random page anywhere in the tree. At 100M rows with 16KB pages, the working set of "hot" pages is a tiny fraction of the total tree. Nearly every insert page-faults. The database reads a cold page from disk, modifies it, and writes it back. This is the definition of write amplification on random I/O.
Why time-ordered IDs fix it. If every new ID is larger than all existing IDs (monotonically increasing within the same millisecond), every insert appends to the rightmost page. That page is almost always in the buffer pool. Inserts become near-sequential writes — dramatically more efficient and far kinder to SSD wear patterns.
The privacy/security tradeoff. Time-ordered IDs reveal when a record was created and allow inference about creation rate. For internal order IDs or audit log entries, this is completely acceptable. For externally-visible user IDs, it is a mild privacy concern. For session tokens or API keys, it is a security risk — never use time-ordered IDs for anything that must be unguessable.
The two-ID pattern. Generate an internal Snowflake ID (fast inserts, sortable for queries) and a separate external random token (UUID v4 or NanoID, opaque to callers) for user-facing identifiers. Stripe uses this pattern: every charge has an internal ledger ID (sortable, used for ordering) and a public charge ID (ch_xxx, opaque, used in APIs and webhooks). GitHub uses it for repository IDs. Twilio uses it for every resource.
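A minimal sketch of the two-ID pattern in Java. The `ch_` prefix and 16-character token length are illustrative, loosely mimicking the Stripe-style convention described above — not any vendor's actual format:

```java
import java.security.SecureRandom;

// Two-ID pattern: a sortable internal int64 for the database, plus an opaque
// random external token for APIs, URLs, and webhooks.
public class ChargeIds {
    static final String BASE62 =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static final SecureRandom RNG = new SecureRandom();

    final long internalId;    // Snowflake-style: time-ordered, indexed, never leaves the backend
    final String publicId;    // random token: reveals nothing about creation time or volume

    ChargeIds(long internalId) {
        this.internalId = internalId;
        StringBuilder sb = new StringBuilder("ch_");
        for (int i = 0; i < 16; i++) {
            sb.append(BASE62.charAt(RNG.nextInt(62)));   // ~95 bits of entropy at 16 chars
        }
        this.publicId = sb.toString();
    }

    public static void main(String[] args) {
        ChargeIds c = new ChargeIds(7_234_567_890_123_456L);
        System.out.println(c.internalId + " / " + c.publicId);
    }
}
```

The key property: the external token is generated independently of the internal ID, so no amount of token analysis leaks ordering or volume.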
| Scenario | UUID v4 | Snowflake or UUID v7 |
|----------|---------|-----------------------|
| DB insert throughput at 500M rows | Degrades ~70% vs baseline | Stays near baseline |
| Time-sortable queries | ❌ | ✅ |
| Reveals creation time | No | Yes |
| Safe for session tokens | ✅ | ❌ — never use time-ordered tokens |
| Safe for order / audit IDs | ❌ — fragmented index | ✅ |
| ID storage overhead | 16 bytes | 8 bytes (Snowflake) |
🧭 Choosing Your ID Strategy: Decision Flowchart and Use-Case Matrix
The right strategy is determined by your constraints. Start from length, then sortability, then coordination model. This flowchart covers 95% of real-world decisions.
```mermaid
flowchart TD
    A[What are your constraints?] --> B{Need short<br/>human-shareable IDs?<br/>≤8 chars}
    B -->|Yes| C{Sortability<br/>required?}
    C -->|Yes| D[ULID or KSUID<br/>26–27 chars, sortable]
    C -->|No| E[Base62 counter<br/>+ Bloom filter]
    B -->|No — internal or longer OK| F{Need time-sorted IDs<br/>for DB performance?}
    F -->|Yes| G{Multi-node<br/>distributed?}
    G -->|Yes| H[Snowflake ID<br/>or UUID v7]
    G -->|No — single DB| I[UUID v7<br/>or auto-increment]
    F -->|No sort needed| J{Content<br/>deduplication?}
    J -->|Yes| K[Truncated SHA-256<br/>→ Base62]
    J -->|No| L{Security token?<br/>Must be unguessable}
    L -->|Yes| M[NanoID or UUID v4]
    L -->|No| N[UUID v4<br/>is sufficient]
```
When two strategies appear equal on the flowchart, prefer the one your database or ORM natively supports — operational familiarity reduces incident blast radius.
Use-case decision table:
| Use Case | Recommended Strategy | Core Reason |
|----------|----------------------|-------------|
| URL shortener (public) | Base62 counter + Bloom filter | Short, URL-safe, collision-controlled with Bloom guard |
| User IDs (global SaaS) | UUID v7 or Snowflake | Sortable for queries, no central coordinator required |
| Order IDs (internal payments) | Snowflake | Compact int64, time-ordered for ledger scans, high throughput |
| Session tokens / API keys | NanoID or UUID v4 | Unpredictable, URL-safe, no ordering leak |
| Content-addressed cache keys | SHA-256 truncated → Base62 | Deterministic, free deduplication, idempotent writes |
| Referral / gift card codes | Base58 | Removes 0/O/I/l confusion; zero mistype errors |
| Audit / event log IDs | ULID | Time-ordered, readable, URL-safe, case-insensitive |
| Distributed trace IDs | UUID v4 or Snowflake | Zero coordination needed, generated at microsecond scale |
🧪 Sizing Your Namespace — The Birthday Paradox Applied to ID Collision Risk
The birthday paradox reveals a counterintuitive truth: you do not need to fill a namespace before collisions become likely. You need roughly √(namespace_size) records before the collision probability exceeds 50%.
The approximation formula is: P(collision) ≈ n² / (2 × namespace_size), where n is the total number of IDs generated. This approximation is accurate while the result is well below 1; the exact form is 1 − e^(−n²/2N), and n²/2N itself is the expected number of colliding pairs.
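The formula, and the 50% threshold it implies, can be checked numerically. The exact form 1 − e^(−n²/2N) is used below so the result saturates correctly at large n:

```java
// Birthday-bound helpers: collision probability for n random IDs in a
// namespace of size N, and the n at which that probability reaches 50%.
public class BirthdayBound {
    // P(at least one collision) = 1 - exp(-n^2 / 2N); for small exponents this
    // reduces to the n^2 / 2N approximation used in the text.
    static double collisionProbability(double n, double namespace) {
        return 1.0 - Math.exp(-(n * n) / (2.0 * namespace));
    }

    // Solve 1 - exp(-n^2/2N) = 0.5  ->  n = sqrt(2 ln 2 * N) ≈ 1.177 * sqrt(N)
    static double fiftyPercentCount(double namespace) {
        return Math.sqrt(2.0 * Math.log(2.0) * namespace);
    }

    public static void main(String[] args) {
        double base62of6 = Math.pow(62, 6);    // ~5.68e10
        double base62of7 = Math.pow(62, 7);    // ~3.52e12
        System.out.printf("6-char 50%% threshold: %.0f IDs%n", fiftyPercentCount(base62of6));
        System.out.printf("7-char, 2M random IDs: %.1f%%%n",
                100 * collisionProbability(2_000_000, base62of7));
    }
}
```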
Example: Collision Risk for Randomly Generated Base62 IDs

| ID Length | Namespace Size | 50% collision probability at | Expected colliding pairs at 500M random IDs |
|-----------|----------------|------------------------------|----------------------------------------------|
| 6 chars (Base62) | 56.8B | ~280K records | ~2.2 million |
| 7 chars (Base62) | 3.5T | ~2.2M records | ~36,000 |
| 8 chars (Base62) | 218T | ~17M records | ~570 |

These numbers are sobering: with purely random generation, even an 8-char Base62 namespace expects hundreds of collisions by 500M records, and a 6-char namespace passes the 50% collision mark before 300K records. Namespace sizing alone cannot make random short IDs safe at URL-shortener scale.

Sizing rule: random generation becomes risky once your record count approaches √(namespace_size) — about 238K for 6-char Base62 and 1.9M for 7-char, both far below 500M. That forces two design decisions: use a counter (structurally collision-free) as the primary generator, and keep the Bloom filter + unique constraint defense-in-depth mandatory for any hash-based or random path. The 7-char choice is about namespace headroom (500M is only 0.014% of 3.5T), not about making random collisions improbable.
A Bloom filter pre-check on the write path reduces collision detection DB roundtrips by ~99.9%, but it is not the safety net — the DB unique constraint is. Size your namespace first, then add the Bloom filter, then enforce the unique constraint. These are three independent layers of correctness, not alternatives to each other.
🛠️ hashids: Human-Friendly ID Obfuscation for Database-Backed Systems
hashids is an open-source library (available in 30+ languages) that encodes integers into short shuffled strings using a custom alphabet and a salt. It is not a cryptographic primitive — it is an obfuscation tool that prevents casual enumeration of sequential database IDs.
Unlike plain Base62 encoding, hashids applies a permutation based on your salt so that sequential integers (1, 2, 3) do not produce sequential strings. This hides business scale: a competitor examining https://paste.example.com/W7eT cannot easily determine that this corresponds to database row 12,345,678 or infer the total number of pastes.
```java
// hashids concept — library usage pattern only, not a full application
Hashids hashids = new Hashids("my-salt", 6, "abcdefghijklmnopqrstuvwxyz0123456789");
String encoded = hashids.encode(12345678L);  // 6+ char shuffled string; exact value depends on salt and alphabet
long[] decoded = hashids.decode(encoded);    // → [12345678]
```
hashids is deterministic (same input + salt + alphabet → same output) and reversible without a database lookup. It is a good fit for platforms that expose sequential database IDs externally and want basic obfuscation without maintaining a separate mapping table. Pastebin-style platforms, YouTube-style video IDs, and short URL generators use this pattern.
The key caveat: hashids is reversible by anyone who knows your salt and alphabet. It is obfuscation, not security. Never use it for tokens that must be unguessable.
For a companion deep-dive on content-addressed storage and deterministic ID generation strategies, a follow-up post is planned in this series.
🛠️ Redis INCR + Base62 — The Ticket Server Pattern for Atomic Short ID Generation
Redis provides an atomic INCR command that increments a counter key and returns the new value in a single operation. No application-level locking is needed — Redis's single-threaded command execution guarantees that no two clients ever receive the same counter value.
```
INCR url_counter     # Returns: 12345678 (atomic — no two clients get the same value)
# App layer: encode 12345678 in Base62 → "W7eT"
# Store mapping: "W7eT" → "https://example.com/long-url"
```
For multi-region systems, run one Redis counter per region with a regional prefix or a large offset in the counter start (region-A starts at 0, region-B starts at 10¹², region-C starts at 2×10¹²). This eliminates cross-region coordination while preserving global uniqueness.
For durable ID generation, persist the counter (appendonly yes in Redis config) or use a PostgreSQL sequence (NEXTVAL). A Redis instance that restarts without persistence resets the counter to 0, causing ID reuse — a subtle and dangerous production bug. The ticket server pattern with pre-allocated ranges in application memory (each server holds a block of 10,000 IDs) adds a buffer against counter service downtime without compromising uniqueness.
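A sketch of the block-allocation idea in Java. The `AtomicLong` stands in for the central counter (Redis `INCRBY` or a DB sequence in production, which this sketch does not implement); the block size is an illustrative choice:

```java
import java.util.concurrent.atomic.AtomicLong;

// Ticket-server / block-allocation pattern: each app server grabs a block of
// 10,000 counter values in one atomic call, then hands out IDs from local
// memory with no per-request coordination.
public class TicketBlockAllocator {
    static final long BLOCK_SIZE = 10_000;
    static final AtomicLong CENTRAL_COUNTER = new AtomicLong(0);  // stand-in for Redis/DB

    private long next;      // next ID to hand out
    private long limit;     // exclusive end of the current block

    synchronized long nextId() {
        if (next >= limit) {
            // One round-trip per 10,000 IDs: INCRBY block_size, then work backward
            limit = CENTRAL_COUNTER.addAndGet(BLOCK_SIZE);
            next = limit - BLOCK_SIZE;
        }
        return next++;
    }

    public static void main(String[] args) {
        TicketBlockAllocator a = new TicketBlockAllocator();
        TicketBlockAllocator b = new TicketBlockAllocator();
        System.out.println(a.nextId() + " " + a.nextId() + " " + b.nextId());
    }
}
```

Because each allocator owns a disjoint range, two servers can never issue the same ID; the tradeoff is that a crashed server abandons the unused remainder of its block, leaving harmless gaps in the sequence.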
For an architectural deep-dive on Redis as a coordination primitive, see the planned Redis Internals post in the How It Works: Internals Explained series.
📚 What Real Systems Got Wrong With ID Generation
Never expose auto-increment DB IDs externally. Sequential IDs reveal your total record count and creation rate. In 2012, a researcher enumerated Facebook profile IDs by incrementing from 1 and derived total user count data before a financial disclosure. Use an opaque encoding (hashids, Base62 shuffle, or UUID v4) for any ID that crosses the API boundary.
Size your namespace for headroom, not just fit. For counter-based IDs, aim for at least 100× your maximum expected record count so the scheme outlives your growth. For random or hash-based IDs, the birthday paradox means collisions become likely once you pass roughly √(namespace_size) records — at 500M random IDs, no practical short-ID namespace avoids collisions, so a uniqueness check is part of the design, not an afterthought. Planning for 500M URLs means 7-char Base62 for headroom plus an enforced unique constraint.
Time-ordered IDs dramatically outperform random UUIDs for database inserts. If you have an existing system with UUID v4 primary keys and inserts are degrading, migrating to UUID v7 is the lowest-friction fix — same 16-byte wire format, same column type, no schema migration, just a UUID version change. Percona reports 3–5× insert throughput improvement at 500M+ rows.
For human-entered codes, Base58 is mandatory, not optional. Confusing 0 with O or I with l causes measurable support incidents. E-commerce platforms that shipped gift card and referral code systems using Base62 without filtering ambiguous characters documented 2–5% redemption failure rates attributable to mistyping alone.
A Bloom filter pre-check is a throughput optimization, not the collision safety net. Bloom filters can only say "definitely not seen" or "possibly seen." After a cold start, the filter is empty and every ID looks new. The database unique constraint is the authoritative check. Treat the Bloom filter as a performance layer, not a correctness layer.
The two-ID pattern is the production-grade solution for IDs that are both fast and safe. Internal Snowflake or UUID v7 ID for database performance. Separate external random token (NanoID, UUID v4) for anything user-facing or security-sensitive. This is how Stripe, GitHub, and Twilio design their resource identifiers.
📌 ID Strategy Decision Cheat Sheet
- Short shareable URLs: Base62 counter + Bloom filter + unique constraint. Use 7 chars for 500M+ records — the counter itself never collides, but 7 chars keeps namespace saturation below 0.02% and leaves room for years of growth.
- Database primary keys at high volume: UUID v7 or Snowflake. UUID v4 causes B-tree fragmentation at 50M+ rows — this is a documented production problem, not a theoretical concern.
- Global distributed IDs, no coordinator: Snowflake (compact int64, sortable, 4M IDs/ms across 1,024 nodes) or ULID (26-char string, sortable, URL-safe, case-insensitive).
- Security tokens and session IDs: UUID v4 or NanoID — random, unpredictable, URL-safe. Never use time-ordered IDs for anything that must be unguessable.
- Human-entered codes: Base58. Removing the four ambiguous characters (0, O, I, l) is the difference between a 0% and a 3% support ticket rate.
- Content deduplication: Truncated SHA-256 in Base62. Deterministic, free deduplication. Still requires a DB unique check — hash collisions are rare but real.
- The two-ID pattern: Snowflake internally, random token externally. This is what Stripe, GitHub, and Twilio do, and it solves the entire DB performance vs opacity tradeoff.
📝 Test Your ID Strategy Thinking
Why does bit.ly use Base62 over Base58 for its short IDs?
- A) Base62 is easier to implement in most programming languages
- B) Base62 provides more IDs per character count, and short URLs are machine-copied rather than human-typed
- C) Base58 is case-sensitive and would cause URL parsing errors in some browsers
- D) Base62 is required by the HTTP specification for URL path segments

Correct Answer: B — Base62's 62 characters give 56.8B IDs at 6 chars vs Base58's 38.1B. Since users click short URLs (not type them), the ambiguous-character risk is irrelevant.
A distributed payment system must generate over 1 billion order IDs per day, and those IDs must be sortable for real-time ledger queries. Which strategy fits best?
- A) UUID v4 — collision-free and completely distributed with no coordination
- B) Base62 hash of order details — deterministic with free deduplication
- C) Snowflake ID — time-ordered int64, no inter-node coordination, ~4M IDs/ms globally
- D) PostgreSQL auto-increment — simple, sequential, and fast for single-node systems

Correct Answer: C — Snowflake provides time-ordered int64 IDs at massive throughput with no coordinator. UUID v4 would cause B-tree fragmentation at 1B+ rows. Auto-increment is a single point of failure for a distributed system.
A service has UUID v4 primary keys. Insert throughput has dropped 70% compared to 100M rows ago (now at 500M rows). What is the fix and why does it work?
- A) Add a secondary index on the UUID column to speed up lookups
- B) Switch to UUID v7 — the leading 48-bit timestamp makes new records insert at the rightmost B-tree leaf, eliminating random page faults
- C) Increase database buffer pool RAM to cache more random B-tree pages
- D) Partition the table by UUID prefix to distribute random inserts across smaller subtrees

Correct Answer: B — UUID v7's time-ordered prefix restores sequential insert behavior. Options A, C, and D treat symptoms rather than the root cause (random page access pattern).
Your referral code system shows a 4% redemption failure rate. Users call support saying the code "doesn't work." Codes are 8 characters using Base62. What is the most likely cause and the correct fix?
- A) 8-char Base62 has insufficient namespace — users are guessing valid codes
- B) Base62 includes visually ambiguous characters (0/O and I/l) — users misread them; switch to Base58
- C) The Bloom filter false positive rate is too high — tune the filter bit-array size
- D) The codes are too long — reduce to 6 characters to reduce transcription errors

Correct Answer: B — The classic Base62 support ticket problem. Removing 0, O, I, l via Base58 eliminates ambiguous-character mistyping at the source.
Open-ended challenge: At what approximate record count does a 7-character Base62 namespace become risky for collision without a uniqueness check? Design a system generating 10 billion short URLs over 5 years while keeping collision probability below 0.01%. Consider namespace sizing, write-path architecture, and what happens when the Bloom filter is cold after a deployment restart.
🔗 Keep Learning: Related Posts
- How Bloom Filters Work: The Probabilistic Set — the data structure powering the collision pre-check on every URL shortener write path
- System Design HLD: URL Shortener — the full HLD walkthrough that applies the ID strategy, caching, and sharding decisions from this post
- Consistent Hashing Explained — the complementary routing strategy that determines which shard holds a given ID
- System Design Sharding Strategy — how your ID strategy and shard key selection interact at database scale

Written by
Abstract Algorithms
@abstractalgorithms