
ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond

How to choose the right uniqueness strategy — from 6-char short URLs to distributed int64 order IDs

Abstract Algorithms · 26 min read

TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy causes B-tree fragmentation at 50M+ rows, production collisions, or IDs that leak your business metrics to competitors.

📖 The Uniqueness Problem Hiding Inside Every Production System

You have 500 million shortened URLs. Every ID collision is a production incident — the wrong page opens for someone clicking a shared link. How do you guarantee uniqueness at scale while keeping IDs short enough for humans to share?

This is not a niche URL shortener problem. The same decision shows up everywhere:

  • User IDs in a global SaaS product must be unique across data centers with no central coordinator.
  • Order IDs in a payments ledger must be time-sortable so audit queries stay fast.
  • Session tokens must be unpredictable so attackers cannot guess or enumerate them.
  • Distributed trace IDs must be generated at microsecond throughput with zero coordination overhead.
  • Content cache keys must be deterministic — same content, same key — so deduplication is automatic.

Each use case has a different constraint set, and each constraint set has a best-fit strategy. Defaulting to "UUID" everywhere, or keeping auto-increment because it is simple, eventually causes a real problem at scale.

The five decision axes that drive every ID strategy choice are:

| Axis | What it controls | Concrete constraint example |
|---|---|---|
| Length | How many characters can the ID use? | URL shortener: ≤ 8 chars |
| Charset | Which characters are safe in the context? | URL-safe, case-insensitive, no ambiguous chars |
| Sortability | Must IDs sort by creation time? | Order IDs: yes. Session tokens: never |
| Generation model | Can a central coordinator exist? | Global distributed systems: coordinator is a bottleneck |
| Collision guarantee | Probabilistic or structural uniqueness? | Payment IDs: must be structurally collision-free |

Here is a quick decision matrix across the strategies this post covers:

| Strategy | Short IDs | Sortable | Fully Distributed | Collision-Free | URL-Safe |
|---|---|---|---|---|---|
| Auto-increment + Base62 | ✅ | ✅ | ❌ (needs ticket server) | ✅ | ✅ |
| UUID v4 | ❌ (36 chars) | ❌ | ✅ | ✅ (probabilistic) | ⚠️ |
| UUID v7 | ❌ (36 chars) | ✅ | ✅ | ✅ (probabilistic) | ⚠️ |
| Snowflake | ✅ (int64) | ✅ | ✅ | ✅ (structural) | ✅ |
| ULID | ⚠️ (26 chars) | ✅ | ✅ | ✅ (probabilistic) | ✅ |
| Base62 hash + Bloom filter | ✅ | ❌ | ✅ | ✅ (with check) | ✅ |
| NanoID | ⚠️ (21 chars) | ❌ | ✅ | ✅ (probabilistic) | ✅ |

🔍 What "Base" Actually Means — Encoding Numbers Into Shorter Strings

"Base encoding" sounds cryptographic, but the concept is simple: you have a number, and you want to represent it using a given set of characters. The larger your character set, the fewer characters you need for the same numeric value.

Think of it like counting in different number systems. Decimal (base-10) needs more digits than hexadecimal (base-16) to represent large numbers. Base62 (62 characters) needs even fewer digits because each position carries more information.

Base16 (Hex) — the verbose universal baseline. The alphabet is 0–9, a–f — 16 characters. One hex digit represents 4 bits. A 64-bit number needs 16 hex characters. UUIDs are 128 bits → 32 hex chars (with four hyphens → 36). Verbose, but supported everywhere.

Base32 (RFC 4648) — designed for checksums and DNS. Uses A–Z, 2–7 — 32 characters. Deliberately avoids 0 and 1 (too similar to O and I). Used in TOTP secrets (Google Authenticator), AWS resource IDs, and ULID's Crockford variant. Safe for case-folding environments.

Base36 (0–9, a–z) — fully case-insensitive. 36 characters. Safe across all systems that normalize to uppercase or lowercase. 6 chars → 36^6 ≈ 2.2 billion combinations. The right choice when IDs might be spoken aloud, typed manually, or passed through case-insensitive routing layers.

Base58 — Base62 minus the four visually ambiguous characters. Removes 0 (zero), O (capital-oh), I (capital-i), and l (lowercase-L). 58 characters. Used by Bitcoin addresses and Flickr short URLs. Produces slightly longer IDs than Base62 for the same namespace, but is the correct choice for any code a human will manually type — gift cards, referral codes, license keys.

Base62 (0–9, a–z, A–Z) — the standard URL shortener alphabet. 62 characters, case-sensitive. 6 chars → 62^6 ≈ 56 billion combinations. bit.ly, TinyURL, and YouTube video IDs all use Base62 or a close variant. The tradeoff is case-sensitivity — it breaks in systems that fold case, and in some email clients that linkify URLs differently from what is intended.

| Encoding | Alphabet size | Ambiguous chars | Chars for 2^32 | Chars for 2^64 | Primary use case |
|---|---|---|---|---|---|
| Base16 (Hex) | 16 | None | 8 | 16 | UUIDs, SHA hashes |
| Base32 | 32 | None (by design) | 7 | 13 | TOTP secrets, AWS IDs |
| Base36 | 36 | None | 7 | 13 | Human-readable codes |
| Base58 | 58 | None (removed) | 6 | 11 | Bitcoin addresses, referral codes |
| Base62 | 62 | 0/O, I/l | 6 | 11 | URL shorteners, video IDs |

The practical takeaway: if humans will type the code, use Base58. If the ID is machine-copied (clicked links, API calls), use Base62 for the denser namespace. If any layer in your stack is case-insensitive, use Base36.
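The arithmetic behind every base encoding is repeated division by the alphabet size. A minimal sketch in Java (the digits-then-lowercase-then-uppercase alphabet order here is one common convention, not a standard; real shorteners often shuffle or reorder theirs):

```java
// Minimal Base62 encode/decode sketch.
// Alphabet order is a project choice; this one uses 0-9, then a-z, then A-Z.
public final class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Repeatedly divide by 62, emitting the remainder digit each step.
    public static String encode(long value) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % 62)));
            value /= 62;
        }
        return sb.reverse().toString();
    }

    // Horner's rule: accumulate digit values left to right.
    public static long decode(String id) {
        long value = 0;
        for (char c : id.toCharArray()) {
            value = value * 62 + ALPHABET.indexOf(c);
        }
        return value;
    }
}
```

Note that the counter value 12,345,678 encodes to just 4 characters; the namespace grows 62× with each additional position.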


⚙️ How Each ID Strategy Generates Uniqueness at Scale

Auto-Increment + Base Encoding

The database sequence generates a monotonically increasing integer. The application encodes it in Base62 (or Base58) to produce a short, human-friendly string. Used by bit.ly and early YouTube. 6 chars of Base62 gives 56 billion unique IDs from a single counter.

The fundamental weakness is the single point of failure: if the counter database goes down, ID generation stops. Sequential IDs also leak business scale — a competitor can estimate your total URL count by probing bit.ly/1 through bit.ly/xyz123. The production fix is the Flickr Ticket Server pattern: a dedicated database pre-allocates ID ranges (e.g., app-server-1 gets 1–100,000; app-server-2 gets 100,001–200,000), allowing parallel ID generation without per-request DB coordination.

UUID v4 — Random, Globally Unique, Coordination-Free

A UUID v4 is a 128-bit identifier with 122 random bits, formatted as xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx. The collision probability at 1 billion records is on the order of 10⁻¹⁹ — effectively zero, since the birthday bound at 2^122 scale sits near 10^18 records. UUID v4 is the right choice when global uniqueness is paramount, IDs must be completely opaque, and no coordination between generators is possible.

The severe tradeoff is B-tree fragmentation at scale. UUID v4 is random, which means every insert targets a random position in the database index. At 500 million rows, almost every insert page-faults (the target leaf node is not in the buffer pool). This makes UUID v4 a serious performance hazard as a primary key for high-volume tables.

UUID v7 — Time-Ordered UUID Without the Performance Penalty

UUID v7 places a 48-bit Unix millisecond timestamp in the leading bits, followed by 74 bits of randomness. This makes UUID v7 lexicographically sortable — new records always insert near the rightmost leaf of the B-tree, preserving sequential locality. The collision probability remains negligible. RFC 9562 (2024) standardized UUID v7 and it is now the recommended UUID version for most database primary key use cases where you would otherwise use UUID v4.
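The bit layout can be sketched with Java's built-in UUID type. This is an illustration of the RFC 9562 field positions under stated assumptions, not a production generator (real implementations also enforce monotonicity for IDs minted within the same millisecond):

```java
import java.security.SecureRandom;
import java.util.UUID;

public final class UuidV7Sketch {
    private static final SecureRandom RND = new SecureRandom();

    public static UUID generate(long unixMillis) {
        // Most significant 64 bits: 48-bit ms timestamp | 4-bit version (7) | 12 random bits.
        long msb = (unixMillis & 0xFFFF_FFFF_FFFFL) << 16
                 | 0x7000L
                 | (RND.nextLong() & 0x0FFFL);
        // Least significant 64 bits: 2-bit variant (10) | 62 random bits.
        long lsb = 0x8000_0000_0000_0000L
                 | (RND.nextLong() & 0x3FFF_FFFF_FFFF_FFFFL);
        return new UUID(msb, lsb);
    }

    // The embedded creation time is recoverable from the top 48 bits.
    public static long timestampOf(UUID id) {
        return id.getMostSignificantBits() >>> 16;
    }
}
```

Because the timestamp occupies the most significant bits, two v7 UUIDs from different milliseconds compare in creation order both as bytes and as strings.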

Snowflake ID — 64-Bit, Time-Ordered, No Coordinator Needed

Twitter invented Snowflake in 2010 to generate IDs at millions per second across thousands of servers with zero inter-node coordination. A Snowflake ID packs three independent fields into a single 64-bit integer:

  • 41 bits — Unix timestamp in milliseconds since a custom epoch (~69 years of ID space)
  • 10 bits — machine/node ID (1,024 nodes supported)
  • 12 bits — per-millisecond sequence counter per node (4,096 IDs per ms per node)

Across 1,024 nodes, that is roughly 4 million IDs per millisecond globally, with no shared state. IDs are naturally time-sorted, fit in a 64-bit integer (8 bytes — compact, fast to index), and are used by Twitter, Discord, Instagram, and LinkedIn in modified forms.
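The packing itself is three shifts and a mask. A sketch in Java (the custom epoch below reuses Twitter's published epoch purely for illustration; real deployments pick their own):

```java
// Snowflake bit packing sketch: 41-bit ms timestamp | 10-bit node | 12-bit sequence.
public final class SnowflakeLayout {
    static final long CUSTOM_EPOCH_MS = 1288834974657L; // Twitter's epoch, for illustration

    public static long pack(long timestampMs, long nodeId, long sequence) {
        return ((timestampMs - CUSTOM_EPOCH_MS) & ((1L << 41) - 1)) << 22
             | (nodeId & 0x3FF) << 12
             | (sequence & 0xFFF);
    }

    // Each field unpacks with the inverse shift and mask.
    public static long timestampMs(long id) { return (id >>> 22) + CUSTOM_EPOCH_MS; }
    public static long nodeId(long id)      { return (id >>> 12) & 0x3FF; }
    public static long sequence(long id)    { return id & 0xFFF; }
}
```

Because the timestamp sits in the high bits, plain integer comparison orders IDs by creation time, which is exactly why they index so well.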

ULID — UUID Replacement With Built-In Sortability

ULID (Universally Unique Lexicographically Sortable Identifier) combines 48 bits of millisecond timestamp with 80 bits of randomness, encoded in Crockford Base32 to produce a 26-character string. It is URL-safe, case-insensitive, and lexicographically sortable. Designed as a drop-in UUID replacement for systems that need sortability in a human-readable string format without switching to an integer type.

NanoID — Compact, Cryptographically Random, URL-Safe

NanoID generates cryptographically secure random IDs using a configurable alphabet and length. The default output is 21 URL-safe characters (~126 bits of entropy), making it more compact than UUID v4 with slightly more randomness (126 vs 122 random bits). The tradeoff: NanoID is not sortable, requires an external library, and is not natively supported by database ID generation functions.

Hash-Based IDs — Deterministic Content-Addressed Keys

Take content (a URL, a file, a payload), compute SHA-256, truncate to N bytes, encode in Base62. This is deterministic: the same input always produces the same ID. Free deduplication — if you store the same URL twice, you compute the same ID and find the existing record without a lookup. Used in Git object IDs and CDN cache keys. Not suitable as a primary key without a unique constraint check — truncated hashes eventually collide as the namespace fills.
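A hedged sketch of the pattern in Java: hash, truncate, encode. The 8-byte truncation length and the alphabet are illustrative choices, and as noted above, a unique constraint must still back this up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Content-addressed ID sketch: SHA-256, truncate to 8 bytes, Base62-encode.
public final class ContentId {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static String of(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            // First 8 bytes of the digest, sign bit cleared so the Base62 loop stays positive.
            long n = 0;
            for (int i = 0; i < 8; i++) n = (n << 8) | (digest[i] & 0xFF);
            n &= Long.MAX_VALUE;
            StringBuilder sb = new StringBuilder();
            do { sb.append(ALPHABET.charAt((int) (n % 62))); n /= 62; } while (n > 0);
            return sb.reverse().toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every JVM
        }
    }
}
```

Determinism is the whole point: calling `of` twice on the same URL yields the same key, so a duplicate submission resolves to the existing record.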

Sonyflake and KSUID — Snowflake Variants

Sonyflake is Sony's 63-bit Snowflake variant with a longer timestamp range and a simpler bit layout targeting single-datacenter deployments. KSUID (K-Sortable Unique ID) uses a 32-bit Unix timestamp in seconds (coarser than milliseconds) plus 128 bits of random payload, encoded in Base62 to produce 27 characters. KSUID trades timestamp resolution for a larger random component, making it appropriate for audit logs and event streams where second-level sorting is sufficient.


🧠 Inside Snowflake: Why a 64-Bit Integer Achieves Global Uniqueness Without a Central Database

The Internals: Bit Layout, Node Assignment, and Clock Skew

Snowflake's uniqueness guarantee comes from combining three orthogonal sources of differentiation into a single int64. The bit layout:

Bits:    63       62–22        21–12       11–0
        | 0 |   41-bit     |  10-bit   |  12-bit  |
        |   | timestamp(ms)|  node ID  | sequence |
        |   |              | (0–1023)  | (0–4095) |

Within each millisecond, the sequence counter increments per ID generated on a given node. When the sequence counter overflows 4,095 within one millisecond, the generator waits for the clock to tick forward. Node IDs are assigned at service startup — typically via ZooKeeper, Kubernetes pod ordinals, or a startup-time registration with a coordination service.

The critical vulnerability is clock skew. If a server's clock moves backward (NTP correction, leap second, VM migration), and the Snowflake generator does not detect it, two different IDs generated at the "same" timestamp on the same node could collide if the sequence counter has reset. Production Snowflake implementations detect clock regression and refuse to issue IDs until the clock catches up — they block rather than risk a duplicate.
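The guard itself is a few lines of state. A sketch with an injectable clock, so skew can be simulated (a real generator would typically wait or alert rather than throw; throwing keeps the example short):

```java
import java.util.function.LongSupplier;

// Clock-regression guard sketch: never hand out a timestamp earlier
// than one already used for ID generation.
public final class MonotonicClockGuard {
    private final LongSupplier wallClock; // injectable so tests can simulate skew
    private long lastMs = Long.MIN_VALUE;

    public MonotonicClockGuard(LongSupplier wallClock) { this.wallClock = wallClock; }

    public synchronized long nextTimestamp() {
        long now = wallClock.getAsLong();
        if (now < lastMs) {
            throw new IllegalStateException(
                "clock moved backwards by " + (lastMs - now) + " ms; refusing to issue IDs");
        }
        lastMs = now;
        return now;
    }
}
```

Blocking or failing here is the safe default: a duplicate ID is strictly worse than a brief pause in ID issuance.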

Performance Analysis: Why Snowflake and UUID v7 Outperform UUID v4 at 500M Rows

The B-tree index underpins primary key lookups in every major relational database. A B-tree is a sorted structure. Inserting a random UUID v4 means the database must find the correct leaf node for that value — which could be anywhere in the tree. At scale, nearly every insert page-faults because the target leaf node is cold in the buffer pool.

Snowflake IDs and UUID v7 are monotonically increasing within millisecond granularity. Every new record inserts at the rightmost leaf of the B-tree. That page is hot. Inserts become sequential writes — far more efficient on both spinning disks and NVMe SSDs.

| Metric | UUID v4 | Snowflake / UUID v7 |
|---|---|---|
| Insert pattern | Random B-tree page (anywhere in tree) | Append to rightmost leaf |
| Buffer pool locality | Low — most inserts are cold misses | High — rightmost leaf always hot |
| Index fragmentation after 500M rows | Heavy — many half-filled pages | Low — compact sequential structure |
| Typical insert throughput (500M rows, PostgreSQL) | ~20–30K/s | ~80–100K/s |
| ID storage size | 16 bytes (UUID) | 8 bytes (int64 Snowflake) |

This is not a theoretical concern. Percona, Uber Engineering, and PlanetScale have all published post-mortems on UUID v4 index fragmentation causing production degradation. It was one of the primary motivations behind UUID v7's standardization.


📊 Visualizing the URL Shortener Write Path with Bloom Filter Guard

When a user submits a long URL, the system must generate a short ID, verify it has not been used before, and atomically store the mapping. A Bloom filter acts as a probabilistic pre-check that avoids a database roundtrip for the vast majority of new IDs — if the filter says "definitely not seen before," skip the DB check. If it says "possibly seen," fall back to the database.

The diagram below shows the write path for a Base62 counter-based URL shortener. The Bloom filter provides an O(1) probabilistic "seen before" check. The database unique constraint is the authoritative safety net that catches any genuine collision the Bloom filter missed.

flowchart TD
    A[User submits long URL] --> B[App Server]
    B --> C[Increment counter,\nencode → Base62 short ID]
    C --> D{Bloom Filter:\nID seen before?}
    D -->|Definitely not seen| F[Write mapping to DB\nunder unique constraint]
    D -->|Possibly seen — collision risk| E{DB lookup:\nID already exists?}
    E -->|Not found| F
    E -->|Exists — collision confirmed| C
    F -->|Unique constraint violation| C
    F -->|Success| G[Add new ID to Bloom Filter]
    G --> H[Return short URL to user]

The Bloom filter eliminates database roundtrips for approximately 99.9% of new ID checks. The unique constraint in the database eliminates the rare genuine collision. Together they provide both throughput efficiency and correctness. Note: after a cold Bloom filter restart (e.g., new deployment), all IDs look "new" — the DB constraint remains the ground truth in all cases.
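A minimal Bloom filter is a bit array plus k hash positions per key. The sketch below uses two ad-hoc hash functions for brevity; production filters derive k and the array size from a target false-positive rate:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: two hash positions per key over an m-bit array.
public final class BloomSketch {
    private final BitSet bits;
    private final int m;

    public BloomSketch(int mBits) { this.m = mBits; this.bits = new BitSet(mBits); }

    // Two cheap, ad-hoc hash positions (illustrative, not production-grade mixing).
    private int h1(String key) { return Math.floorMod(key.hashCode(), m); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + key.length(), m); }

    public void add(String key) { bits.set(h1(key)); bits.set(h2(key)); }

    // false => definitely never added; true => possibly added (fall back to the DB check)
    public boolean mightContain(String key) { return bits.get(h1(key)) && bits.get(h2(key)); }
}
```

The asymmetry in `mightContain` is the entire contract: a `false` answer is guaranteed correct, while a `true` answer only routes the write to the authoritative database check.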


🏗️ Worked Example: Sizing a URL Shortener ID Scheme

URL shorteners are the canonical ID strategy interview problem because every constraint is concrete and measurable. Let us walk through the decision in order.

Requirement baseline: 500M URLs total, a 100:1 read:write ratio, IDs ≤ 8 characters, URL-safe, and zero tolerance for collisions.

Step 1 — Namespace sizing. At 6-char Base62: 62^6 ≈ 56.8 billion possible IDs. With 500M records, namespace saturation is only 0.9% — comfortable at first glance. But the birthday paradox bites long before saturation: among randomly generated IDs, collisions become likely once the record count passes √(namespace) ≈ 240K, and at 500M records each newly hash-generated ID has a ≈ 0.9% chance of hitting an existing one — millions of collision retries over the system's lifetime. Moving to 7-char Base62 (3.5 trillion IDs) drops the per-write collision rate to ≈ 0.014% at 500M records, and pairing it with a counter-based generator avoids collisions structurally.

Step 2 — Counter vs hash. A counter produces predictable IDs, which is fine for URLs (they are meant to be public). A hash produces random IDs, making the system harder to enumerate but requiring a collision check on every write. For a URL shortener without sensitive ID requirements, a counter with Bloom filter is simpler and safer.

Step 3 — Multi-master counter coordination. A single database counter is a single point of failure. The Flickr Ticket Server pattern solves this: each app server pre-allocates a range of counter values (app-server-1 holds 1–100,000 in memory). When it exhausts its range, it requests a new block atomically. This eliminates per-request DB coordination while maintaining global uniqueness.

Step 4 — Base choice. URL shortener IDs are machine-copied — users click links, they do not type them. Base62 is correct over Base58: 62^7 = 3.5 trillion vs 58^7 = 2.2 trillion. More IDs per character. Case-sensitivity is acceptable because HTTP clients preserve URL case.

Step 5 — Read path caching. At 100:1 read:write ratio, read performance dominates. Short-ID-to-URL mappings live in a Redis cluster. 500M URLs × 200 bytes per entry ≈ 100GB — fits in a Redis cluster with hot IDs served from L1 memory. Cache miss triggers a DB lookup and a cache warm.

| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| ID length | 6 chars (56.8B namespace) | 7 chars (3.5T namespace) | 7 chars |
| Generation | Hash-based (random) | Counter + ticket server | Counter + ticket server |
| Base | Base58 (human-safe) | Base62 (denser namespace) | Base62 |
| Collision check | DB only (slower) | Bloom filter + DB | Bloom filter + DB |

⚖️ Random vs Time-Ordered: The Tradeoff That Breaks B-Tree Indexes in Production

The most common production ID mistake is using UUID v4 as a primary key and discovering the performance cliff at 50–100M rows — by which point a schema migration is painful.

Why random UUIDs hurt B-tree indexes. A B-tree is a sorted structure. Every UUID v4 insert targets a random page anywhere in the tree. At 100M rows with 16KB pages, the working set of "hot" pages is a tiny fraction of the total tree. Nearly every insert page-faults. The database reads a cold page from disk, modifies it, and writes it back. This is the definition of write amplification on random I/O.

Why time-ordered IDs fix it. If every new ID is larger than all existing IDs (monotonically increasing within the same millisecond), every insert appends to the rightmost page. That page is almost always in the buffer pool. Inserts become near-sequential writes — dramatically more efficient and far kinder to SSD wear patterns.

The privacy/security tradeoff. Time-ordered IDs reveal when a record was created and allow inference about creation rate. For internal order IDs or audit log entries, this is completely acceptable. For externally-visible user IDs, it is a mild privacy concern. For session tokens or API keys, it is a security risk — never use time-ordered IDs for anything that must be unguessable.

The two-ID pattern. Generate an internal Snowflake ID (fast inserts, sortable for queries) and a separate external random token (UUID v4 or NanoID, opaque to callers) for user-facing identifiers. Stripe uses this pattern: every charge has an internal ledger ID (sortable, used for ordering) and a public charge ID (ch_xxx, opaque, used in APIs and webhooks). GitHub uses it for repository IDs. Twilio uses it for every resource.
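A sketch of the external-token half of the pattern (the `ch_` prefix mirrors Stripe-style resource prefixes; the 22-character length, roughly 131 bits of entropy, is an illustrative choice, not any vendor's actual format):

```java
import java.security.SecureRandom;

// Two-ID pattern sketch: the internal ID would be a Snowflake int64;
// this generates the opaque external token that crosses the API boundary.
public final class ExternalTokens {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final SecureRandom RND = new SecureRandom();

    public static String newToken(String prefix) {
        StringBuilder sb = new StringBuilder(prefix).append('_');
        // 22 Base62 chars ≈ 131 bits of entropy — unguessable, reveals nothing.
        for (int i = 0; i < 22; i++) sb.append(ALPHABET.charAt(RND.nextInt(62)));
        return sb.toString();
    }
}
```

The database stores both columns: the Snowflake ID as the primary key for fast, sorted inserts, and the token under a unique index for API lookups.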

| Scenario | UUID v4 | Snowflake or UUID v7 |
|---|---|---|
| DB insert throughput at 500M rows | Degrades ~70% vs baseline | Stays near baseline |
| Time-sortable queries | ❌ | ✅ |
| Reveals creation time | No | Yes |
| Safe for session tokens | ✅ | ❌ — never use time-ordered tokens |
| Safe for order / audit IDs | ❌ — fragmented index | ✅ |
| ID storage overhead | 16 bytes | 8 bytes (Snowflake) |

🧭 Choosing Your ID Strategy: Decision Flowchart and Use-Case Matrix

The right strategy is determined by your constraints. Start from length, then sortability, then coordination model. This flowchart covers 95% of real-world decisions.

flowchart TD
    A[What are your constraints?] --> B{Need short\nhuman-shareable IDs?\n≤8 chars}
    B -->|Yes| C{Sortability\nrequired?}
    C -->|Yes| D[ULID or KSUID\n26–27 chars sortable]
    C -->|No| E[Base62 counter\n+ Bloom filter]
    B -->|No — internal or longer OK| F{Need time-sorted IDs\nfor DB performance?}
    F -->|Yes| G{Multi-node\ndistributed?}
    G -->|Yes| H[Snowflake ID\nor UUID v7]
    G -->|No — single DB| I[UUID v7\nor auto-increment]
    F -->|No sort needed| J{Content\ndeduplication?}
    J -->|Yes| K[Truncated SHA-256\n→ Base62]
    J -->|No| L{Security token?\nMust be unguessable}
    L -->|Yes| M[NanoID or UUID v4]
    L -->|No| N[UUID v4\nis sufficient]

When two strategies appear equal on the flowchart, prefer the one your database or ORM natively supports — operational familiarity reduces incident blast radius.

Use-case decision table:

| Use Case | Recommended Strategy | Core Reason |
|---|---|---|
| URL shortener (public) | Base62 counter + Bloom filter | Short, URL-safe, collision-controlled with Bloom guard |
| User IDs (global SaaS) | UUID v7 or Snowflake | Sortable for queries, no central coordinator required |
| Order IDs (internal payments) | Snowflake | Compact int64, time-ordered for ledger scans, high throughput |
| Session tokens / API keys | NanoID or UUID v4 | Unpredictable, URL-safe, no ordering leak |
| Content-addressed cache keys | SHA-256 truncated → Base62 | Deterministic, free deduplication, idempotent writes |
| Referral / gift card codes | Base58 | Removes 0/O/I/l confusion; zero mistype errors |
| Audit / event log IDs | ULID | Time-ordered, readable, URL-safe, case-insensitive |
| Distributed trace IDs | UUID v4 or Snowflake | Zero coordination needed, generated at microsecond scale |

🧪 Sizing Your Namespace — The Birthday Paradox Applied to ID Collision Risk

The birthday paradox reveals a counterintuitive truth: you do not need to fill a namespace before collisions become likely. You need roughly √(namespace_size) records before the probability of at least one collision exceeds 50%.

Two approximations cover the practical cases:

  • P(at least one collision among n random IDs) ≈ n² / (2 × namespace_size) — valid only while the result is well below 1; past √(namespace_size) records, a collision is effectively certain.
  • P(the next random ID collides with an existing one) ≈ n / namespace_size — the per-write retry rate that matters once a uniqueness check is in place.

Where n is the total number of IDs generated.
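These numbers are easy to sanity-check. A small Java sketch (the 50% threshold uses the √(namespace/2) form; values shown are for Base62):

```java
// Namespace sizing arithmetic: birthday threshold and per-write collision rate.
public final class NamespaceMath {
    public static double namespace(int base, int length) {
        return Math.pow(base, length);
    }

    // Record count at which a first collision among random IDs becomes likely (~50%).
    public static double birthdayThreshold(double namespaceSize) {
        return Math.sqrt(namespaceSize / 2);
    }

    // Probability that the NEXT randomly generated ID hits one of n existing IDs.
    public static double perWriteCollision(long n, double namespaceSize) {
        return n / namespaceSize;
    }
}
```

Running the 6-char Base62 case (namespace ≈ 5.68 × 10¹⁰) puts the birthday threshold near 169K records and the per-write collision rate at 500M records near 0.9%.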

Example: Per-Write Collision Probability (n / namespace) for Hash-Generated Base62 IDs

| ID Length | Namespace Size | At 10M records | At 100M records | At 500M records | At 1B records |
|---|---|---|---|---|---|
| 6 chars (Base62) | 56.8B | 0.018% | 0.18% | 0.88% | 1.8% |
| 7 chars (Base62) | 3.5T | 0.00028% | 0.0028% | 0.014% | 0.028% |
| 8 chars (Base62) | 218T | 0.0000046% | 0.000046% | 0.00023% | 0.00046% |

The 6-char Base62 row reveals why the URL shortener must use 7 characters. At 500M records, a 6-char namespace gives every new hash-generated ID a 0.88% chance of colliding — millions of retried writes over the system's lifetime, each one a wrong redirect if the check is ever skipped.

Sizing rule: if your expected record count exceeds √(namespace_size / 2), random generation alone will collide. For 6-char Base62, √(56.8B / 2) ≈ 169K — 500M is far beyond this threshold. Even 7-char Base62 only moves it to √(3.5T / 2) ≈ 1.3M, still far below 500M. Extra length lowers the per-write retry rate, but at this scale only a structural guarantee (a counter) or an authoritative uniqueness check eliminates collisions — which is why the Bloom filter + unique constraint defense-in-depth approach remains mandatory.

A Bloom filter pre-check on the write path reduces collision detection DB roundtrips by ~99.9%, but it is not the safety net — the DB unique constraint is. Size your namespace first, then add the Bloom filter, then enforce the unique constraint. These are three independent layers of correctness, not alternatives to each other.


🛠️ hashids: Human-Friendly ID Obfuscation for Database-Backed Systems

hashids is an open-source library (available in 30+ languages) that encodes integers into short shuffled strings using a custom alphabet and a salt. It is not a cryptographic primitive — it is an obfuscation tool that prevents casual enumeration of sequential database IDs.

Unlike plain Base62 encoding, hashids applies a permutation based on your salt so that sequential integers (1, 2, 3) do not produce sequential strings. This hides business scale: a competitor examining https://paste.example.com/W7eT cannot easily determine that this corresponds to database row 12,345,678 or infer the total number of pastes.

// hashids concept — library usage pattern only, not a full application
// (the exact output string depends on the salt and alphabet; shown here as illustrative)
Hashids hashids = new Hashids("my-salt", 6, "abcdefghijklmnopqrstuvwxyz0123456789");
String encoded = hashids.encode(12345678L); // → a shuffled string of ≥ 6 chars
long[] decoded = hashids.decode(encoded);   // → [12345678]

hashids is deterministic (same input + salt + alphabet → same output) and reversible without a database lookup. It is a good fit for platforms that expose sequential database IDs externally and want basic obfuscation without maintaining a separate mapping table. Pastebin-style platforms, YouTube-style video IDs, and short URL generators use this pattern.

The key caveat: hashids is reversible by anyone who knows your salt and alphabet. It is obfuscation, not security. Never use it for tokens that must be unguessable.

For a companion deep-dive on content-addressed storage and deterministic ID generation strategies, a follow-up post is planned in this series.


🛠️ Redis INCR + Base62 — The Ticket Server Pattern for Atomic Short ID Generation

Redis provides an atomic INCR command that increments a counter key and returns the new value in a single operation. No application-level locking is needed — Redis's single-threaded command execution guarantees that no two clients ever receive the same counter value.

INCR url_counter   # Returns: 12345678
# App layer: encode 12345678 in Base62 → a 4-char short ID (exact string depends on alphabet)
# Store mapping: short_id → "https://example.com/long-url"

For multi-region systems, run one Redis counter per region with a regional prefix or a large offset in the counter start (region-A starts at 0, region-B starts at 10¹², region-C starts at 2×10¹²). This eliminates cross-region coordination while preserving global uniqueness.

For durable ID generation, persist the counter (appendonly yes in Redis config) or use a PostgreSQL sequence (NEXTVAL). A Redis instance that restarts without persistence resets the counter to 0, causing ID reuse — a subtle and dangerous production bug. The ticket server pattern with pre-allocated ranges in application memory (each server holds a block of 10,000 IDs) adds a buffer against counter service downtime without compromising uniqueness.
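The range pre-allocation reduces to a local counter plus an occasional atomic block fetch. A single-process sketch in which an AtomicLong stands in for the coordinator's counter (an assumption made for testability; in production that `addAndGet` call is a Redis INCRBY or a DB sequence fetch):

```java
import java.util.concurrent.atomic.AtomicLong;

// Ticket-server range allocation sketch: each app server draws a block of IDs
// from the shared counter and hands them out from local memory.
public final class TicketRange {
    static final long BLOCK_SIZE = 10_000;
    private final AtomicLong coordinator;   // stands in for Redis INCRBY / DB sequence
    private long next, end;                 // current local block is [next, end)

    public TicketRange(AtomicLong coordinator) { this.coordinator = coordinator; }

    public synchronized long nextId() {
        if (next >= end) {                  // block exhausted: fetch a new one atomically
            end = coordinator.addAndGet(BLOCK_SIZE);
            next = end - BLOCK_SIZE;
        }
        return next++;
    }
}
```

Two servers sharing one coordinator receive disjoint blocks, so uniqueness holds globally even though 9,999 out of every 10,000 IDs are issued without touching the coordinator at all.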

For an architectural deep-dive on Redis as a coordination primitive, see the planned Redis Internals post in the How It Works: Internals Explained series.


📚 What Real Systems Got Wrong With ID Generation

Never expose auto-increment DB IDs externally. Sequential IDs reveal your total record count and creation rate. In 2012, a researcher enumerated Facebook profile IDs by incrementing from 1 and derived total user count data before a financial disclosure. Use an opaque encoding (hashids, Base62 shuffle, or UUID v4) for any ID that crosses the API boundary.

Always size your namespace 100× larger than your maximum expected record count. The birthday paradox means collision probability rises sharply once you pass √(namespace_size) records. Planning for 500M URLs means designing for a 50-billion-plus namespace — 7-char Base62 — not just "will it fit."

Time-ordered IDs dramatically outperform random UUIDs for database inserts. If you have an existing system with UUID v4 primary keys and inserts are degrading, migrating to UUID v7 is the lowest-friction fix — same 16-byte wire format, same column type, no schema migration, just a UUID version change. Percona reports 3–5× insert throughput improvement at 500M+ rows.

For human-entered codes, Base58 is mandatory, not optional. Confusing 0 with O or I with l causes measurable support incidents. E-commerce platforms that shipped gift card and referral code systems using Base62 without filtering ambiguous characters documented 2–5% redemption failure rates attributable to mistyping alone.

A Bloom filter pre-check is a throughput optimization, not the collision safety net. Bloom filters can only say "definitely not seen" or "possibly seen." After a cold start, the filter is empty and every ID looks new. The database unique constraint is the authoritative check. Treat the Bloom filter as a performance layer, not a correctness layer.

The two-ID pattern is the production-grade solution for IDs that are both fast and safe. Internal Snowflake or UUID v7 ID for database performance. Separate external random token (NanoID, UUID v4) for anything user-facing or security-sensitive. This is how Stripe, GitHub, and Twilio design their resource identifiers.


📌 ID Strategy Decision Cheat Sheet

  • Short shareable URLs: Base62 counter + Bloom filter + unique constraint. Use 7 chars for 500M+ records — a 6-char namespace gives hash-generated IDs a ~0.9% per-write collision rate at that scale.
  • Database primary keys at high volume: UUID v7 or Snowflake. UUID v4 causes B-tree fragmentation at 50M+ rows — this is a documented production problem, not a theoretical concern.
  • Global distributed IDs, no coordinator: Snowflake (compact int64, sortable, 4M IDs/ms across 1,024 nodes) or ULID (26-char string, sortable, URL-safe, case-insensitive).
  • Security tokens and session IDs: UUID v4 or NanoID — random, unpredictable, URL-safe. Never use time-ordered IDs for anything that must be unguessable.
  • Human-entered codes: Base58. Removing the four ambiguous characters (0, O, I, l) is the difference between a 0% and a 3% support ticket rate.
  • Content deduplication: Truncated SHA-256 in Base62. Deterministic, free deduplication. Still requires a DB unique check — hash collisions are rare but real.
  • The two-ID pattern: Snowflake internally, random token externally. This is what Stripe, GitHub, and Twilio do, and it solves the entire DB performance vs opacity tradeoff.

📝 Test Your ID Strategy Thinking

  1. Why does bit.ly use Base62 over Base58 for its short IDs?

    • A) Base62 is easier to implement in most programming languages
    • B) Base62 provides more IDs per character count, and short URLs are machine-copied rather than human-typed
    • C) Base58 is case-sensitive and would cause URL parsing errors in some browsers
    • D) Base62 is required by the HTTP specification for URL path segments

    Correct Answer: B — Base62's 62 characters give 56.8B IDs at 6 chars vs Base58's 38.1B. Since users click short URLs (not type them), the ambiguous-character risk is irrelevant.
  2. A distributed payment system must generate over 1 billion order IDs per day, and those IDs must be sortable for real-time ledger queries. Which strategy fits best?

    • A) UUID v4 — collision-free and completely distributed with no coordination
    • B) Base62 hash of order details — deterministic with free deduplication
    • C) Snowflake ID — time-ordered int64, no inter-node coordination, ~4M IDs/ms globally
    • D) PostgreSQL auto-increment — simple, sequential, and fast for single-node systems

    Correct Answer: C — Snowflake provides time-ordered int64 IDs at massive throughput with no coordinator. UUID v4 would cause B-tree fragmentation at 1B+ rows. Auto-increment is a single point of failure for a distributed system.
  3. A service has UUID v4 primary keys. Insert throughput has dropped 70% compared to 100M rows ago (now at 500M rows). What is the fix and why does it work?

    • A) Add a secondary index on the UUID column to speed up lookups
    • B) Switch to UUID v7 — the leading 48-bit timestamp makes new records insert at the rightmost B-tree leaf, eliminating random page faults
    • C) Increase database buffer pool RAM to cache more random B-tree pages
    • D) Partition the table by UUID prefix to distribute random inserts across smaller subtrees

    Correct Answer: B — UUID v7's time-ordered prefix restores sequential insert behavior. Options A, C, and D treat symptoms rather than the root cause (random page access pattern).
  4. Your referral code system shows a 4% redemption failure rate. Users call support saying the code "doesn't work." Codes are 8 characters using Base62. What is the most likely cause and the correct fix?

    • A) 8-char Base62 has insufficient namespace — users are guessing valid codes
    • B) Base62 includes visually ambiguous characters (0/O and I/l) — users misread them; switch to Base58
    • C) The Bloom filter false positive rate is too high — tune the filter bit-array size
    • D) The codes are too long — reduce to 6 characters to reduce transcription errors

    Correct Answer: B — The classic Base62 support ticket problem. Removing 0, O, I, l via Base58 eliminates ambiguous-character mistyping at the source.
  5. Open-ended challenge: At what approximate record count does a 7-character Base62 namespace become risky for collision without a uniqueness check? Design a system generating 10 billion short URLs over 5 years while keeping collision probability below 0.01%. Consider namespace sizing, write-path architecture, and what happens when the Bloom filter is cold after a deployment restart.



Written by Abstract Algorithms (@abstractalgorithms)