
ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond

How to choose the right uniqueness strategy — from 6-char short URLs to distributed int64 order IDs

Abstract Algorithms · 26 min read

TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy causes B-tree fragmentation at 50M+ rows, production collisions, or IDs that leak your business metrics to competitors.

📖 The Uniqueness Problem Hiding Inside Every Production System

You have 500 million shortened URLs. Every ID collision is a production incident — the wrong page opens for someone clicking a shared link. How do you guarantee uniqueness at scale while keeping IDs short enough for humans to share?

This is not a niche URL shortener problem. The same decision shows up everywhere:

  • User IDs in a global SaaS product must be unique across data centers with no central coordinator.
  • Order IDs in a payments ledger must be time-sortable so audit queries stay fast.
  • Session tokens must be unpredictable so attackers cannot guess or enumerate them.
  • Distributed trace IDs must be generated at microsecond throughput with zero coordination overhead.
  • Content cache keys must be deterministic — same content, same key — so deduplication is automatic.

Each use case has a different constraint set, and each constraint set has a best-fit strategy. Defaulting to "UUID" everywhere, or keeping auto-increment because it is simple, eventually causes a real problem at scale.

The five decision axes that drive every ID strategy choice are:

| Axis | What it controls | Concrete constraint example |
|---|---|---|
| Length | How many characters can the ID use? | URL shortener: ≤ 8 chars |
| Charset | Which characters are safe in the context? | URL-safe, case-insensitive, no ambiguous chars |
| Sortability | Must IDs sort by creation time? | Order IDs: yes. Session tokens: never |
| Generation model | Can a central coordinator exist? | Global distributed systems: coordinator is a bottleneck |
| Collision guarantee | Probabilistic or structural uniqueness? | Payment IDs: must be structurally collision-free |

Here is a quick decision matrix across the strategies this post covers:

| Strategy | Short IDs | Sortable | Fully Distributed | Collision-Free | URL-Safe |
|---|---|---|---|---|---|
| Auto-increment + Base62 | ✅ | ✅ | ❌ (needs ticket server) | ✅ | ✅ |
| UUID v4 | ❌ (36 chars) | ❌ | ✅ | ✅ (probabilistic) | ⚠️ |
| UUID v7 | ❌ (36 chars) | ✅ | ✅ | ✅ (probabilistic) | ⚠️ |
| Snowflake | ✅ (int64) | ✅ | ✅ | ✅ (structural) | ✅ |
| ULID | ⚠️ (26 chars) | ✅ | ✅ | ✅ (probabilistic) | ✅ |
| Base62 hash + Bloom filter | ✅ | ❌ | ✅ | ✅ (with check) | ✅ |
| NanoID | ⚠️ (21 chars) | ❌ | ✅ | ✅ (probabilistic) | ✅ |

🔍 What "Base" Actually Means — Encoding Numbers Into Shorter Strings

"Base encoding" sounds cryptographic, but the concept is simple: you have a number, and you want to represent it using a given set of characters. The larger your character set, the fewer characters you need for the same numeric value.

Think of it like counting in different number systems. Decimal (base-10) needs more digits than hexadecimal (base-16) to represent large numbers. Base62 (62 characters) needs even fewer digits because each position carries more information.

Base16 (Hex) — the verbose universal baseline. The alphabet is 0–9, a–f — 16 characters. One hex digit represents 4 bits. A 64-bit number needs 16 hex characters. UUIDs are 128 bits → 32 hex chars (with four hyphens → 36). Verbose, but supported everywhere.

Base32 (RFC 4648) — designed for checksums and DNS. Uses A–Z, 2–7 — 32 characters. Deliberately avoids 0 and 1 (too similar to O and I). Used in TOTP secrets (Google Authenticator), AWS resource IDs, and ULID's Crockford variant. Safe for case-folding environments.

Base36 (0–9, a–z) — fully case-insensitive. 36 characters. Safe across all systems that normalize to uppercase or lowercase. 6 chars → 36^6 ≈ 2.2 billion combinations. The right choice when IDs might be spoken aloud, typed manually, or passed through case-insensitive routing layers.

Base58 — Base62 minus the four visually ambiguous characters. Removes 0 (zero), O (capital-oh), I (capital-i), and l (lowercase-L). 58 characters. Used by Bitcoin addresses and Flickr short URLs. Produces slightly longer IDs than Base62 for the same namespace, but is the correct choice for any code a human will manually type — gift cards, referral codes, license keys.

Base62 (0–9, a–z, A–Z) — the standard URL shortener alphabet. 62 characters, case-sensitive. 6 chars → 62^6 ≈ 56 billion combinations. bit.ly, TinyURL, and YouTube video IDs all use Base62 or a close variant. The tradeoff is case-sensitivity — it breaks in systems that fold case, and in some email clients that linkify URLs differently from what is intended.

| Encoding | Alphabet size | Ambiguous chars | Chars for 2^32 | Chars for 2^64 | Primary use case |
|---|---|---|---|---|---|
| Base16 (Hex) | 16 | None | 8 | 16 | UUIDs, SHA hashes |
| Base32 | 32 | None (by design) | 7 | 13 | TOTP secrets, AWS IDs |
| Base36 | 36 | None | 7 | 13 | Human-readable codes |
| Base58 | 58 | None (removed) | 6 | 11 | Bitcoin addresses, referral codes |
| Base62 | 62 | 0/O, I/l | 6 | 11 | URL shorteners, video IDs |

The practical takeaway: if humans will type the code, use Base58. If the ID is machine-copied (clicked links, API calls), use Base62 for the denser namespace. If any layer in your stack is case-insensitive, use Base36.
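The arithmetic behind every base encoding is repeated division by the alphabet size. A minimal sketch in Java (the digits-then-lowercase-then-uppercase alphabet order here is one common convention, not a standard; real shorteners often shuffle or reorder theirs):

```java
// Minimal Base62 encode/decode sketch.
// Alphabet order is a project choice; this one uses 0-9, then a-z, then A-Z.
public final class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Repeatedly divide by 62, emitting the remainder digit each step.
    public static String encode(long value) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % 62)));
            value /= 62;
        }
        return sb.reverse().toString();
    }

    // Horner's rule: accumulate digit values left to right.
    public static long decode(String id) {
        long value = 0;
        for (char c : id.toCharArray()) {
            value = value * 62 + ALPHABET.indexOf(c);
        }
        return value;
    }
}
```

Note that the counter value 12,345,678 encodes to just 4 characters; the namespace grows 62× with each additional position.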


⚙️ How Each ID Strategy Generates Uniqueness at Scale

Auto-Increment + Base Encoding

The database sequence generates a monotonically increasing integer. The application encodes it in Base62 (or Base58) to produce a short, human-friendly string. Used by bit.ly and early YouTube. 6 chars of Base62 gives 56 billion unique IDs from a single counter.

The fundamental weakness is the single point of failure: if the counter database goes down, ID generation stops. Sequential IDs also leak business scale — a competitor can estimate your total URL count by probing bit.ly/1 through bit.ly/xyz123. The production fix is the Flickr Ticket Server pattern: a dedicated database pre-allocates ID ranges (e.g., app-server-1 gets 1–100,000; app-server-2 gets 100,001–200,000), allowing parallel ID generation without per-request DB coordination.

UUID v4 — Random, Globally Unique, Coordination-Free

A UUID v4 is a 128-bit identifier with 122 random bits, formatted as xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx. The collision probability at 1 billion records is on the order of 10⁻¹⁹ — effectively zero, since the birthday bound at 2^122 scale sits near 10^18 records. UUID v4 is the right choice when global uniqueness is paramount, IDs must be completely opaque, and no coordination between generators is possible.

The severe tradeoff is B-tree fragmentation at scale. UUID v4 is random, which means every insert targets a random position in the database index. At 500 million rows, almost every insert page-faults (the target leaf node is not in the buffer pool). This makes UUID v4 a serious performance hazard as a primary key for high-volume tables.

UUID v7 — Time-Ordered UUID Without the Performance Penalty

UUID v7 places a 48-bit Unix millisecond timestamp in the leading bits, followed by 74 bits of randomness. This makes UUID v7 lexicographically sortable — new records always insert near the rightmost leaf of the B-tree, preserving sequential locality. The collision probability remains negligible. RFC 9562 (2024) standardized UUID v7 and it is now the recommended UUID version for most database primary key use cases where you would otherwise use UUID v4.
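The bit layout can be sketched with Java's built-in UUID type. This is an illustration of the RFC 9562 field positions under stated assumptions, not a production generator (real implementations also enforce monotonicity for IDs minted within the same millisecond):

```java
import java.security.SecureRandom;
import java.util.UUID;

public final class UuidV7Sketch {
    private static final SecureRandom RND = new SecureRandom();

    public static UUID generate(long unixMillis) {
        // Most significant 64 bits: 48-bit ms timestamp | 4-bit version (7) | 12 random bits.
        long msb = (unixMillis & 0xFFFF_FFFF_FFFFL) << 16
                 | 0x7000L
                 | (RND.nextLong() & 0x0FFFL);
        // Least significant 64 bits: 2-bit variant (10) | 62 random bits.
        long lsb = 0x8000_0000_0000_0000L
                 | (RND.nextLong() & 0x3FFF_FFFF_FFFF_FFFFL);
        return new UUID(msb, lsb);
    }

    // The embedded creation time is recoverable from the top 48 bits.
    public static long timestampOf(UUID id) {
        return id.getMostSignificantBits() >>> 16;
    }
}
```

Because the timestamp occupies the most significant bits, two v7 UUIDs from different milliseconds compare in creation order both as bytes and as strings.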

Snowflake ID — 64-Bit, Time-Ordered, No Coordinator Needed

Twitter invented Snowflake in 2010 to generate IDs at millions per second across thousands of servers with zero inter-node coordination. A Snowflake ID packs three independent fields into a single 64-bit integer:

  • 41 bits — Unix timestamp in milliseconds since a custom epoch (~69 years of ID space)
  • 10 bits — machine/node ID (1,024 nodes supported)
  • 12 bits — per-millisecond sequence counter per node (4,096 IDs per ms per node)

Across 1,024 nodes, that is roughly 4 million IDs per millisecond globally, with no shared state. IDs are naturally time-sorted, fit in a 64-bit integer (8 bytes — compact, fast to index), and are used by Twitter, Discord, Instagram, and LinkedIn in modified forms.
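The packing itself is three shifts and a mask. A sketch in Java (the custom epoch below reuses Twitter's published epoch purely for illustration; real deployments pick their own):

```java
// Snowflake bit packing sketch: 41-bit ms timestamp | 10-bit node | 12-bit sequence.
public final class SnowflakeLayout {
    static final long CUSTOM_EPOCH_MS = 1288834974657L; // Twitter's epoch, for illustration

    public static long pack(long timestampMs, long nodeId, long sequence) {
        return ((timestampMs - CUSTOM_EPOCH_MS) & ((1L << 41) - 1)) << 22
             | (nodeId & 0x3FF) << 12
             | (sequence & 0xFFF);
    }

    // Each field unpacks with the inverse shift and mask.
    public static long timestampMs(long id) { return (id >>> 22) + CUSTOM_EPOCH_MS; }
    public static long nodeId(long id)      { return (id >>> 12) & 0x3FF; }
    public static long sequence(long id)    { return id & 0xFFF; }
}
```

Because the timestamp sits in the high bits, plain integer comparison orders IDs by creation time, which is exactly why they index so well.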

ULID — UUID Replacement With Built-In Sortability

ULID (Universally Unique Lexicographically Sortable Identifier) combines 48 bits of millisecond timestamp with 80 bits of randomness, encoded in Crockford Base32 to produce a 26-character string. It is URL-safe, case-insensitive, and lexicographically sortable. Designed as a drop-in UUID replacement for systems that need sortability in a human-readable string format without switching to an integer type.

NanoID — Compact, Cryptographically Random, URL-Safe

NanoID generates cryptographically secure random IDs using a configurable alphabet and length. The default output is 21 URL-safe characters (~126 bits of entropy), making it more compact than UUID v4 with slightly more randomness (126 vs 122 random bits). The tradeoff: NanoID is not sortable, requires an external library, and is not natively supported by database ID generation functions.

Hash-Based IDs — Deterministic Content-Addressed Keys

Take content (a URL, a file, a payload), compute SHA-256, truncate to N bytes, encode in Base62. This is deterministic: the same input always produces the same ID. Free deduplication — if you store the same URL twice, you compute the same ID and find the existing record without a lookup. Used in Git object IDs and CDN cache keys. Not suitable as a primary key without a unique constraint check — truncated hashes eventually collide as the namespace fills.
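A hedged sketch of the pattern in Java: hash, truncate, encode. The 8-byte truncation length and the alphabet are illustrative choices, and as noted above, a unique constraint must still back this up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Content-addressed ID sketch: SHA-256, truncate to 8 bytes, Base62-encode.
public final class ContentId {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static String of(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            // First 8 bytes of the digest, sign bit cleared so the Base62 loop stays positive.
            long n = 0;
            for (int i = 0; i < 8; i++) n = (n << 8) | (digest[i] & 0xFF);
            n &= Long.MAX_VALUE;
            StringBuilder sb = new StringBuilder();
            do { sb.append(ALPHABET.charAt((int) (n % 62))); n /= 62; } while (n > 0);
            return sb.reverse().toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every JVM
        }
    }
}
```

Determinism is the whole point: calling `of` twice on the same URL yields the same key, so a duplicate submission resolves to the existing record.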

Sonyflake and KSUID — Snowflake Variants

Sonyflake is Sony's 63-bit Snowflake variant with a longer timestamp range and a simpler bit layout targeting single-datacenter deployments. KSUID (K-Sortable Unique ID) uses a 32-bit Unix timestamp in seconds (coarser than milliseconds) plus 128 bits of random payload, encoded in Base62 to produce 27 characters. KSUID trades timestamp resolution for a larger random component, making it appropriate for audit logs and event streams where second-level sorting is sufficient.


🧠 Inside Snowflake: Why a 64-Bit Integer Achieves Global Uniqueness Without a Central Database

The Internals: Bit Layout, Node Assignment, and Clock Skew

Snowflake's uniqueness guarantee comes from combining three orthogonal sources of differentiation into a single int64. The bit layout:

Bits:    63       62–22        21–12       11–0
        | 0 |   41-bit     |  10-bit   |  12-bit  |
        |   | timestamp(ms)|  node ID  | sequence |
        |   |              | (0–1023)  | (0–4095) |

Within each millisecond, the sequence counter increments per ID generated on a given node. When the sequence counter overflows 4,095 within one millisecond, the generator waits for the clock to tick forward. Node IDs are assigned at service startup — typically via ZooKeeper, Kubernetes pod ordinals, or a startup-time registration with a coordination service.

The critical vulnerability is clock skew. If a server's clock moves backward (NTP correction, leap second, VM migration), and the Snowflake generator does not detect it, two different IDs generated at the "same" timestamp on the same node could collide if the sequence counter has reset. Production Snowflake implementations detect clock regression and refuse to issue IDs until the clock catches up — they block rather than risk a duplicate.
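The guard itself is a few lines of state. A sketch with an injectable clock, so skew can be simulated (a real generator would typically wait or alert rather than throw; throwing keeps the example short):

```java
import java.util.function.LongSupplier;

// Clock-regression guard sketch: never hand out a timestamp earlier
// than one already used for ID generation.
public final class MonotonicClockGuard {
    private final LongSupplier wallClock; // injectable so tests can simulate skew
    private long lastMs = Long.MIN_VALUE;

    public MonotonicClockGuard(LongSupplier wallClock) { this.wallClock = wallClock; }

    public synchronized long nextTimestamp() {
        long now = wallClock.getAsLong();
        if (now < lastMs) {
            throw new IllegalStateException(
                "clock moved backwards by " + (lastMs - now) + " ms; refusing to issue IDs");
        }
        lastMs = now;
        return now;
    }
}
```

Blocking or failing here is the safe default: a duplicate ID is strictly worse than a brief pause in ID issuance.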

Performance Analysis: Why Snowflake and UUID v7 Outperform UUID v4 at 500M Rows

The B-tree index underpins primary key lookups in every major relational database. A B-tree is a sorted structure. Inserting a random UUID v4 means the database must find the correct leaf node for that value — which could be anywhere in the tree. At scale, nearly every insert page-faults because the target leaf node is cold in the buffer pool.

Snowflake IDs and UUID v7 are monotonically increasing within millisecond granularity. Every new record inserts at the rightmost leaf of the B-tree. That page is hot. Inserts become sequential writes — far more efficient on both spinning disks and NVMe SSDs.

| Metric | UUID v4 | Snowflake / UUID v7 |
|---|---|---|
| Insert pattern | Random B-tree page (anywhere in tree) | Append to rightmost leaf |
| Buffer pool locality | Low — most inserts are cold misses | High — rightmost leaf always hot |
| Index fragmentation after 500M rows | Heavy — many half-filled pages | Low — compact sequential structure |
| Typical insert throughput (500M rows, PostgreSQL) | ~20–30K/s | ~80–100K/s |
| ID storage size | 16 bytes (UUID) | 8 bytes (int64 Snowflake) |

This is not a theoretical concern. Percona, Uber Engineering, and PlanetScale have all published post-mortems on UUID v4 index fragmentation causing production degradation. It was one of the primary motivations behind UUID v7's standardization.


📊 Visualizing the URL Shortener Write Path with Bloom Filter Guard

When a user submits a long URL, the system must generate a short ID, verify it has not been used before, and atomically store the mapping. A Bloom filter acts as a probabilistic pre-check that avoids a database roundtrip for the vast majority of new IDs — if the filter says "definitely not seen before," skip the DB check. If it says "possibly seen," fall back to the database.

The diagram below shows the write path for a Base62 counter-based URL shortener. The Bloom filter provides an O(1) probabilistic "seen before" check. The database unique constraint is the authoritative safety net that catches any genuine collision the Bloom filter missed.

flowchart TD
    A[User submits long URL] --> B[App Server]
    B --> C[Increment counter,\nencode → Base62 short ID]
    C --> D{Bloom Filter:\nID seen before?}
    D -->|Definitely not seen| F[Write mapping to DB\nunder unique constraint]
    D -->|Possibly seen — collision risk| E{DB lookup:\nID already exists?}
    E -->|Not found| F
    E -->|Exists — collision confirmed| C
    F -->|Unique constraint violation| C
    F -->|Success| G[Add new ID to Bloom Filter]
    G --> H[Return short URL to user]

The Bloom filter eliminates database roundtrips for approximately 99.9% of new ID checks. The unique constraint in the database eliminates the rare genuine collision. Together they provide both throughput efficiency and correctness. Note: after a cold Bloom filter restart (e.g., new deployment), all IDs look "new" — the DB constraint remains the ground truth in all cases.
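A minimal Bloom filter is a bit array plus k hash positions per key. The sketch below uses two ad-hoc hash functions for brevity; production filters derive k and the array size from a target false-positive rate:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: two hash positions per key over an m-bit array.
public final class BloomSketch {
    private final BitSet bits;
    private final int m;

    public BloomSketch(int mBits) { this.m = mBits; this.bits = new BitSet(mBits); }

    // Two cheap, ad-hoc hash positions (illustrative, not production-grade mixing).
    private int h1(String key) { return Math.floorMod(key.hashCode(), m); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + key.length(), m); }

    public void add(String key) { bits.set(h1(key)); bits.set(h2(key)); }

    // false => definitely never added; true => possibly added (fall back to the DB check)
    public boolean mightContain(String key) { return bits.get(h1(key)) && bits.get(h2(key)); }
}
```

The asymmetry in `mightContain` is the entire contract: a `false` answer is guaranteed correct, while a `true` answer only routes the write to the authoritative database check.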


🏗️ Worked Example: Sizing a URL Shortener ID Scheme

URL shorteners are the canonical ID strategy interview problem because every constraint is concrete and measurable. Let us walk through the decision in order.

Requirement baseline: 500M URLs total, a 100:1 read:write ratio, IDs ≤ 8 characters, URL-safe, and zero tolerance for collisions.

Step 1 — Namespace sizing. At 6-char Base62: 62^6 ≈ 56.8 billion possible IDs. With 500M records, namespace saturation is only 0.9% — comfortable at first glance. But the birthday paradox bites long before saturation: among randomly generated IDs, collisions become likely once the record count passes √(namespace) ≈ 240K, and at 500M records each newly hash-generated ID has a ≈ 0.9% chance of hitting an existing one — millions of collision retries over the system's lifetime. Moving to 7-char Base62 (3.5 trillion IDs) drops the per-write collision rate to ≈ 0.014% at 500M records, and pairing it with a counter-based generator avoids collisions structurally.

Step 2 — Counter vs hash. A counter produces predictable IDs, which is fine for URLs (they are meant to be public). A hash produces random IDs, making the system harder to enumerate but requiring a collision check on every write. For a URL shortener without sensitive ID requirements, a counter with Bloom filter is simpler and safer.

Step 3 — Multi-master counter coordination. A single database counter is a single point of failure. The Flickr Ticket Server pattern solves this: each app server pre-allocates a range of counter values (app-server-1 holds 1–100,000 in memory). When it exhausts its range, it requests a new block atomically. This eliminates per-request DB coordination while maintaining global uniqueness.

Step 4 — Base choice. URL shortener IDs are machine-copied — users click links, they do not type them. Base62 is correct over Base58: 62^7 = 3.5 trillion vs 58^7 = 2.2 trillion. More IDs per character. Case-sensitivity is acceptable because HTTP clients preserve URL case.

Step 5 — Read path caching. At 100:1 read:write ratio, read performance dominates. Short-ID-to-URL mappings live in a Redis cluster. 500M URLs × 200 bytes per entry ≈ 100GB — fits in a Redis cluster with hot IDs served from L1 memory. Cache miss triggers a DB lookup and a cache warm.

| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| ID length | 6 chars (56.8B namespace) | 7 chars (3.5T namespace) | 7 chars |
| Generation | Hash-based (random) | Counter + ticket server | Counter + ticket server |
| Base | Base58 (human-safe) | Base62 (denser namespace) | Base62 |
| Collision check | DB only (slower) | Bloom filter + DB | Bloom filter + DB |

⚖️ Random vs Time-Ordered: The Tradeoff That Breaks B-Tree Indexes in Production

The most common production ID mistake is using UUID v4 as a primary key and discovering the performance cliff at 50–100M rows — by which point a schema migration is painful.

Why random UUIDs hurt B-tree indexes. A B-tree is a sorted structure. Every UUID v4 insert targets a random page anywhere in the tree. At 100M rows with 16KB pages, the working set of "hot" pages is a tiny fraction of the total tree. Nearly every insert page-faults. The database reads a cold page from disk, modifies it, and writes it back. This is the definition of write amplification on random I/O.

Why time-ordered IDs fix it. If every new ID is larger than all existing IDs (monotonically increasing within the same millisecond), every insert appends to the rightmost page. That page is almost always in the buffer pool. Inserts become near-sequential writes — dramatically more efficient and far kinder to SSD wear patterns.

The privacy/security tradeoff. Time-ordered IDs reveal when a record was created and allow inference about creation rate. For internal order IDs or audit log entries, this is completely acceptable. For externally-visible user IDs, it is a mild privacy concern. For session tokens or API keys, it is a security risk — never use time-ordered IDs for anything that must be unguessable.

The two-ID pattern. Generate an internal Snowflake ID (fast inserts, sortable for queries) and a separate external random token (UUID v4 or NanoID, opaque to callers) for user-facing identifiers. Stripe uses this pattern: every charge has an internal ledger ID (sortable, used for ordering) and a public charge ID (ch_xxx, opaque, used in APIs and webhooks). GitHub uses it for repository IDs. Twilio uses it for every resource.
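A sketch of the external-token half of the pattern (the `ch_` prefix mirrors Stripe-style resource prefixes; the 22-character length, roughly 131 bits of entropy, is an illustrative choice, not any vendor's actual format):

```java
import java.security.SecureRandom;

// Two-ID pattern sketch: the internal ID would be a Snowflake int64;
// this generates the opaque external token that crosses the API boundary.
public final class ExternalTokens {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final SecureRandom RND = new SecureRandom();

    public static String newToken(String prefix) {
        StringBuilder sb = new StringBuilder(prefix).append('_');
        // 22 Base62 chars ≈ 131 bits of entropy — unguessable, reveals nothing.
        for (int i = 0; i < 22; i++) sb.append(ALPHABET.charAt(RND.nextInt(62)));
        return sb.toString();
    }
}
```

The database stores both columns: the Snowflake ID as the primary key for fast, sorted inserts, and the token under a unique index for API lookups.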

| Scenario | UUID v4 | Snowflake or UUID v7 |
|---|---|---|
| DB insert throughput at 500M rows | Degrades ~70% vs baseline | Stays near baseline |
| Time-sortable queries | ❌ | ✅ |
| Reveals creation time | No | Yes |
| Safe for session tokens | ✅ | ❌ — never use time-ordered tokens |
| Safe for order / audit IDs | ❌ — fragmented index | ✅ |
| ID storage overhead | 16 bytes | 8 bytes (Snowflake) |

🧭 Choosing Your ID Strategy: Decision Flowchart and Use-Case Matrix

The right strategy is determined by your constraints. Start from length, then sortability, then coordination model. This flowchart covers 95% of real-world decisions.

flowchart TD
    A[What are your constraints?] --> B{Need short\nhuman-shareable IDs?\n≤8 chars}
    B -->|Yes| C{Sortability\nrequired?}
    C -->|Yes| D[ULID or KSUID\n26–27 chars sortable]
    C -->|No| E[Base62 counter\n+ Bloom filter]
    B -->|No — internal or longer OK| F{Need time-sorted IDs\nfor DB performance?}
    F -->|Yes| G{Multi-node\ndistributed?}
    G -->|Yes| H[Snowflake ID\nor UUID v7]
    G -->|No — single DB| I[UUID v7\nor auto-increment]
    F -->|No sort needed| J{Content\ndeduplication?}
    J -->|Yes| K[Truncated SHA-256\n→ Base62]
    J -->|No| L{Security token?\nMust be unguessable}
    L -->|Yes| M[NanoID or UUID v4]
    L -->|No| N[UUID v4\nis sufficient]

When two strategies appear equal on the flowchart, prefer the one your database or ORM natively supports — operational familiarity reduces incident blast radius.

Use-case decision table:

| Use Case | Recommended Strategy | Core Reason |
|---|---|---|
| URL shortener (public) | Base62 counter + Bloom filter | Short, URL-safe, collision-controlled with Bloom guard |
| User IDs (global SaaS) | UUID v7 or Snowflake | Sortable for queries, no central coordinator required |
| Order IDs (internal payments) | Snowflake | Compact int64, time-ordered for ledger scans, high throughput |
| Session tokens / API keys | NanoID or UUID v4 | Unpredictable, URL-safe, no ordering leak |
| Content-addressed cache keys | SHA-256 truncated → Base62 | Deterministic, free deduplication, idempotent writes |
| Referral / gift card codes | Base58 | Removes 0/O/I/l confusion; zero mistype errors |
| Audit / event log IDs | ULID | Time-ordered, readable, URL-safe, case-insensitive |
| Distributed trace IDs | UUID v4 or Snowflake | Zero coordination needed, generated at microsecond scale |

🧪 Sizing Your Namespace — The Birthday Paradox Applied to ID Collision Risk

The birthday paradox reveals a counterintuitive truth: you do not need to fill a namespace before collisions become likely. You need roughly √(namespace_size) records before the probability of at least one collision exceeds 50%.

Two approximations cover the practical cases:

  • P(at least one collision among n random IDs) ≈ n² / (2 × namespace_size) — valid only while the result is well below 1; past √(namespace_size) records, a collision is effectively certain.
  • P(the next random ID collides with an existing one) ≈ n / namespace_size — the per-write retry rate that matters once a uniqueness check is in place.

Where n is the total number of IDs generated.
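These numbers are easy to sanity-check. A small Java sketch (the 50% threshold uses the √(namespace/2) form; values shown are for Base62):

```java
// Namespace sizing arithmetic: birthday threshold and per-write collision rate.
public final class NamespaceMath {
    public static double namespace(int base, int length) {
        return Math.pow(base, length);
    }

    // Record count at which a first collision among random IDs becomes likely (~50%).
    public static double birthdayThreshold(double namespaceSize) {
        return Math.sqrt(namespaceSize / 2);
    }

    // Probability that the NEXT randomly generated ID hits one of n existing IDs.
    public static double perWriteCollision(long n, double namespaceSize) {
        return n / namespaceSize;
    }
}
```

Running the 6-char Base62 case (namespace ≈ 5.68 × 10¹⁰) puts the birthday threshold near 169K records and the per-write collision rate at 500M records near 0.9%.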

Example: Per-Write Collision Probability (n / namespace) for Hash-Generated Base62 IDs

| ID Length | Namespace Size | At 10M records | At 100M records | At 500M records | At 1B records |
|---|---|---|---|---|---|
| 6 chars (Base62) | 56.8B | 0.018% | 0.18% | 0.88% | 1.8% |
| 7 chars (Base62) | 3.5T | 0.00028% | 0.0028% | 0.014% | 0.028% |
| 8 chars (Base62) | 218T | 0.0000046% | 0.000046% | 0.00023% | 0.00046% |

The 6-char Base62 row reveals why the URL shortener must use 7 characters. At 500M records, a 6-char namespace gives every new hash-generated ID a 0.88% chance of colliding — millions of retried writes over the system's lifetime, each one a wrong redirect if the check is ever skipped.

Sizing rule: if your expected record count exceeds √(namespace_size / 2), random generation alone will collide. For 6-char Base62, √(56.8B / 2) ≈ 169K — 500M is far beyond this threshold. Even 7-char Base62 only moves it to √(3.5T / 2) ≈ 1.3M, still far below 500M. Extra length lowers the per-write retry rate, but at this scale only a structural guarantee (a counter) or an authoritative uniqueness check eliminates collisions — which is why the Bloom filter + unique constraint defense-in-depth approach remains mandatory.

A Bloom filter pre-check on the write path reduces collision detection DB roundtrips by ~99.9%, but it is not the safety net — the DB unique constraint is. Size your namespace first, then add the Bloom filter, then enforce the unique constraint. These are three independent layers of correctness, not alternatives to each other.


🛠️ hashids: Human-Friendly ID Obfuscation for Database-Backed Systems

hashids is an open-source library (available in 30+ languages) that encodes integers into short shuffled strings using a custom alphabet and a salt. It is not a cryptographic primitive — it is an obfuscation tool that prevents casual enumeration of sequential database IDs.

Unlike plain Base62 encoding, hashids applies a permutation based on your salt so that sequential integers (1, 2, 3) do not produce sequential strings. This hides business scale: a competitor examining https://paste.example.com/W7eT cannot easily determine that this corresponds to database row 12,345,678 or infer the total number of pastes.

// hashids concept — library usage pattern only, not a full application
// (the exact output string depends on the salt and alphabet; shown here as illustrative)
Hashids hashids = new Hashids("my-salt", 6, "abcdefghijklmnopqrstuvwxyz0123456789");
String encoded = hashids.encode(12345678L); // → a shuffled string of ≥ 6 chars
long[] decoded = hashids.decode(encoded);   // → [12345678]

hashids is deterministic (same input + salt + alphabet → same output) and reversible without a database lookup. It is a good fit for platforms that expose sequential database IDs externally and want basic obfuscation without maintaining a separate mapping table. Pastebin-style platforms, YouTube-style video IDs, and short URL generators use this pattern.

The key caveat: hashids is reversible by anyone who knows your salt and alphabet. It is obfuscation, not security. Never use it for tokens that must be unguessable.

For a companion deep-dive on content-addressed storage and deterministic ID generation strategies, a follow-up post is planned in this series.


🛠️ Redis INCR + Base62 — The Ticket Server Pattern for Atomic Short ID Generation

Redis provides an atomic INCR command that increments a counter key and returns the new value in a single operation. No application-level locking is needed — Redis's single-threaded command execution guarantees that no two clients ever receive the same counter value.

INCR url_counter   # Returns: 12345678
# App layer: encode 12345678 in Base62 → a 4-char short ID (exact string depends on alphabet)
# Store mapping: short_id → "https://example.com/long-url"

For multi-region systems, run one Redis counter per region with a regional prefix or a large offset in the counter start (region-A starts at 0, region-B starts at 10¹², region-C starts at 2×10¹²). This eliminates cross-region coordination while preserving global uniqueness.

For durable ID generation, persist the counter (appendonly yes in Redis config) or use a PostgreSQL sequence (NEXTVAL). A Redis instance that restarts without persistence resets the counter to 0, causing ID reuse — a subtle and dangerous production bug. The ticket server pattern with pre-allocated ranges in application memory (each server holds a block of 10,000 IDs) adds a buffer against counter service downtime without compromising uniqueness.
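The range pre-allocation reduces to a local counter plus an occasional atomic block fetch. A single-process sketch in which an AtomicLong stands in for the coordinator's counter (an assumption made for testability; in production that `addAndGet` call is a Redis INCRBY or a DB sequence fetch):

```java
import java.util.concurrent.atomic.AtomicLong;

// Ticket-server range allocation sketch: each app server draws a block of IDs
// from the shared counter and hands them out from local memory.
public final class TicketRange {
    static final long BLOCK_SIZE = 10_000;
    private final AtomicLong coordinator;   // stands in for Redis INCRBY / DB sequence
    private long next, end;                 // current local block is [next, end)

    public TicketRange(AtomicLong coordinator) { this.coordinator = coordinator; }

    public synchronized long nextId() {
        if (next >= end) {                  // block exhausted: fetch a new one atomically
            end = coordinator.addAndGet(BLOCK_SIZE);
            next = end - BLOCK_SIZE;
        }
        return next++;
    }
}
```

Two servers sharing one coordinator receive disjoint blocks, so uniqueness holds globally even though 9,999 out of every 10,000 IDs are issued without touching the coordinator at all.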

For an architectural deep-dive on Redis as a coordination primitive, see the planned Redis Internals post in the How It Works: Internals Explained series.


📚 What Real Systems Got Wrong With ID Generation

Never expose auto-increment DB IDs externally. Sequential IDs reveal your total record count and creation rate. In 2012, a researcher enumerated Facebook profile IDs by incrementing from 1 and derived total user count data before a financial disclosure. Use an opaque encoding (hashids, Base62 shuffle, or UUID v4) for any ID that crosses the API boundary.

Always size your namespace 100× larger than your maximum expected record count. The birthday paradox means collision probability rises sharply once you pass √(namespace_size) records. Planning for 500M URLs means designing for a 50-billion-plus namespace — 7-char Base62 — not just "will it fit."

Time-ordered IDs dramatically outperform random UUIDs for database inserts. If you have an existing system with UUID v4 primary keys and inserts are degrading, migrating to UUID v7 is the lowest-friction fix — same 16-byte wire format, same column type, no schema migration, just a UUID version change. Percona reports 3–5× insert throughput improvement at 500M+ rows.

For human-entered codes, Base58 is mandatory, not optional. Confusing 0 with O or I with l causes measurable support incidents. E-commerce platforms that shipped gift card and referral code systems using Base62 without filtering ambiguous characters documented 2–5% redemption failure rates attributable to mistyping alone.

A Bloom filter pre-check is a throughput optimization, not the collision safety net. Bloom filters can only say "definitely not seen" or "possibly seen." After a cold start, the filter is empty and every ID looks new. The database unique constraint is the authoritative check. Treat the Bloom filter as a performance layer, not a correctness layer.

The two-ID pattern is the production-grade solution for IDs that are both fast and safe. Internal Snowflake or UUID v7 ID for database performance. Separate external random token (NanoID, UUID v4) for anything user-facing or security-sensitive. This is how Stripe, GitHub, and Twilio design their resource identifiers.


📌 ID Strategy Decision Cheat Sheet

  • Short shareable URLs: Base62 counter + Bloom filter + unique constraint. Use 7 chars for 500M+ records — a 6-char namespace gives hash-generated IDs a ~0.9% per-write collision rate at that scale.
  • Database primary keys at high volume: UUID v7 or Snowflake. UUID v4 causes B-tree fragmentation at 50M+ rows — this is a documented production problem, not a theoretical concern.
  • Global distributed IDs, no coordinator: Snowflake (compact int64, sortable, 4M IDs/ms across 1,024 nodes) or ULID (26-char string, sortable, URL-safe, case-insensitive).
  • Security tokens and session IDs: UUID v4 or NanoID — random, unpredictable, URL-safe. Never use time-ordered IDs for anything that must be unguessable.
  • Human-entered codes: Base58. Removing the four ambiguous characters (0, O, I, l) is the difference between a 0% and a 3% support ticket rate.
  • Content deduplication: Truncated SHA-256 in Base62. Deterministic, free deduplication. Still requires a DB unique check — hash collisions are rare but real.
  • The two-ID pattern: Snowflake internally, random token externally. This is what Stripe, GitHub, and Twilio do, and it solves the entire DB performance vs opacity tradeoff.

📝 Test Your ID Strategy Thinking

  1. Why does bit.ly use Base62 over Base58 for its short IDs?

    • A) Base62 is easier to implement in most programming languages
    • B) Base62 provides more IDs per character count, and short URLs are machine-copied rather than human-typed
    • C) Base58 is case-sensitive and would cause URL parsing errors in some browsers
    • D) Base62 is required by the HTTP specification for URL path segments

    Correct Answer: B — Base62's 62 characters give 56.8B IDs at 6 chars vs Base58's 38.1B. Since users click short URLs (not type them), the ambiguous-character risk is irrelevant.
  2. A distributed payment system must generate over 1 billion order IDs per day, and those IDs must be sortable for real-time ledger queries. Which strategy fits best?

    • A) UUID v4 — collision-free and completely distributed with no coordination
    • B) Base62 hash of order details — deterministic with free deduplication
    • C) Snowflake ID — time-ordered int64, no inter-node coordination, ~4M IDs/ms globally
    • D) PostgreSQL auto-increment — simple, sequential, and fast for single-node systems

    Correct Answer: C — Snowflake provides time-ordered int64 IDs at massive throughput with no coordinator. UUID v4 would cause B-tree fragmentation at 1B+ rows. Auto-increment is a single point of failure for a distributed system.
  3. A service has UUID v4 primary keys. Insert throughput has dropped 70% compared to 100M rows ago (now at 500M rows). What is the fix and why does it work?

    • A) Add a secondary index on the UUID column to speed up lookups
    • B) Switch to UUID v7 — the leading 48-bit timestamp makes new records insert at the rightmost B-tree leaf, eliminating random page faults
    • C) Increase database buffer pool RAM to cache more random B-tree pages
    • D) Partition the table by UUID prefix to distribute random inserts across smaller subtrees

    Correct Answer: B — UUID v7's time-ordered prefix restores sequential insert behavior. Options A, C, and D treat symptoms rather than the root cause (random page access pattern).
  4. Your referral code system shows a 4% redemption failure rate. Users call support saying the code "doesn't work." Codes are 8 characters using Base62. What is the most likely cause and the correct fix?

    • A) 8-char Base62 has insufficient namespace — users are guessing valid codes
    • B) Base62 includes visually ambiguous characters (0/O and I/l) — users misread them; switch to Base58
    • C) The Bloom filter false positive rate is too high — tune the filter bit-array size
    • D) The codes are too long — reduce to 6 characters to reduce transcription errors

    Correct Answer: B — The classic Base62 support ticket problem. Removing 0, O, I, l via Base58 eliminates ambiguous-character mistyping at the source.
  5. Open-ended challenge: At what approximate record count does a 7-character Base62 namespace become risky for collision without a uniqueness check? Design a system generating 10 billion short URLs over 5 years while keeping collision probability below 0.01%. Consider namespace sizing, write-path architecture, and what happens when the Bloom filter is cold after a deployment restart.



Written by Abstract Algorithms (@abstractalgorithms)