System Design HLD Example: URL Shortener (TinyURL and Bitly)
A practical interview-ready HLD for a short-link platform with heavy read traffic.
Abstract Algorithms
TLDR: A URL shortener converts long links into compact aliases while serving extremely fast redirect reads at scale. This walkthrough designs one like TinyURL or Bitly, following the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive.
Twitter's early URL sharing created a real problem: long URLs broke tweet character limits and concealed link destinations from users. A URL shortener looks deceptively simple until you need to serve 100 million redirects per day with sub-10ms latency and guarantee zero short-code collisions under 10K writes per second; at that point a naive random string generator with a uniqueness check becomes a database bottleneck.
Understanding how to design a URL shortener teaches you core patterns shared by nearly every high-scale read-heavy system: ID generation strategies, cache-first redirect paths, database sharding, and the write amplification trade-off between counter-based and hash-based code generation.
By the end of this walkthrough you'll know why a base62-encoded auto-increment counter beats MD5 hashing (no collision detection loop required, shorter output codes), why a Redis redirect cache in front of your database is non-negotiable at 10K RPS (over 99% of traffic is reads, not writes), and why you'd shard the URL mapping table by hash(short_code) rather than creation timestamp to avoid hot partition accumulation.
Use Cases
Actors
| Actor | Role |
| End user | Clicks a short link; expects instant redirect with no visible latency |
| Link creator | Submits a long URL via API or dashboard; receives a 6-char short code |
| Brand / marketer | Requests a custom alias (vanity slug) for campaign tracking |
| Analytics consumer | Reads aggregated click data: counts, referrer, geo, device |
| Platform service | Enforces rate limits, spam detection, and expiry eviction |
Use Cases
- Primary interview prompt: Design a URL shortener like TinyURL or Bitly.
- Core user journeys:
  - Create short link: POST a long URL; the system generates a unique 6-char Base62 code and returns it
  - Redirect by code: GET /{shortCode} resolves to the original URL and issues a 301/302 redirect
  - Optional custom alias: the creator specifies a preferred slug (e.g., bit.ly/launch-2025); the system checks for collision
  - Expiry support: the creator sets a TTL or hard expiry date; expired codes return 410 Gone
  - Click analytics: every redirect emits an async event capturing timestamp, referrer, IP geo-hash, and user-agent family
- Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.
This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.
Functional Requirements
In Scope
- Short link creation: POST /shorten with long URL; returns short code with idempotency key support
- Redirect: GET /{shortCode} with cache-first lookup; sub-10ms p95 on cache hit
- Custom alias: optional user-defined slug with uniqueness validation
- Expiry: TTL-based and hard-date expiry; expired codes return 410 Gone
- Click analytics: async event per redirect (timestamp, referrer, geo-hash, device family)
Out of Scope (v1 boundary)
- Real-time analytics dashboards: the async pipeline is built; the UI is out of scope
- Full global active-active writes across every region
- A/B test link variants or personalized redirects
- QR code generation (can be a thin adapter on top of the short code)
Functional Breakdown
| Feature | API Contract | Key Decision |
| Create short link | POST /shorten → { shortCode, shortUrl, expiresAt } | Base62(INCR counter) vs MD5 hash |
| Redirect by code | GET /{shortCode} → 302 Location: <longUrl> | Redis cache-first; DB fallback |
| Custom alias | POST /shorten with alias field | Collision check on writes; reject or retry |
| Expiry support | expiresAt field on create; eviction job or Redis TTL | Soft delete vs hard TTL |
| Click analytics | Fire-and-forget Kafka event on every redirect | Async; never blocks the redirect path |
Initial building blocks:
- Code generation service: Redis INCR counter + Base62 encoding → guaranteed-unique 6-char codes
- Redirect API: stateless HTTP layer; cache-first lookup with DB fallback and write-through warm-up
- Metadata store: PostgreSQL table (short_code PK, long_url, created_at, expires_at, owner_id)
- Cache tier: Redis GET redirect:{code}; 99%+ hit rate on hot codes; TTL mirrors link expiry
- Async analytics pipeline: Kafka topic url.redirect.events → Flink consumer → ClickHouse aggregation
A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.
Non Functional Requirements
| Dimension | Target | Why it matters for a URL shortener |
| Scalability | ~1.2K writes/sec steady, 10K/sec burst; ~115K redirects/sec steady, 500K/sec burst | Read:write ratio ≈ 100:1, so scale read and write paths independently |
| Availability | 99.99% on the redirect path; 99.9% on creation | A broken redirect is user-facing failure; creation can tolerate brief outages |
| Performance | Redirect p95 < 10ms (cache hit) / < 50ms (DB fallback); creation p95 < 100ms | Users expect near-instant redirects |
| Consistency | Short code → URL: strong (creation must be globally unique); click counts: eventual | Correctness on writes; throughput on analytics reads |
| Operability | Cache hit ratio, redirect p95/p99, Kafka consumer lag, DB row growth | Core SLO signals that drive scaling decisions |
Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.
Deep Dive: Estimations and Design Goals
The Internals: Data Model
The primary table is url_mappings. Every short code, whether auto-generated or a custom alias, maps to exactly one long URL with optional expiry.
-- PostgreSQL schema; table is sharded by hash(short_code) % N
CREATE TABLE url_mappings (
short_code VARCHAR(12) PRIMARY KEY, -- 6 chars (auto) or up to 12 (alias)
long_url TEXT NOT NULL,
owner_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL means no expiry
deleted_at TIMESTAMPTZ, -- NULL means active; set on expiry/delete
is_custom BOOLEAN NOT NULL DEFAULT FALSE
);
-- Expiry worker uses this index to batch-find expired codes efficiently
CREATE INDEX idx_expires ON url_mappings (expires_at)
WHERE expires_at IS NOT NULL AND deleted_at IS NULL;
Why VARCHAR(12) and not VARCHAR(6)? Auto-generated codes are always 6 chars. Custom aliases can be up to 12. One column covers both without a separate table.
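The `hash(short_code) % N` routing named in the schema comment can be sketched as follows. This is a minimal illustration, not the article's implementation: the `ShardRouter` class is hypothetical, and CRC32 is an assumed hash function, since the source does not say which hash is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: maps a short code to one of N shards.
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // CRC32 is an assumption; any stable, well-distributed hash would do.
    int shardFor(String shortCode) {
        CRC32 crc = new CRC32();
        crc.update(shortCode.getBytes(StandardCharsets.UTF_8));
        // getValue() is always non-negative, so the modulo is a valid index.
        return (int) (crc.getValue() % shardCount);
    }
}
```

Because the shard depends only on the code's hash, consecutive creations spread across shards instead of piling onto the newest partition.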
Redis key layout:
| Key pattern | Value | TTL |
| redirect:{short_code} | {longUrl}\|{expiresEpoch} packed string | Mirrors expires_at; 30 days if no expiry |
| counter:url_shortener | Auto-increment integer | No TTL; persists forever |
| alias_reserved:{slug} | "1" (existence flag) | Mirrors alias expiry; deleted when link expires |
Why store expiresEpoch in the Redis value? The redirect service can check expiry without a DB round-trip: parse the epoch (in seconds) from the cached value, compare it to the current epoch second (for example, Instant.now().getEpochSecond()), and return 410 if expired, even on a cache hit.
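The packed value and the expiry comparison described above can be sketched as a small codec. This is an illustration under stated assumptions: `CacheValue` is a hypothetical class, and splitting on the last pipe is a choice made here so that a pipe inside the URL does not break parsing.

```java
import java.time.Instant;

// Hypothetical codec for the packed "longUrl|expiresEpoch" cache value.
final class CacheValue {
    final String longUrl;
    final long expiresEpoch; // epoch seconds; 0 means no expiry

    CacheValue(String longUrl, long expiresEpoch) {
        this.longUrl = longUrl;
        this.expiresEpoch = expiresEpoch;
    }

    String pack() {
        return longUrl + "|" + expiresEpoch;
    }

    // Splits on the LAST pipe, so a pipe inside the URL stays part of the URL.
    static CacheValue parse(String packed) {
        int sep = packed.lastIndexOf('|');
        if (sep < 0) {
            return new CacheValue(packed, 0L);
        }
        long epoch = Long.parseLong(packed.substring(sep + 1));
        return new CacheValue(packed.substring(0, sep), epoch);
    }

    boolean isExpiredAt(Instant now) {
        return expiresEpoch > 0 && now.getEpochSecond() > expiresEpoch;
    }
}
```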
Estimations
Assumptions for a Bitly-scale service:
| Dimension | Assumption | Derivation |
| Link creations | 100M new links/day | ≈ 1,160 writes/sec steady; 10K/sec burst |
| Redirects | 10B/day | ≈ 115K reads/sec steady; 500K/sec burst |
| Read : Write ratio | ~100 : 1 | Drives cache-first architecture |
| URL mapping row size | ~500 bytes | short_code (7B) + long_url (avg 200B) + metadata |
| Storage growth | 50 GB/day | 100M × 500B; 1-year retention ≈ 18 TB |
| Hot key cache | ~10 GB Redis | Top 1% of codes handle 80% of traffic (Zipf distribution) |
| Kafka throughput | 115K events/sec | One event per redirect; avg message size ~200B |
Key insight: >99% of traffic is reads. The redirect cache hit ratio is the single most important operational metric: a 1% drop in hit rate at 115K RPS means 1,150 extra DB queries per second.
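The arithmetic behind the table can be reproduced in a few lines. This is only a worked check of the stated assumptions; the `CapacityEstimate` class and its method names are hypothetical.

```java
// Reproduces the estimation table's arithmetic from its stated assumptions.
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    // Average events per second for a daily total (integer division).
    static long perSecond(long perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    // Raw mapping-table growth per day in bytes.
    static long storageBytesPerDay(long rowsPerDay, long bytesPerRow) {
        return rowsPerDay * bytesPerRow;
    }

    // DB reads per second caused by a given cache miss rate.
    static double dbReadsPerSecond(double redirectsPerSecond, double missRate) {
        return redirectsPerSecond * missRate;
    }
}
```

With the table's inputs, `perSecond(100_000_000L)` gives 1,157 writes/sec, `perSecond(10_000_000_000L)` gives 115,740 redirects/sec, and `storageBytesPerDay(100_000_000L, 500L)` gives 50 GB/day, or about 18.25 TB over 365 days.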
Design Goals
| Goal | Why it matters | Design decision it drives |
| Code uniqueness is non-negotiable | One code mapping to two different URLs breaks user trust permanently | INCR counter over MD5: no collision retry loop needed |
| Redirect must never block on analytics | Click recording is best-effort; a blocking Kafka publish must not delay the 302 response | Fire-and-forget async publish; never await the Kafka future |
| Cache hit rate ≥ 99% | At 115K redirects/sec, a 2% miss rate = 2,300 DB reads/sec, which risks PostgreSQL saturation | Write-through on creation; LRU with enough memory for the hot 1% of codes |
| Expired codes must return 410, not 404 | 404 = "never existed"; 410 = "existed but gone"; this matters for SEO crawlers and link-rot debugging | Soft-delete with deleted_at; expiry check on every response |
| Custom alias collisions must be rejected | Silent alias overwrite routes users to the wrong destination | Atomic INSERT guarded by the primary-key constraint (e.g., INSERT ... ON CONFLICT DO NOTHING); return 409 Conflict if the slug is taken |
Performance Analysis
| Pressure point | Symptom | First response | Second response |
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |
URL shortener health signals (the five metrics that matter most):
| Metric | Alert threshold | What it reveals |
| redirect.cache_hit_ratio | < 99% | Redis memory pressure, cold-start spike, or cache poisoning |
| redirect.p95_latency_ms (cache path) | > 15ms | Slow Redis connection pool or Lettuce thread starvation |
| redirect.p95_latency_ms (DB fallback) | > 80ms | PostgreSQL shard hot-spot or connection pool exhaustion |
| kafka.consumer_lag (click events) | > 10K events | Flink under-provisioned; analytics freshness degrading |
| expiry_worker.deleted_per_run | > 500K rows | Expiry backlog growing; increase worker frequency or batch size |
High Level Design - Architecture for Functional Requirements
Building Blocks
| Component | Responsibility | Technology choice |
| API Gateway | Rate limiting, auth, TLS termination | Nginx / AWS API Gateway |
| Shortener Service | Code generation, alias validation, DB write, cache warm-up | Spring Boot stateless pods |
| Redirect Service | Cache-first lookup, 302 response, click event emit | Spring Boot (ultra-low latency) |
| Metadata Store | Durable (short_code → long_url, expiry, owner) mapping | PostgreSQL, sharded by hash(short_code) |
| Cache Tier | Hot-path redirect resolution; TTL mirrors link expiry | Redis Cluster |
| Analytics Pipeline | Per-redirect click events → aggregated metrics | Kafka → Flink → ClickHouse |
| Expiry Worker | Soft-delete expired codes; evict from cache | Scheduled Spring Batch job |
Design the APIs
Create short link:
POST /shorten
Content-Type: application/json
Idempotency-Key: <uuid>
{
"longUrl": "https://example.com/very/long/path?ref=campaign&utm_source=email",
"alias": "launch-2025", // optional custom slug
"expiresAt": "2025-12-31T23:59:59Z" // optional TTL
}
HTTP 201 Created
{
"shortCode": "aB3kZ9",
"shortUrl": "https://tiny.example.com/aB3kZ9",
"expiresAt": "2025-12-31T23:59:59Z"
}
Redirect:
GET /aB3kZ9
HTTP 302 Found
Location: https://example.com/very/long/path?ref=campaign&utm_source=email
Click analytics:
GET /analytics/aB3kZ9?from=2025-01-01&to=2025-01-31
HTTP 200 OK
{
"shortCode": "aB3kZ9",
"totalClicks": 84201,
"topReferrers": ["twitter.com", "email", "direct"],
"topCountries": ["US", "IN", "GB"],
"clicksPerDay": [{ "date": "2025-01-01", "count": 3120 }, ...]
}
Key design decisions:
- Idempotency-Key header on POST ensures safe retries without duplicate code creation
- 301 Permanent vs 302 Temporary redirect: use 302 so analytics can track future clicks (a 301 may be cached by browsers, which then bypass the shortener)
- Analytics API reads from ClickHouse (eventual), not PostgreSQL (strong); the latency vs freshness trade-off is explicit
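The Idempotency-Key behaviour can be illustrated with an in-memory sketch. This is an assumption-laden stand-in: `IdempotentCreator` is a hypothetical class, and a real deployment would need a store shared across pods with a TTL on the keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// In-memory stand-in for an idempotency store (hypothetical, single process only).
final class IdempotentCreator {
    private final ConcurrentMap<String, String> codeByKey = new ConcurrentHashMap<>();

    // Runs createCode at most once per idempotency key; retries get the first result.
    String createOnce(String idempotencyKey, Supplier<String> createCode) {
        return codeByKey.computeIfAbsent(idempotencyKey, k -> createCode.get());
    }
}
```

A client that retries POST /shorten with the same key after a timeout would receive the original short code rather than consuming a second counter value.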
Communication Between Components
| Path | Protocol | Why |
| Client → Redirect Service | HTTPS (synchronous) | User-facing; must be fast and reliable |
| Redirect Service → Redis | TCP (synchronous) | Cache hit exits here; no DB touch |
| Redirect Service → PostgreSQL | JDBC (synchronous, fallback only) | Cold miss; DB read + write-through back to Redis |
| Redirect Service → Kafka | Async publish (fire-and-forget) | Click event; never blocks the redirect response |
| Kafka → Flink → ClickHouse | Streaming | Analytics aggregation; eventual consistency acceptable |
| Expiry Worker → PostgreSQL + Redis | Batch | Scheduled soft-delete of expired codes (every 15 minutes); Redis key eviction |
Data Flow
Write path (create short link):
Client
  → POST /shorten
  → Shortener Service
      → Redis INCR counter:url_shortener   (atomic, globally unique ID)
      → Base62 encode(id)                  (6-char short code)
      → PostgreSQL INSERT                  (durable mapping)
      → Redis SET redirect:{code}          (write-through cache warm-up)
  → 201 { shortCode, shortUrl }
Read path (redirect):
Client
  → GET /{shortCode}
  → Redirect Service
      → Redis GET redirect:{shortCode}   (cache hit: ~99% of requests)
          → 302 redirect immediately
      → PostgreSQL SELECT                (cache miss only)
          → Redis SET                    (write-through)
          → 302 redirect
      → Kafka publish click event        (async, fire-and-forget)
Analytics path:
Kafka topic: url.redirect.events
  → Flink streaming job
  → Aggregate: clicks per code, referrer, geo
  → ClickHouse (columnar, append-optimised)
  → Analytics API (GET /analytics/{code})
flowchart TD
A[Client] -->|POST /shorten| B[Shortener Service]
B -->|INCR + Base62| C[Redis Counter]
B -->|INSERT mapping| D[PostgreSQL]
B -->|SET redirect:code| E[Redis Cache]
B -->|201 shortCode| A
F[Client] -->|GET shortCode| G[Redirect Service]
G -->|GET redirect:code| E
E -->|cache hit 99%| H[302 Redirect]
G -->|cache miss| D
D --> G
G -->|fire-and-forget| I[Kafka]
I --> J[Flink → ClickHouse]
Real-World Applications: API Mapping
This architecture pattern powers multiple production systems. Here is how each URL shortener feature maps to the design:
| Feature | Real-world example | Design element |
| Create short link | Bitly API, Twitter t.co auto-shortening | POST /shorten → INCR + Base62 |
| Redirect | All short links: bit.ly/xyz → destination | Redis cache-first GET + 302 |
| Custom alias | ow.ly/brand-launch campaign links | Alias field + collision check |
| Expiry | QR codes for time-limited promotions | expiresAt + eviction worker |
| Click analytics | Bitly dashboard, campaign UTM tracking | Kafka → Flink → ClickHouse |
301 vs 302, a decision that affects revenue: a shortener that depends on click analytics typically issues a 302 (temporary redirect) even though most links never change. A 301 may be cached by browsers, which then bypass the shortener on repeat visits; that means no analytics for those clicks and no reliable way to update the destination URL after creation. The performance cost of the extra hop is 10–50ms; the business value of every trackable click justifies it.
Feature Deep Dive: Custom Alias
A custom alias is a user-chosen slug (e.g., bit.ly/launch-2025) instead of the auto-generated 6-char code. The challenge is collision detection at write time without a race condition.
The naive approach fails: read-then-write has a race window where two callers check the same slug simultaneously, both see "available", and both succeed, overwriting each other.
The correct approach: use a database-level unique constraint and let the INSERT fail atomically:
public ShortenResult shorten(ShortenRequest req) {
    String shortCode;
    if (req.alias() != null) {
        shortCode = validateAlias(req.alias());   // length, charset check
    } else {
        Long id = redis.opsForValue().increment("counter:url_shortener");
        shortCode = base62Encode(id);
    }

    try {
        // PostgreSQL enforces uniqueness via the PRIMARY KEY, so two concurrent inserts cannot both succeed
        database.save(new UrlMapping(shortCode, req.longUrl(), req.expiresAt(), req.alias() != null));
    } catch (DataIntegrityViolationException e) {
        // Another request claimed this code first: surface a clean 409
        throw new AliasTakenException(shortCode);
    }

    // Write-through cache warm-up
    String cacheValue = buildCacheValue(req.longUrl(), req.expiresAt());
    redis.opsForValue().set("redirect:" + shortCode, cacheValue, ttlFor(req.expiresAt()));

    if (req.alias() != null) {
        // Reserve alias flag for fast collision check on future creates
        redis.opsForValue().set("alias_reserved:" + shortCode, "1", ttlFor(req.expiresAt()));
    }

    return new ShortenResult(shortCode, buildShortUrl(shortCode));
}
Custom alias rules enforced at the API layer (before the DB hit):
- Length: 4–12 characters
- Charset: [A-Za-z0-9-_] only; no spaces, no special characters
- Blocklist: reserved slugs like api, analytics, admin, health are rejected with 422
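The three rules above can be sketched as a single validator. This is a minimal illustration rather than the article's `validateAlias`: the `AliasValidator` class is hypothetical, and treating the blocklist as case-insensitive is an assumption made here.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical validator for the alias rules: length 4-12, restricted charset, blocklist.
final class AliasValidator {
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9_-]{4,12}");
    private static final Set<String> RESERVED = Set.of("api", "analytics", "admin", "health");

    static boolean isValid(String alias) {
        if (alias == null || !ALLOWED.matcher(alias).matches()) {
            return false;
        }
        // Assumption: reserved slugs are matched case-insensitively.
        return !RESERVED.contains(alias.toLowerCase(Locale.ROOT));
    }
}
```

A caller would map a `false` result to the 422 response before any database work is attempted.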
Feature Deep Dive: Expiry Support
Expiry has two independent halves: serving expired codes correctly and cleaning up expired rows in the background.
Serving Expired Codes
The redirect service must detect expiry even on a cache hit, because the cache TTL is not guaranteed to line up exactly with the link's expires_at (for example, when the TTL was computed against a different clock or set to the 30-day default). The solution is to embed the expiry epoch inside the cached value:
// Cache value format: "https://example.com/long/url|1735689600"
//                      long URL                     | expires_at epoch seconds (0 = no expiry)
public RedirectResult resolve(String shortCode) {
    String cached = redis.opsForValue().get("redirect:" + shortCode);
    if (cached != null) {
        // Split on the last '|' so a pipe inside the URL cannot corrupt the parse
        int sep = cached.lastIndexOf('|');
        String longUrl = sep >= 0 ? cached.substring(0, sep) : cached;
        long expiresEpoch = sep >= 0 ? Long.parseLong(cached.substring(sep + 1)) : 0;
        if (expiresEpoch > 0 && Instant.now().getEpochSecond() > expiresEpoch) {
            // Code is cached but has expired: evict and return 410
            redis.delete("redirect:" + shortCode);
            throw new LinkExpiredException(shortCode);
        }
        return new RedirectResult(longUrl);
    }

    // Cache miss: load the row so an expired link yields 410 rather than 404.
    // Assumes findByShortCode also returns soft-deleted rows and expiresAt() is an Instant.
    UrlMapping m = database.findByShortCode(shortCode)
        .orElseThrow(() -> new ShortCodeNotFoundException(shortCode));
    if (m.expiresAt() != null && Instant.now().isAfter(m.expiresAt())) {
        throw new LinkExpiredException(shortCode);
    }
    redis.opsForValue().set("redirect:" + shortCode, buildCacheValue(m), ttlFor(m.expiresAt()));
    return new RedirectResult(m.longUrl());
}
Background Expiry Worker
Redis TTL handles eviction from the cache layer automatically. PostgreSQL requires a background job to soft-delete expired rows:
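-- Runs every 15 minutes via Spring Batch or a cron job
-- Processes in batches of 10K to avoid long-running transactions
-- PostgreSQL's UPDATE has no LIMIT clause, so the batch is selected in a subquery
UPDATE url_mappings
SET deleted_at = NOW()
WHERE short_code IN (
    SELECT short_code
    FROM url_mappings
    WHERE expires_at < NOW()
      AND deleted_at IS NULL
    LIMIT 10000
);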
Why soft-delete instead of hard DELETE? Soft-delete preserves the row for audit trails, abuse investigations, and link-rot debugging. Hard DELETE runs after a 90-day grace period.
Expiry lifecycle summary:
Link created with expiresAt=T
  → PostgreSQL row: expires_at=T, deleted_at=NULL
  → Redis key: TTL set to (T - now) seconds
At time T:
  → Redis: key is removed on next access (passive expiry) or by the background sampler (active expiry)
  → PostgreSQL: still has expires_at=T, deleted_at=NULL (background worker hasn't run yet)
Redirect request arrives at T+5min (before worker runs):
  → Cache hit: epoch check detects expiry → 410, evict key
  → Cache miss: DB query finds row, expires_at < NOW() → 410
Background worker runs at T+15min:
  → Sets deleted_at=NOW() on all expired rows
  → Hard DELETE runs at T + 90 days
Feature Deep Dive: Click Analytics Pipeline
Every redirect emits a lightweight event to Kafka. The analytics pipeline is intentionally decoupled from the redirect path: a Flink cluster failure should never affect redirect latency.
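One way to make that decoupling explicit is a bounded, non-blocking buffer between the redirect handler and a background publisher. The sketch below is an assumption, not the article's design: `ClickEventBuffer` is a hypothetical class, events are plain JSON strings, and the actual Kafka producer call is left to the background thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-process buffer between the redirect handler and a background publisher.
final class ClickEventBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    ClickEventBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // offer() returns immediately; when the buffer is full the event is dropped, never awaited.
    boolean record(String eventJson) {
        boolean accepted = queue.offer(eventJson);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // A background thread would call this and hand the event to the producer; null when empty.
    String poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

The redirect path pays only for an in-memory `offer`; if the pipeline stalls, events are counted as dropped instead of delaying the 302.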
Kafka Event Schema
{
"eventType": "url.redirect",
"shortCode": "aB3kZ9",
"timestamp": "2025-01-15T14:23:11.442Z",
"referrer": "twitter.com",
"userAgent": "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0)",
"deviceFamily": "mobile",
"countryCode": "US",
"ipGeoHash": "9q8yy"
}
- ipGeoHash instead of raw IP: geo-hashed to cell level (a 5-character geohash covers roughly 5 km × 5 km) at the redirect service before emission; raw IPs never leave the request context.
- Topic partitioning key: shortCode. All events for the same code land on the same Kafka partition, which keeps per-code ordering and simplifies stateful aggregation downstream.
Flink Aggregation Job
Kafka source (url.redirect.events)
  → KeyBy(shortCode)
  → TumblingWindow(1 minute)
  → Aggregate: clickCount, topReferrers (top-K), topCountries (top-K)
  → Sink → ClickHouse table: click_events_by_minute
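The windowing step can be illustrated without Flink. The sketch below is plain Java, not Flink API code: `MinuteWindowCounter` is a hypothetical class that only shows how an event timestamp is aligned to a 1-minute tumbling window and counted per short code.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of 1-minute tumbling-window counting; not Flink API code.
final class MinuteWindowCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Aligns an epoch-second timestamp to the start of its 1-minute window.
    static long windowStart(long epochSecond) {
        return epochSecond - Math.floorMod(epochSecond, 60L);
    }

    // Counts one click for the (shortCode, window) pair.
    void add(String shortCode, long epochSecond) {
        counts.merge(shortCode + "@" + windowStart(epochSecond), 1L, Long::sum);
    }

    long count(String shortCode, long windowStartEpoch) {
        return counts.getOrDefault(shortCode + "@" + windowStartEpoch, 0L);
    }
}
```

Events at seconds 120 and 179 fall into the window starting at 120, while an event at second 180 opens the next window.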
ClickHouse:
CREATE TABLE click_events_by_minute (
short_code String,
window_start DateTime,
click_count UInt64,
top_referrers Array(String),
top_countries Array(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(window_start)
ORDER BY (short_code, window_start);
Why ClickHouse? It's a columnar OLAP store purpose-built for aggregation queries over append-only event data. A query like "total clicks for aB3kZ9 in January" scans only the click_count column for the matching short_code, not every column in every row.
Trade-offs & Failure Modes (Design Deep Dive for Non Functional Requirements)
Scaling Strategy
| Bottleneck | Symptom | Fix |
| Redis INCR single instance | INCR latency spikes at > 500K writes/sec | Partition counter into 1024 shards; append shard ID to code |
| PostgreSQL write hotspot | INSERT p99 grows as table crosses 1B rows | Shard by hash(short_code) % N; never by timestamp |
| Cache eviction churn | Redirect miss rate climbs | Increase Redis memory; tune maxmemory-policy allkeys-lru |
| Kafka consumer lag | Click events delayed > 60s | Add Flink parallelism; scale consumer group partitions |
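The counter-sharding fix in the first row can be sketched as follows. This is one possible interpretation, not the article's implementation: `ShardedCounter` is a hypothetical in-memory stand-in for one Redis counter key per shard, and the shard ID is folded into the numeric ID (before Base62 encoding) rather than appended as text.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// In-memory stand-in for N independent counters (one Redis key per shard in a real deployment).
final class ShardedCounter {
    private final int shards;
    private final AtomicLongArray counters;

    ShardedCounter(int shards) {
        this.shards = shards;
        this.counters = new AtomicLongArray(shards);
    }

    // Folds the shard ID into the low part of the ID, so IDs from different shards never collide:
    // id % shards recovers the shard, id / shards recovers that shard's local counter.
    long nextId(int shard) {
        long local = counters.incrementAndGet(shard);
        return local * shards + shard;
    }
}
```

Each shard increments independently, so no single counter key sees the full write rate, and uniqueness still holds without a collision check.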
Availability and Resilience
- Redirect Service is stateless: horizontal scale behind a load balancer; any pod can serve any request
- Redis failure fallback: if Redis is unreachable, fall back to PostgreSQL directly; accept latency degradation, not downtime
- PostgreSQL replication: synchronous replica for failover; read replicas for analytics queries (not the redirect path)
- Circuit breaker: wrap PostgreSQL calls with Resilience4j; open circuit after 5 consecutive failures; return 503 rather than hanging
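The "open after 5 consecutive failures" rule can be illustrated with a hand-rolled breaker. This sketch is deliberately not the Resilience4j API: `SimpleCircuitBreaker` is a hypothetical class, the cool-down period is an assumed parameter, and time is passed in explicitly to keep the example deterministic.

```java
// Minimal consecutive-failure breaker for illustration; not the Resilience4j API.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1L; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // True when calls may proceed: the circuit is closed, or the cool-down has elapsed.
    synchronized boolean allowRequest(long nowMillis) {
        return openedAt < 0 || nowMillis - openedAt >= openMillis;
    }

    // A success closes the circuit and clears the failure streak.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1L;
    }

    // Once the streak reaches the threshold, every further failure (re)opens the circuit.
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis;
        }
    }
}
```

When `allowRequest` returns false, the redirect service would answer 503 immediately instead of waiting on a database that is already failing.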
Storage and Caching
| Layer | What it stores | Eviction policy |
| Redis (hot) | redirect:{code} → longUrl for top 1% of codes | LRU + per-key TTL matching link expiry |
| PostgreSQL | All URL mappings; source of truth | Soft-delete on expiry; hard-delete after 90-day grace |
| ClickHouse | Aggregated click events per code/day | Partition by month; drop partitions > 2 years |
Consistency, Security, and Monitoring
Consistency model by operation:
- Short code creation: strong consistency; the PostgreSQL write must succeed before returning shortCode to the caller
- Redirect resolution: eventual (Redis) to strong (PostgreSQL fallback); stale-while-revalidate acceptable for non-expired codes
- Click counts: eventual; Flink aggregation lag of 5–30 seconds is acceptable
Security:
- Rate-limit POST /shorten per API key (100 req/min) to prevent code-space exhaustion attacks
- Validate destination URLs against a phishing/malware blocklist on creation
- Encrypt owner_id in the mapping table; do not expose creator identity via the redirect path
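The per-key rate limit in the first bullet could look like the fixed-window sketch below. It is an illustration under assumptions: `FixedWindowRateLimiter` is a hypothetical class, state is kept in memory (a multi-pod deployment would need a shared store), and the article does not say which limiting algorithm is used.

```java
import java.util.HashMap;
import java.util.Map;

// Fixed-window limiter kept in memory for illustration only.
final class FixedWindowRateLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, long[]> state = new HashMap<>(); // apiKey -> {windowStart, count}

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request fits in the current window for this API key.
    synchronized boolean tryAcquire(String apiKey, long nowMillis) {
        long windowStart = nowMillis - Math.floorMod(nowMillis, windowMillis);
        long[] s = state.computeIfAbsent(apiKey, k -> new long[] {windowStart, 0L});
        if (s[0] != windowStart) { // a new window has begun: reset the count
            s[0] = windowStart;
            s[1] = 0L;
        }
        if (s[1] >= limit) {
            return false;
        }
        s[1]++;
        return true;
    }
}
```

With `new FixedWindowRateLimiter(100, 60_000L)`, the 101st call for one key inside the same minute is refused, and the count resets when the next minute starts.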
Key SLO signals to monitor:
- redirect.cache_hit_ratio: alert if it drops below 99%
- redirect.p95_latency_ms: alert if cache-hit p95 exceeds 15ms
- kafka.consumer_lag: alert if click event lag exceeds 60s
- postgres.connection_pool_utilization: alert above 70% to catch cold-miss storms early
Decision Guide
| Situation | Recommendation |
| Base62 vs MD5 for code generation | Base62(INCR): no collision loop, shorter codes, monotonic |
| 301 vs 302 redirect | 302: preserves analytics; 301 breaks click tracking |
| Cache write-through vs cache-aside | Write-through: warm cache on creation; first redirect never hits DB |
| Sharding by timestamp vs short_code | Shard by hash(short_code): avoids hot recent-writes partition |
| Strong vs eventual for analytics | Eventual: click counts lagging by the Flink delay (5–30s) are acceptable for dashboards |
| Custom alias collision strategy | Reject with 409 + suggest alternatives; never silent override |
Practical Example for Interview Delivery
A repeatable way to deliver this design in interviews:
- Start with actors, use cases, and scope boundaries.
- State estimation assumptions (QPS, payload size, storage growth).
- Draw HLD and explain each component responsibility.
- Walk through one failure cascade and mitigation strategy.
- Describe phase-based evolution for 10x traffic.
Question-specific practical note:
- Use cache-first redirects, write-through mapping, and background click aggregation to protect the primary database.
A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."
Advanced Concepts for Production Evolution
When interviewers ask follow-up scaling questions, use a phased approach:
- Stabilize critical path dependencies with better observability.
- Increase throughput by isolating heavy side effects asynchronously.
- Reduce hotspot pressure through key redesign and repartitioning.
- Improve resilience using automated failover and tested runbooks.
- Expand to multi-region only when latency, compliance, or reliability targets require it.
This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.
Atomic INCR, Base62 Encoding, and Write-Through Cache: Three Decisions on One Critical Path
Three architectural decisions define the URL shortener's write and read paths:
- Redis INCR for ID generation: atomic across all instances, no collision detection loop needed
- Base62 encoding: converts the integer to a 6-character code (62⁶ ≈ 56 billion unique codes) rather than MD5, which requires a collision detection loop and produces longer outputs
- Write-through cache warm-up: the first redirect request hits Redis, not PostgreSQL
// Write path: atomic counter → Base62 short code → DB persist → cache warm-up
public String shorten(String longUrl) {
    // INCR is atomic across all gateway instances: globally unique, no collision check
    Long id = redis.opsForValue().increment("counter:url_shortener");
    String shortCode = base62Encode(id);  // 62^6 ≈ 56B codes at 6 chars vs MD5's 32-char hex digest
    database.save(new UrlMapping(shortCode, longUrl, Instant.now()));
    // Write-through: warm cache immediately so the first redirect never touches the DB
    redis.opsForValue().set("redirect:" + shortCode, longUrl, Duration.ofDays(30));
    return shortCode;
}

// Read path: >99% of redirects exit at the Redis check; DB is the cold-miss fallback
public String resolve(String shortCode) {
    String cached = redis.opsForValue().get("redirect:" + shortCode);
    if (cached != null) return cached;
    return database.findByShortCode(shortCode)
        .map(m -> {
            redis.opsForValue().set("redirect:" + shortCode, m.longUrl(), Duration.ofDays(30));
            return m.longUrl();
        })
        .orElseThrow(() -> new ShortCodeNotFoundException(shortCode));
}
private String base62Encode(long id) { /* 0-9A-Za-z left-to-right encoding */ ... }
The INCR counter is why Base62 beats MD5 here: an MD5-derived code, once truncated to a short length, can collide and must be checked for uniqueness on every write (a database round-trip), whereas a monotonically increasing counter is unique by construction. Note that the synchronous opsForValue().increment(...) call shown above still waits for the Redis reply on the calling thread; Lettuce multiplexes commands over non-blocking I/O, so a fully non-blocking write path at 100K RPS would use its asynchronous or reactive API.
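The elided `base62Encode` helper could be filled in along these lines. This is a sketch, not the author's code: the `Base62` class, the `decode` method, and the left-padding in `encodePadded` are assumptions; only the `0-9A-Za-z` alphabet order comes from the comment in the listing.

```java
// Hypothetical Base62 codec using the 0-9A-Za-z alphabet named in the listing's comment.
final class Base62 {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }

    // Left-pads with '0' to a fixed width (an assumption; the source does not describe padding).
    static String encodePadded(long id, int width) {
        String s = encode(id);
        return "0".repeat(Math.max(0, width - s.length())) + s;
    }

    static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

The largest 6-character code, `zzzzzz`, corresponds to 62⁶ − 1 = 56,800,235,583. At the estimation table's 100M new links per day, that space would last roughly 568 days, so a longer code or a reclaim policy would be needed beyond that.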
For a full deep-dive on Base62 ID generation strategies, counter sharding for write throughput, and cache-first redirect path optimisation, a dedicated follow-up post is planned.
Lessons Learned
- Start with actors and use cases before drawing any diagram.
- Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
- Convert NFRs into measurable SLO-style targets.
- Separate functional HLD from non-functional deep dive reasoning.
- Scale the first measured bottleneck, not the most visible component.
TLDR: Summary & Key Takeaways
- Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
- Good HLDs explain both request flow and state update flow.
- Non-functional architecture determines reliability under pressure.
- Phase-based evolution outperforms one-shot overengineering.
- Theory-linked reasoning improves consistency across different interview prompts.
Practice Quiz
- Why should system design answers begin with actors and use cases?
A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements
Correct Answer: B
- Which section should define p95 and p99 targets?
A) Non Functional Requirements
B) Only the quiz section
C) Only the related posts section
Correct Answer: A
- What is the primary benefit of separating synchronous and asynchronous paths?
A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs
Correct Answer: B
- Open-ended challenge: for this design, which component would you scale first at 10x traffic and which metric would you use to justify that decision?