
System Design HLD Example: URL Shortener (TinyURL and Bitly)

A practical interview-ready HLD for a short-link platform with heavy read traffic.

Abstract Algorithms · 21 min read

TLDR: Design a URL shortener like TinyURL or Bitly: a service that converts long links into compact aliases while serving extremely fast redirect reads at scale. This article follows a standard system design interview template: use cases, requirements, estimations, design goals, HLD, and design deep dive.

Twitter's early URL sharing created a real problem: long URLs broke tweet character limits and concealed link destinations from users. A URL shortener looks deceptively simple until you need to serve 100 million redirects per day with sub-10ms latency and guarantee zero short-code collisions under 10K writes per second; at that point a naive random string generator with a uniqueness check becomes a database bottleneck.

Understanding how to design a URL shortener teaches you core patterns shared by nearly every high-scale read-heavy system: ID generation strategies, cache-first redirect paths, database sharding, and the write amplification trade-off between counter-based and hash-based code generation.

By the end of this walkthrough you'll know why a base62-encoded auto-increment counter beats MD5 hashing (no collision detection loop required, shorter output codes), why a Redis redirect cache in front of your database is non-negotiable at 10K RPS (over 99% of traffic is reads, not writes), and why you'd shard the URL mapping table by hash(short_code) rather than creation timestamp to avoid hot partition accumulation.

📖 Use Cases

Actors

| Actor | Role |
|---|---|
| End user | Clicks a short link; expects instant redirect with no visible latency |
| Link creator | Submits a long URL via API or dashboard; receives a 6-char short code |
| Brand / marketer | Requests a custom alias (vanity slug) for campaign tracking |
| Analytics consumer | Reads aggregated click data: counts, referrer, geo, device |
| Platform service | Enforces rate limits, spam detection, and expiry eviction |

Use Cases

  • Primary interview prompt: Design a URL shortener like TinyURL or Bitly.
  • Core user journeys:
    • Create short link - POST a long URL; system generates a unique 6-char Base62 code and returns it
    • Redirect by code - GET /{shortCode} resolves to the original URL and issues a 301/302 redirect
    • Optional custom alias - creator specifies a preferred slug (e.g., bit.ly/launch-2025); system checks for collision
    • Expiry support - creator sets a TTL or hard expiry date; expired codes return 410 Gone
    • Click analytics - every redirect emits an async event capturing timestamp, referrer, IP geo-hash, and user-agent family
  • Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.

This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.

🔍 Functional Requirements

In Scope

  • Short link creation - POST /shorten with long URL; returns short code with idempotency key support
  • Redirect - GET /{shortCode} with cache-first lookup; sub-10ms p95 on cache hit
  • Custom alias - optional user-defined slug with uniqueness validation
  • Expiry - TTL-based and hard-date expiry; expired codes return 410 Gone
  • Click analytics - async event per redirect (timestamp, referrer, geo-hash, device family)

Out of Scope (v1 boundary)

  • Real-time analytics dashboards - async pipeline is built; UI is out of scope
  • Full global active-active writes across every region
  • A/B test link variants or personalized redirects
  • QR code generation (can be a thin adapter on top of the short code)

Functional Breakdown

| Feature | API Contract | Key Decision |
|---|---|---|
| Create short link | POST /shorten → { shortCode, shortUrl, expiry } | Base62(INCR counter) vs MD5 hash |
| Redirect by code | GET /{shortCode} → 302 Location: <longUrl> | Redis cache-first; DB fallback |
| Custom alias | POST /shorten with alias field | Collision check on writes; reject or retry |
| Expiry support | expiresAt field on create; eviction job or Redis TTL | Soft delete vs hard TTL |
| Click analytics | Fire-and-forget Kafka event on every redirect | Async; never blocks the redirect path |

Initial building blocks:

  • Code generation service - Redis INCR counter + Base62 encoding → guaranteed-unique 6-char codes
  • Redirect API - stateless HTTP layer; cache-first lookup with DB fallback and write-through warm-up
  • Metadata store - PostgreSQL table (short_code PK, long_url, created_at, expires_at, owner_id)
  • Cache tier - Redis GET redirect:{code} → 99%+ hit rate on hot codes; TTL mirrors link expiry
  • Async analytics pipeline - Kafka topic url.redirect.events → Flink consumer → ClickHouse aggregation
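The Base62 step in the code generation service above can be sketched in a few lines. This is an illustrative standalone version; the class name and the zero-padding choice are assumptions (the post only specifies 6-char auto codes):

```java
// Illustrative Base62 encoder for the counter-based code generation service.
// Assumption: auto-generated codes are left-padded with '0' to a fixed 6 chars.
public final class Base62 {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    public static String encode(long id) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least significant digit first
            id /= 62;
        } while (id > 0);
        while (sb.length() < 6) sb.append('0');          // pad, then reverse below
        return sb.reverse().toString();
    }
}
```

62⁶ ≈ 56.8 billion codes, so 6 characters comfortably cover the 100M-links/day estimate for years.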

A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.

โš™๏ธ Non Functional Requirements

| Dimension | Target | Why it matters for a URL shortener |
|---|---|---|
| Scalability | 10K writes/sec; 100K redirects/sec peak | Read:write ratio ≈ 100:1; scale read and write paths independently |
| Availability | 99.99% on the redirect path; 99.9% on creation | A broken redirect is a user-facing failure; creation can tolerate brief outages |
| Performance | Redirect p95 < 10ms (cache hit) / < 50ms (DB fallback); creation p95 < 100ms | Users expect near-instant redirects |
| Consistency | Short code → URL: strong (creation must be globally unique); click counts: eventual | Correctness on writes; throughput on analytics reads |
| Operability | Cache hit ratio, redirect p95/p99, Kafka consumer lag, DB row growth | Core SLO signals that drive scaling decisions |

Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.

🧠 Deep Dive: Estimations and Design Goals

The Internals: Data Model

The primary table is url_mappings. Every short code, whether auto-generated or a custom alias, maps to exactly one long URL with optional expiry.

-- PostgreSQL schema; table is sharded by hash(short_code) % N
CREATE TABLE url_mappings (
    short_code   VARCHAR(12)  PRIMARY KEY,   -- 6 chars (auto) or up to 12 (alias)
    long_url     TEXT         NOT NULL,
    owner_id     UUID         NOT NULL,
    created_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    expires_at   TIMESTAMPTZ,                -- NULL means no expiry
    deleted_at   TIMESTAMPTZ,                -- NULL means active; set on expiry/delete
    is_custom    BOOLEAN      NOT NULL DEFAULT FALSE
);

-- Expiry worker uses this index to batch-find expired codes efficiently
CREATE INDEX idx_expires ON url_mappings (expires_at)
    WHERE expires_at IS NOT NULL AND deleted_at IS NULL;

Why VARCHAR(12) and not VARCHAR(6)? Auto-generated codes are always 6 chars. Custom aliases can be up to 12. One column covers both without a separate table.

Redis key layout:

| Key pattern | Value | TTL |
|---|---|---|
| redirect:{short_code} | `{longUrl}\|{expiresEpoch}` packed string | Mirrors expires_at; 30 days if no expiry |
| counter:url_shortener | Auto-increment integer | No TTL; persists forever |
| alias_reserved:{slug} | "1" (existence flag) | Mirrors alias expiry; deleted when link expires |

Why store expiresEpoch in the Redis value? The redirect service can check expiry without a DB round-trip: parse the epoch from the cached value, compare to System.currentTimeMillis(), and return 410 if expired, even on a cache hit.

Estimations

Assumptions for a Bitly-scale service:

| Dimension | Assumption | Derivation |
|---|---|---|
| Link creations | 100M new links/day | ≈ 1,160 writes/sec steady; 10K/sec burst |
| Redirects | 10B/day | ≈ 115K reads/sec steady; 500K/sec burst |
| Read : Write ratio | ~100 : 1 | Drives cache-first architecture |
| URL mapping row size | ~500 bytes | short_code (7B) + long_url (avg 200B) + metadata |
| Storage growth | 50 GB/day | 100M × 500B; 1-year retention ≈ 18 TB |
| Hot key cache | ~10 GB Redis | Top 1% of codes handle 80% of traffic (Zipf distribution) |
| Kafka throughput | 115K events/sec | One event per redirect; avg message size ~200B |

Key insight: more than 99% of traffic is reads. The redirect cache hit ratio is the single most important operational metric: a 1% drop in hit rate at 115K RPS means 1,150 extra DB queries per second.
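The steady-state rates in the table are a straight division by seconds per day; a quick sanity check (class name is illustrative):

```java
// Back-of-envelope conversion from daily volume to steady-state rate,
// assuming 86,400 seconds per day.
public final class Estimates {
    public static long perSecond(long eventsPerDay) {
        return eventsPerDay / 86_400L;
    }

    public static void main(String[] args) {
        System.out.println(perSecond(100_000_000L));    // link creations: ~1,157/sec
        System.out.println(perSecond(10_000_000_000L)); // redirects: ~115,740/sec
    }
}
```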

Design Goals

| Goal | Why it matters | Design decision it drives |
|---|---|---|
| Code uniqueness is non-negotiable | One code resolving to two different URLs breaks user trust permanently | INCR counter over MD5; no collision retry loop ever needed |
| Redirect must never block on analytics | Click recording is best-effort; a blocking Kafka publish must not delay the 302 response | Fire-and-forget async publish; never await the Kafka future |
| Cache hit rate ≥ 99% | At 115K redirects/sec, a 2% miss rate = 2,300 DB reads/sec → PostgreSQL saturation | Write-through on creation; LRU with enough memory for the hot 1% of codes |
| Expired codes must return 410, not 404 | 404 = "never existed"; 410 = "existed but gone"; important for SEO crawlers and link-rot debugging | Soft-delete with deleted_at; expiry check on every response |
| Custom alias collisions must be rejected | Silent alias overwrite routes users to the wrong destination | Atomic INSERT ... WHERE NOT EXISTS; return 409 Conflict if the slug is taken |

Performance Analysis

| Pressure point | Symptom | First response | Second response |
|---|---|---|---|
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |

URL shortener health signals โ€” the five metrics that matter most:

| Metric | Alert threshold | What it reveals |
|---|---|---|
| redirect.cache_hit_ratio | < 99% | Redis memory pressure, cold-start spike, or cache poisoning |
| redirect.p95_latency_ms (cache path) | > 15ms | Slow Redis connection pool or Lettuce thread starvation |
| redirect.p95_latency_ms (DB fallback) | > 80ms | PostgreSQL shard hot-spot or connection pool exhaustion |
| kafka.consumer_lag (click events) | > 10K events | Flink under-provisioned; analytics freshness degrading |
| expiry_worker.deleted_per_run | > 500K rows | Expiry backlog growing; increase worker frequency or batch size |

📊 High Level Design - Architecture for Functional Requirements

Building Blocks

| Component | Responsibility | Technology choice |
|---|---|---|
| API Gateway | Rate limiting, auth, TLS termination | Nginx / AWS API Gateway |
| Shortener Service | Code generation, alias validation, DB write, cache warm-up | Spring Boot stateless pods |
| Redirect Service | Cache-first lookup, 302 response, click event emit | Spring Boot (ultra-low latency) |
| Metadata Store | Durable (short_code → long_url, expiry, owner) mapping | PostgreSQL, sharded by hash(short_code) |
| Cache Tier | Hot-path redirect resolution; TTL mirrors link expiry | Redis Cluster |
| Analytics Pipeline | Per-redirect click events → aggregated metrics | Kafka → Flink → ClickHouse |
| Expiry Worker | Soft-delete expired codes; evict from cache | Scheduled Spring Batch job |

Design the APIs

Create short link:

POST /shorten
Content-Type: application/json
Idempotency-Key: <uuid>

{
  "longUrl": "https://example.com/very/long/path?ref=campaign&utm_source=email",
  "alias": "launch-2025",        // optional custom slug
  "expiresAt": "2025-12-31T23:59:59Z"  // optional TTL
}

HTTP 201 Created
{
  "shortCode": "aB3kZ9",
  "shortUrl": "https://tiny.example.com/aB3kZ9",
  "expiresAt": "2025-12-31T23:59:59Z"
}

Redirect:

GET /aB3kZ9

HTTP 302 Found
Location: https://example.com/very/long/path?ref=campaign&utm_source=email

Click analytics:

GET /analytics/aB3kZ9?from=2025-01-01&to=2025-01-31

HTTP 200 OK
{
  "shortCode": "aB3kZ9",
  "totalClicks": 84201,
  "topReferrers": ["twitter.com", "email", "direct"],
  "topCountries": ["US", "IN", "GB"],
  "clicksPerDay": [{ "date": "2025-01-01", "count": 3120 }, ...]
}

Key design decisions:

  • Idempotency-Key header on POST ensures safe retries without duplicate code creation
  • 301 Permanent vs 302 Temporary redirect: use 302 so analytics can track future clicks (301 is cached by browsers and bypasses the shortener)
  • Analytics API reads from ClickHouse (eventual), not PostgreSQL (strong); the latency vs freshness trade-off is explicit
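The Idempotency-Key behavior can be sketched as a keyed result store. Here ConcurrentHashMap stands in for what would be a Redis SETNX-style store in production; the class and method names are mine, not from the post:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: remember the short code produced for each Idempotency-Key so a
// retried POST /shorten returns the original code instead of minting a new one.
public final class IdempotencyStore {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    public String getOrCompute(String idempotencyKey, Supplier<String> create) {
        // computeIfAbsent runs create at most once per key, even under concurrency
        return results.computeIfAbsent(idempotencyKey, k -> create.get());
    }
}
```

A retry with the same key but a different payload should arguably be rejected; real implementations often store a payload hash alongside the result to detect that case.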

Communication Between Components

| Path | Protocol | Why |
|---|---|---|
| Client → Redirect Service | HTTPS (synchronous) | User-facing; must be fast and reliable |
| Redirect Service → Redis | TCP (synchronous) | Cache hit exits here; no DB touch |
| Redirect Service → PostgreSQL | JDBC (synchronous, fallback only) | Cold miss; DB read + write-through back to Redis |
| Redirect Service → Kafka | Async publish (fire-and-forget) | Click event; never blocks the redirect response |
| Kafka → Flink → ClickHouse | Streaming | Analytics aggregation; eventual consistency acceptable |
| Expiry Worker → PostgreSQL + Redis | Batch | Periodic soft-delete of expired codes; Redis key eviction |

Data Flow

Write path (create short link):

Client
  → POST /shorten
  → Shortener Service
      → Redis INCR counter:url_shortener  (atomic, globally unique ID)
      → Base62 encode(id)                 (6-char short code)
      → PostgreSQL INSERT                 (durable mapping)
      → Redis SET redirect:{code}         (write-through cache warm-up)
  ← 201 { shortCode, shortUrl }

Read path (redirect):

Client
  → GET /{shortCode}
  → Redirect Service
      → Redis GET redirect:{shortCode}    (cache hit: ~99% of requests)
          → 302 redirect immediately
      → PostgreSQL SELECT (cache miss only)
          → Redis SET (write-through)
          → 302 redirect
      → Kafka publish click event         (async, fire-and-forget)

Analytics path:

Kafka topic: url.redirect.events
  → Flink streaming job
      → Aggregate: clicks per code, referrer, geo
  → ClickHouse (columnar, append-optimised)
  → Analytics API (GET /analytics/{code})

Mermaid view of both paths:

flowchart TD
    A[Client] -->|POST /shorten| B[Shortener Service]
    B -->|INCR + Base62| C[Redis Counter]
    B -->|INSERT mapping| D[PostgreSQL]
    B -->|SET redirect:code| E[Redis Cache]
    B -->|201 shortCode| A

    F[Client] -->|GET shortCode| G[Redirect Service]
    G -->|GET redirect:code| E
    E -->|cache hit 99%| H[302 Redirect]
    G -->|cache miss| D
    D --> G
    G -->|fire-and-forget| I[Kafka]
    I --> J[Flink → ClickHouse]

🌐 Real-World Applications: API Mapping

This architecture pattern powers multiple production systems. Here is how each URL shortener feature maps to the design:

| Feature | Real-world example | Design element |
|---|---|---|
| Create short link | Bitly API, Twitter t.co auto-shortening | POST /shorten → INCR + Base62 |
| Redirect | All short links: bit.ly/xyz → destination | Redis cache-first GET + 302 |
| Custom alias | ow.ly/brand-launch campaign links | Alias field + collision check |
| Expiry | QR codes for time-limited promotions | expiresAt + eviction worker |
| Click analytics | Bitly dashboard, campaign UTM tracking | Kafka → Flink → ClickHouse |

301 vs 302, a decision that affects revenue: Twitter's t.co and Bitly both use 302 (temporary redirect) despite the permanent nature of most links. A 301 would be cached by browsers, bypassing the shortener entirely on repeat visits, which means zero analytics and zero ability to update the destination URL after creation. The performance cost of the extra hop is 10-50ms; the business value of every trackable click justifies it.

🔑 Feature Deep Dive: Custom Alias

A custom alias is a user-chosen slug (e.g., bit.ly/launch-2025) instead of the auto-generated 6-char code. The challenge is collision detection at write time without a race condition.

The naive approach fails: read-then-write has a race window where two callers check the same slug simultaneously, both see "available", and both succeed, overwriting each other.

The correct approach: use a database-level unique constraint and let the INSERT fail atomically:

public ShortenResult shorten(ShortenRequest req) {
    String shortCode;

    if (req.alias() != null) {
        shortCode = validateAlias(req.alias()); // length, charset check
    } else {
        Long id = redis.opsForValue().increment("counter:url_shortener");
        shortCode = base62Encode(id);
    }

    try {
        // PostgreSQL enforces uniqueness via the PRIMARY KEY; no race window
        database.save(new UrlMapping(shortCode, req.longUrl(), req.expiresAt(), req.alias() != null));
    } catch (DataIntegrityViolationException e) {
        if (req.alias() == null) throw e; // counter-generated codes should never collide
        // Alias was taken between our availability check and the insert; surface a clean 409
        throw new AliasTakenException(shortCode);
    }

    // Write-through cache warm-up
    String cacheValue = buildCacheValue(req.longUrl(), req.expiresAt());
    redis.opsForValue().set("redirect:" + shortCode, cacheValue, ttlFor(req.expiresAt()));

    if (req.alias() != null) {
        // Reserve alias flag for fast collision check on future creates
        redis.opsForValue().set("alias_reserved:" + shortCode, "1", ttlFor(req.expiresAt()));
    }

    return new ShortenResult(shortCode, buildShortUrl(shortCode));
}

Custom alias rules enforced at the API layer (before the DB hit):

  • Length: 4-12 characters
  • Charset: [A-Za-z0-9-_] only; no spaces, no special characters
  • Blocklist: reserved slugs like api, analytics, admin, health are rejected with 422
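Those rules translate directly into a small validator; a minimal sketch (class name and the exact blocklist contents are illustrative, taken from the examples above):

```java
import java.util.Set;
import java.util.regex.Pattern;

// Sketch of the API-layer alias validation: length 4-12, charset [A-Za-z0-9-_],
// and a reserved-slug blocklist, all checked before any DB work.
public final class AliasValidator {
    private static final Pattern ALLOWED = Pattern.compile("^[A-Za-z0-9_-]{4,12}$");
    private static final Set<String> BLOCKLIST = Set.of("api", "analytics", "admin", "health");

    public static boolean isValid(String alias) {
        return ALLOWED.matcher(alias).matches()
            && !BLOCKLIST.contains(alias.toLowerCase());
    }
}
```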

โฐ Feature Deep Dive: Expiry Support

Expiry has two independent halves: serving expired codes correctly and cleaning up expired rows in the background.

Serving Expired Codes

The redirect service must detect expiry even on a cache hit, because the Redis TTL may not have fired yet due to lazy eviction. The solution is to embed the expiry epoch inside the cached value:

// Cache value format: "https://example.com/long/url|1735689600"
//                      long URL              | expires_at epoch (0 = no expiry)

public RedirectResult resolve(String shortCode) {
    String cached = redis.opsForValue().get("redirect:" + shortCode);

    if (cached != null) {
        String[] parts = cached.split("\\|", 2);
        String longUrl = parts[0];
        long expiresEpoch = parts.length > 1 ? Long.parseLong(parts[1]) : 0;

        if (expiresEpoch > 0 && Instant.now().getEpochSecond() > expiresEpoch) {
            // Code is cached but has expired; evict and return 410
            redis.delete("redirect:" + shortCode);
            throw new LinkExpiredException(shortCode);
        }
        return new RedirectResult(longUrl);
    }

    // Cache miss: check DB
    return database.findActive(shortCode)
        .map(m -> {
            redis.opsForValue().set("redirect:" + shortCode, buildCacheValue(m), ttlFor(m.expiresAt()));
            return new RedirectResult(m.longUrl());
        })
        .orElseThrow(() -> new ShortCodeNotFoundException(shortCode));
}
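The snippets above call buildCacheValue and ttlFor without showing them. Here are plausible implementations, assuming the "longUrl|epoch" packed format described earlier; the container class name is mine:

```java
import java.time.Duration;
import java.time.Instant;

public final class CacheHelpers {
    // Pack the long URL and expiry epoch into the cached value; 0 means "no expiry".
    public static String buildCacheValue(String longUrl, Instant expiresAt) {
        long epoch = (expiresAt == null) ? 0L : expiresAt.getEpochSecond();
        return longUrl + "|" + epoch;
    }

    // Redis key TTL mirrors the link expiry; 30-day default when there is none.
    public static Duration ttlFor(Instant expiresAt) {
        return (expiresAt == null)
            ? Duration.ofDays(30)
            : Duration.between(Instant.now(), expiresAt);
    }
}
```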

Background Expiry Worker

Redis TTL handles eviction from the cache layer automatically. PostgreSQL requires a background job to soft-delete expired rows:

-- Runs every 15 minutes via Spring Batch or a cron job
-- Processes in batches of 10K to avoid long-running transactions
-- (PostgreSQL has no UPDATE ... LIMIT, so batch via a ctid subquery)
UPDATE url_mappings
SET    deleted_at = NOW()
WHERE  ctid IN (
    SELECT ctid FROM url_mappings
    WHERE  expires_at < NOW()
      AND  deleted_at IS NULL
    LIMIT  10000
);

Why soft-delete instead of hard DELETE? Soft-delete preserves the row for audit trails, abuse investigations, and link-rot debugging. Hard DELETE runs after a 90-day grace period.

Expiry lifecycle summary:

Link created with expiresAt=T
  → PostgreSQL row: expires_at=T, deleted_at=NULL
  → Redis key: TTL set to (T - now) seconds

At time T:
  → Redis: may evict lazily (TTL fires) or on next access (active check)
  → PostgreSQL: still has expires_at=T, deleted_at=NULL (background worker hasn't run yet)

Redirect request arrives at T+5min (before worker runs):
  → Cache hit: epoch check detects expiry → 410, evict key
  → Cache miss: DB query finds row, expires_at < NOW() → 410

Background worker runs at T+15min:
  → Sets deleted_at=NOW() on all expired rows
  → Hard DELETE runs at T + 90 days

📈 Feature Deep Dive: Click Analytics Pipeline

Every redirect emits a lightweight event to Kafka. The analytics pipeline is intentionally decoupled from the redirect path: a Flink cluster failure should never affect redirect latency.

Kafka Event Schema

{
  "eventType": "url.redirect",
  "shortCode": "aB3kZ9",
  "timestamp": "2025-01-15T14:23:11.442Z",
  "referrer": "twitter.com",
  "userAgent": "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0)",
  "deviceFamily": "mobile",
  "countryCode": "US",
  "ipGeoHash": "9q8yy"
}
  • ipGeoHash instead of raw IP: geo-hashed to cell level (≈5 km²) at the redirect service before emission; raw IPs never leave the request context.
  • Topic partitioning key: shortCode; ensures all events for the same code land on the same Flink operator, enabling stateful aggregation without a shuffle.

Flink job topology:

Kafka source (url.redirect.events)
  → KeyBy(shortCode)
  → TumblingWindow(1 minute)
  → Aggregate: clickCount, topReferrers (top-K), topCountries (top-K)
  → Sink → ClickHouse table: click_events_by_minute

ClickHouse sink table:
  CREATE TABLE click_events_by_minute (
      short_code    String,
      window_start  DateTime,
      click_count   UInt64,
      top_referrers Array(String),
      top_countries Array(String)
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(window_start)
  ORDER BY (short_code, window_start);

Why ClickHouse? It's a columnar OLAP store purpose-built for aggregation queries over append-only event data. A query like "total clicks for aB3kZ9 in January" scans only the click_count column for the matching short_code, not every column in every row.

⚖️ Trade-offs & Failure Modes (Design Deep Dive for Non Functional Requirements)

Scaling Strategy

| Bottleneck | Symptom | Fix |
|---|---|---|
| Redis INCR single instance | INCR latency spikes at > 500K writes/sec | Partition counter into 1024 shards; append shard ID to code |
| PostgreSQL write hotspot | INSERT p99 grows as table crosses 1B rows | Shard by hash(short_code) % N; never by timestamp |
| Cache eviction churn | Redirect miss rate climbs | Increase Redis memory; tune maxmemory-policy allkeys-lru |
| Kafka consumer lag | Click events delayed > 60s | Add Flink parallelism; scale consumer group partitions |
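The first fix above (partition the counter into shards) can be sketched in-process. AtomicLong stands in for 1024 per-shard Redis INCR keys, and the id layout (count * SHARDS + shard) is one possible way to keep ids globally unique across shards; all names here are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of counter sharding: each request INCRs a random shard's counter and
// derives a globally unique id by interleaving the shard id into the number.
public final class ShardedCounter {
    static final int SHARDS = 1024;
    private final AtomicLong[] counters = new AtomicLong[SHARDS];

    public ShardedCounter() {
        for (int i = 0; i < SHARDS; i++) counters[i] = new AtomicLong();
    }

    public long nextId() {
        int shard = ThreadLocalRandom.current().nextInt(SHARDS);
        long count = counters[shard].incrementAndGet();
        return count * SHARDS + shard; // distinct (count, shard) pairs → distinct ids
    }
}
```

Spreading writes over shards removes the single-key INCR hotspot while the encoded id stays small enough for a 6-char Base62 code.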

Availability and Resilience

  • Redirect Service is stateless: horizontal scale behind a load balancer; any pod can serve any request
  • Redis failure fallback: if Redis is unreachable, fall back to PostgreSQL directly; accept latency degradation, not downtime
  • PostgreSQL replication: synchronous replica for failover; read replicas for analytics queries (not the redirect path)
  • Circuit breaker: wrap PostgreSQL calls with Resilience4j; open the circuit after 5 consecutive failures; return 503 rather than hanging

Storage and Caching

| Layer | What it stores | Eviction policy |
|---|---|---|
| Redis (hot) | redirect:{code} → longUrl for top 1% of codes | LRU + per-key TTL matching link expiry |
| PostgreSQL | All URL mappings; source of truth | Soft-delete on expiry; hard-delete after 90-day grace |
| ClickHouse | Aggregated click events per code/day | Partition by month; drop partitions > 2 years |

Consistency, Security, and Monitoring

Consistency model by operation:

  • Short code creation: strong consistency; the PostgreSQL write must succeed before returning shortCode to the caller
  • Redirect resolution: eventual (Redis) to strong (PostgreSQL fallback); stale-while-revalidate acceptable for non-expired codes
  • Click counts: eventual; Flink aggregation lag of 5-30 seconds is acceptable

Security:

  • Rate-limit POST /shorten per API key (100 req/min) to prevent code-space exhaustion attacks
  • Validate destination URLs against a phishing/malware blocklist on creation
  • Encrypt owner_id in the mapping table; do not expose creator identity via the redirect path

Key SLO signals to monitor:

  • redirect.cache_hit_ratio: alert if it drops below 98%
  • redirect.p95_latency_ms: alert if cache-hit p95 exceeds 15ms
  • kafka.consumer_lag: alert if click event lag exceeds 60s
  • postgres.connection_pool_utilization: alert above 70% to catch cold-miss storms early

🧭 Decision Guide

| Situation | Recommendation |
|---|---|
| Base62 vs MD5 for code generation | Base62(INCR): no collision loop, shorter codes, monotonic |
| 301 vs 302 redirect | 302: preserves analytics; 301 breaks click tracking |
| Cache write-through vs cache-aside | Write-through: warm cache on creation; first redirect never hits DB |
| Sharding by timestamp vs short_code | Shard by hash(short_code): avoids a hot recent-writes partition |
| Strong vs eventual for analytics | Eventual: click counts at Flink lag (5-30s) are acceptable for dashboards |
| Custom alias collision strategy | Reject with 409 + suggest alternatives; never silently override |
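The hash(short_code) routing from the table above is a one-liner; a sketch (class name is mine, and String.hashCode is a stand-in: cross-language services would agree on a fixed hash algorithm instead):

```java
// Sketch: route a short code to one of N metadata shards by hash.
public final class ShardRouter {
    public static int shardFor(String shortCode, int shards) {
        // Math.floorMod keeps the shard index non-negative even for negative hashCodes
        return Math.floorMod(shortCode.hashCode(), shards);
    }
}
```

Because the hash spreads codes uniformly regardless of creation time, new writes land on all shards instead of piling onto the most recent partition.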

🧪 Practical Example for Interview Delivery

A repeatable way to deliver this design in interviews:

  1. Start with actors, use cases, and scope boundaries.
  2. State estimation assumptions (QPS, payload size, storage growth).
  3. Draw HLD and explain each component responsibility.
  4. Walk through one failure cascade and mitigation strategy.
  5. Describe phase-based evolution for 10x traffic.

Question-specific practical note:

  • Use cache-first redirects, write-through mapping, and background click aggregation to protect the primary database.

A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."

๐Ÿ—๏ธ Advanced Concepts for Production Evolution

When interviewers ask follow-up scaling questions, use a phased approach:

  1. Stabilize critical path dependencies with better observability.
  2. Increase throughput by isolating heavy side effects asynchronously.
  3. Reduce hotspot pressure through key redesign and repartitioning.
  4. Improve resilience using automated failover and tested runbooks.
  5. Expand to multi-region only when latency, compliance, or reliability targets require it.

This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.

🛠️ Atomic INCR, Base62 Encoding, and Write-Through Cache: Three Decisions on One Critical Path

Three architectural decisions define the URL shortener's write and read paths:

  1. Redis INCR for ID generation - atomic across all instances, no collision detection loop needed
  2. Base62 encoding - converts the integer to a 6-character code (62⁶ ≈ 56 billion unique codes) rather than MD5, which requires a collision detection loop and produces longer outputs
  3. Write-through cache warm-up - the first redirect request hits Redis, not PostgreSQL

// Write path: atomic counter → Base62 short code → DB persist → cache warm-up
public String shorten(String longUrl) {
    // INCR is atomic across all gateway instances: globally unique, no collision check
    Long id = redis.opsForValue().increment("counter:url_shortener");
    String shortCode = base62Encode(id);  // 62^6 ≈ 56B codes at 6 chars vs MD5's 32-char hash

    database.save(new UrlMapping(shortCode, longUrl, Instant.now()));

    // Write-through: warm cache immediately so the first redirect never touches the DB
    redis.opsForValue().set("redirect:" + shortCode, longUrl, Duration.ofDays(30));
    return shortCode;
}

// Read path: >99% of redirects exit at the Redis check; DB is the cold-miss fallback
public String resolve(String shortCode) {
    String cached = redis.opsForValue().get("redirect:" + shortCode);
    if (cached != null) return cached;

    return database.findByShortCode(shortCode)
        .map(m -> {
            redis.opsForValue().set("redirect:" + shortCode, m.longUrl(), Duration.ofDays(30));
            return m.longUrl();
        })
        .orElseThrow(() -> new ShortCodeNotFoundException(shortCode));
}

private static final String ALPHABET =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

private String base62Encode(long id) {
    StringBuilder sb = new StringBuilder();
    do {
        sb.append(ALPHABET.charAt((int) (id % 62)));  // 0-9A-Za-z digits
        id /= 62;
    } while (id > 0);
    return sb.reverse().toString();                   // most significant digit first
}

The INCR counter is why Base62 beats MD5 here: MD5 produces a random hash that must be checked for uniqueness on every write (a database round-trip), whereas a monotonically increasing counter is guaranteed unique by definition. Lettuce's non-blocking connection pool means the INCR call never blocks an application thread; at 100K RPS the event loop processes Redis responses asynchronously.

For a full deep-dive on Base62 ID generation strategies, counter sharding for write throughput, and cache-first redirect path optimisation, a dedicated follow-up post is planned.

📚 Lessons Learned

  • Start with actors and use cases before drawing any diagram.
  • Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
  • Convert NFRs into measurable SLO-style targets.
  • Separate functional HLD from non-functional deep dive reasoning.
  • Scale the first measured bottleneck, not the most visible component.

📌 TLDR: Summary & Key Takeaways

  • Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
  • Good HLDs explain both request flow and state update flow.
  • Non-functional architecture determines reliability under pressure.
  • Phase-based evolution outperforms one-shot overengineering.
  • Theory-linked reasoning improves consistency across different interview prompts.

📝 Practice Quiz

  1. Why should system design answers begin with actors and use cases?

A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements

Correct Answer: B

  2. Which section should define p95 and p99 targets?

A) Non Functional Requirements
B) Only the quiz section
C) Only the related posts section

Correct Answer: A

  3. What is the primary benefit of separating synchronous and asynchronous paths?

A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs

Correct Answer: B

  4. Open-ended challenge: for this design, which component would you scale first at 10x traffic and which metric would you use to justify that decision?

Written by Abstract Algorithms (@abstractalgorithms)