System Design HLD Example: Distributed Cache Platform
Interview HLD for a distributed cache with eviction, invalidation, and resilience trade-offs.
TLDR: A distributed cache reduces read latency and source-of-truth load while introducing consistency trade-offs. This article follows a standard system design interview flow: use cases, requirements, estimations, design goals, high-level design (HLD), and a design deep dive.
Instagram's primary database served user profile reads at 28,000 QPS until a single viral post triggered a cache-miss storm that drove the DB to 95% CPU utilization. The fix was not a larger database; it was a smarter cache topology. Without a distributed cache absorbing read amplification, read-heavy systems hit their database ceiling within months of meaningful traffic growth, regardless of hardware tier.
Designing a distributed cache teaches you the core tension in every high-scale read path: how to keep data consistent enough to be correct while keeping it close enough to be fast, and when the cost of strict consistency outweighs the cost of serving a stale read.
By the end of this walkthrough you'll know why consistent hashing minimizes key reshuffling when a node fails (only about 1/N of the keys are remapped, not all of them), why a 95% cache hit ratio is the threshold that prevents database saturation at 50K RPS, and why cache stampedes require probabilistic early expiry rather than a naive shared TTL that expires every key simultaneously.
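That 1/N claim can be made concrete with a minimal consistent-hash-ring sketch. This is illustrative Java; the MD5-based hash, the virtual-node count, and all class and node names are choices made for this example, not details from the design above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent hash ring: nodes and keys share one hash space, and a key
// lives on the first node clockwise from its hash. Removing a node therefore
// remaps only the keys that pointed at it (about 1/N of them), not the whole set.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    String nodeFor(String key) {
        // first virtual node at or after the key's hash, wrapping around the ring
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                 | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (Exception e) { throw new IllegalStateException(e); }
    }
}
```

Removing a node reroutes only the keys whose clockwise successor was that node; every other key keeps its placement, which is why a single cache-node failure does not trigger a cluster-wide reshuffle.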
Use Cases
Actors
- End users consuming the primary product surface.
- Producer entities that create or update domain content.
- Platform services enforcing policy, routing, and reliability controls.
Use Cases
- Primary interview prompt: Design a distributed cache service for backend workloads.
- Core user journeys: Support get and set, TTL, eviction, invalidation signals, and cache observability.
- Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.
This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.
Functional Requirements
In Scope
- Support the core product flow end-to-end with clear API contracts.
- Preserve business correctness for critical operations.
- Expose reliable read and write interfaces with predictable behavior.
- Support an incremental scaling path instead of requiring a redesign.
Out of Scope (v1 boundary)
- Full global active-active writes across every region.
- Heavy analytical workloads mixed into latency-critical request paths.
- Complex personalization experiments in the first architecture version.
Functional Breakdown
- Prompt: Design a distributed cache service for backend workloads.
- Focus: Support get and set, TTL, eviction, invalidation signals, and cache observability.
- Initial building-block perspective: Cache gateway, shard router, cache node pool, replication channel, invalidation bus, metrics stream.
A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.
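The functional surface above (get and set with TTL plus eviction) can be sketched per shard with an access-ordered map. This is a single-threaded, single-node illustration with invented names, not the distributed service itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Per-shard store: LRU eviction via an access-ordered LinkedHashMap,
// TTL enforced lazily on read. A real shard would add locking (or a
// concurrent structure), metrics, and an active expiry sweep.
class CacheShard<K, V> {
    private record Entry<V>(V value, long expiresAtMillis) {}

    private final LinkedHashMap<K, Entry<V>> map;

    CacheShard(int capacity) {
        // accessOrder=true makes iteration order LRU; the eldest entry is
        // evicted whenever an insert pushes the map past capacity.
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > capacity;
            }
        };
    }

    void set(K key, V value, long ttlMillis) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAtMillis()) { // lazy TTL expiry
            map.remove(key);
            return null;
        }
        return e.value();
    }
}
```

The design choice worth naming in an interview: lazy expiry keeps reads cheap but lets dead entries occupy memory until touched, which is why production caches pair it with a background sweep.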
Non Functional Requirements
| Dimension | Target | Why it matters |
| --- | --- | --- |
| Scalability | Horizontal scale across services and workers | Handles growth without rewriting core flows |
| Availability | 99.9% baseline with path to 99.99% | Reduces user-visible downtime |
| Performance | Clear p95 and p99 latency SLOs | Avoids average-latency blind spots |
| Consistency | Explicit strong vs eventual boundaries | Prevents hidden correctness defects |
| Operability | Metrics, logs, traces, and runbooks | Speeds incident isolation and recovery |
Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.
Deep Dive: Estimations and Design Goals
The Internals
- Service boundaries should align with ownership and deployment isolation.
- Data model choices should follow access patterns, not default preferences.
- Retries, idempotency, and timeout budgets must be explicit before scale.
- Dependency failure behavior should be defined before incidents happen.
Estimations
Use structured rough-order numbers in interviews:
- Read and write throughput (steady and peak).
- Read/write ratio and burst amplification factor.
- Typical payload size and large-object edge cases.
- Daily storage growth and retention horizon.
- Cache memory for hot keys and frequently accessed entities.
| Estimation axis | Question to answer early |
| --- | --- |
| Read QPS | Which read path saturates first at 10x? |
| Write QPS | Which state mutation becomes the first bottleneck? |
| Storage growth | When does repartitioning become mandatory? |
| Memory envelope | What hot set must remain in memory? |
| Network profile | Which hops create the highest latency variance? |
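A worked pass over two of these axes, using the hit-ratio and RPS figures from the introduction plus illustrative inputs (20M hot keys, 2 KB average values, roughly 100 bytes of per-key overhead; none of these numbers come from a real system):

```java
// Back-of-envelope sizing helpers. Inputs are illustrative; substitute
// your own workload's measurements before drawing conclusions.
class CacheEstimates {
    // Memory envelope: hot keys * (payload + per-key metadata overhead).
    static long hotSetBytes(long hotKeys, long avgValueBytes, long perKeyOverheadBytes) {
        return hotKeys * (avgValueBytes + perKeyOverheadBytes);
    }

    // Database load: only misses fall through to the source of truth.
    static double dbQps(double readQps, double hitRatio) {
        return readQps * (1.0 - hitRatio);
    }
}
```

With these inputs the hot set is about 40 GiB (several cache nodes, not one), and at 50K read RPS a 95% hit ratio sends 2.5K QPS to the database while 80% sends 10K QPS, four times the load. That sensitivity is why the hit-ratio target belongs in the NFR table rather than being left implicit.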
Design Goals
- Keep synchronous user-facing paths short and deterministic.
- Shift heavy side effects and fan-out work to asynchronous channels.
- Minimize coupling between control-plane and data-plane components.
- Introduce complexity in phases tied to measurable bottlenecks.
Performance Analysis
| Pressure point | Symptom | First response | Second response |
| --- | --- | --- | --- |
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |
Metrics that should drive architecture evolution:
- p95 and p99 latency by operation.
- Error-budget burn by service and endpoint.
- Queue lag, retry volume, and dead-letter trends.
- Cache hit ratio by key family.
- Partition or shard utilization skew.
High Level Design - Architecture for Functional Requirements
Building Blocks
- Cache gateway, shard router, cache node pool, replication channel, invalidation bus, metrics stream.
- API edge layer for authentication, authorization, and policy checks.
- Domain services for read and write responsibilities.
- Durable storage plus cache for fast retrieval and controlled consistency.
- Async event path for secondary processing and integrations.
Design the APIs
- Keep contracts explicit and version-friendly.
- Use idempotency keys for retriable writes.
- Return actionable error metadata for clients and retries.
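A sketch of how an idempotency key can make a retriable write safe. The in-memory map here stands in for a durable deduplication table, which a real service would need so the guarantee survives restarts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Idempotent write handler: the first call with a given key executes the
// write and records its result; retries with the same key replay the
// recorded result instead of re-executing the side effect.
class IdempotentWriter<R> {
    private final Map<String, R> completed = new ConcurrentHashMap<>();

    R execute(String idempotencyKey, Supplier<R> write) {
        // computeIfAbsent runs the write at most once per key, even under
        // concurrent retries of the same request.
        return completed.computeIfAbsent(idempotencyKey, k -> write.get());
    }
}
```

Clients generate the key once per logical operation (for example, per POST /resources call) and reuse it across network retries, so a timeout-then-retry cannot create a duplicate resource.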
Communication Between Components
- Synchronous path for user-visible confirmation.
- Asynchronous path for fan-out, indexing, notifications, and analytics.
Data Flow
- Read request -> cache shard lookup -> hit/miss decision -> origin fetch -> backfill and invalidate.
```mermaid
flowchart TD
  A[Client or Producer] --> B[API and Policy Layer]
  B --> C[Core Domain Service]
  C --> D[Primary Data Store and Cache]
  C --> E[Async Event or Job Queue]
  D --> F[User-Facing Response]
  E --> G[Workers and Integrations]
  G --> H[State Update and Telemetry]
```
Real-World Applications and API Mapping
This architecture pattern appears in real production systems because traffic is bursty, dependencies fail partially, and correctness requirements vary by operation type.
Practical API mapping examples:
- POST /resources for write operations with idempotency support.
- GET /resources/{id} for low-latency object retrieval.
- GET /resources?cursor= for scalable pagination and stable traversal.
- Async event emissions for indexing, notifications, and reporting.
Real-world system behavior is defined during failure, not normal operation. Good designs clearly specify what can be stale, what must be exact, and what should fail fast to preserve reliability.
Trade-offs & Failure Modes (Design Deep Dive for Non Functional Requirements)
Scaling Strategy
- Scale stateless services horizontally behind load balancing.
- Partition stateful data by access-pattern-aware keys.
- Add queue-based buffering where write bursts exceed synchronous capacity.
Availability and Resilience
- Multi-instance deployment across failure domains.
- Replication and failover planning for stateful systems.
- Circuit breakers, retries with backoff, and bounded timeouts.
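Retries with backoff and bounded timeouts can be combined into one budget-aware loop. A minimal sketch, assuming the caller supplies the per-dependency budget; a production version would add jitter and integrate with a circuit breaker.

```java
import java.util.concurrent.Callable;

// Retry with exponential backoff and an overall deadline: each attempt may
// fail, but the loop gives up once the total budget is spent instead of
// retrying forever into an already-failed dependency.
class BoundedRetry {
    static <T> T call(Callable<T> op, int maxAttempts, long baseDelayMillis,
                      long totalBudgetMillis) throws Exception {
        long deadline = System.currentTimeMillis() + totalBudgetMillis;
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                long delay = baseDelayMillis << attempt; // 1x, 2x, 4x, ...
                // Fail fast if sleeping would blow the overall budget.
                if (System.currentTimeMillis() + delay >= deadline) break;
                Thread.sleep(delay);
            }
        }
        throw last; // surface the final failure to the caller's fallback path
    }
}
```

The budget check is the important part: without it, per-attempt retries multiply under load and a slow dependency turns into a timeout cascade upstream.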
Storage and Caching
- Cache-aside for read-heavy access paths.
- Explicit invalidation and refresh policy.
- Tiered storage for hot, warm, and cold access profiles.
Consistency, Security, and Monitoring
- Clear strong vs eventual consistency contracts per operation.
- Authentication, authorization, and encryption in transit and at rest.
- Monitoring stack with metrics, logs, traces, SLO dashboards, and alerting.
This section is the architecture-for-NFRs view of the interview template. It explains how the system remains stable under scale, failures, and incident pressure.
Decision Guide
| Situation | Recommendation |
| --- | --- |
| Early stage with moderate traffic | Keep architecture minimal and highly observable |
| Read-heavy workload dominates | Optimize cache and read model before complex rewrites |
| Write hotspots appear | Rework key strategy and partitioning plan |
| Incident frequency increases | Strengthen SLOs, runbooks, and fallback controls |
Practical Example for Interview Delivery
A repeatable way to deliver this design in interviews:
- Start with actors, use cases, and scope boundaries.
- State estimation assumptions (QPS, payload size, storage growth).
- Draw HLD and explain each component responsibility.
- Walk through one failure cascade and mitigation strategy.
- Describe phase-based evolution for 10x traffic.
Question-specific practical note:
- Use cache-aside for read-heavy flows and protect hot keys with coalescing and admission controls.
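The coalescing mentioned above can be sketched as an in-process single-flight loader: concurrent misses for the same hot key share one origin fetch. This is illustrative only; a cross-instance version would need a distributed lock or lease.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Single-flight coalescing: the first miss for a key starts the origin
// fetch; concurrent misses for the same key await that same future
// instead of stampeding the database.
class CoalescingLoader<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> origin;

    CoalescingLoader(Function<K, V> origin) { this.origin = origin; }

    V load(K key) {
        CompletableFuture<V> f = inFlight.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> origin.apply(k)));
        try {
            return f.join();
        } finally {
            inFlight.remove(key, f); // after completion, the next miss fetches fresh
        }
    }
}
```

Pairing this with an admission control (only coalesce keys above a request-rate threshold) keeps the map small while still protecting the origin from hot-key bursts.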
A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."
Advanced Concepts for Production Evolution
When interviewers ask follow-up scaling questions, use a phased approach:
- Stabilize critical path dependencies with better observability.
- Increase throughput by isolating heavy side effects asynchronously.
- Reduce hotspot pressure through key redesign and repartitioning.
- Improve resilience using automated failover and tested runbooks.
- Expand to multi-region only when latency, compliance, or reliability targets require it.
This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.
Cache-Aside Read and Write-Invalidate: The Two Redis Decisions That Matter
Two decisions define the cache layer's correctness contract: cache-aside on reads (always check Redis first, only hit the database on a miss, then backfill) and invalidate on writes (evict the cache entry when the source of truth changes, so no stale read can survive beyond the next write).
```java
// Cache-aside: check Redis first; the database is only touched on a miss
public UserProfile getProfile(String userId) {
    String key = "userProfiles::" + userId;
    UserProfile cached = redis.opsForValue().get(key);
    if (cached != null) return cached; // cache hit: DB bypassed completely
    UserProfile fresh = database.findById(userId).orElseThrow();
    redis.opsForValue().set(key, fresh, Duration.ofMinutes(30)); // backfill with TTL
    return fresh;
}

// Write-invalidate: evict immediately on update; the next read re-warms from the DB
public void updateProfile(UserProfile profile) {
    database.save(profile);
    redis.delete("userProfiles::" + profile.id()); // invalidate; a stale entry cannot survive
}
```
The 30-minute TTL is a safety net, not the primary invalidation mechanism; the redis.delete() call on every write ensures the cache reflects the latest state as soon as any update commits. For stampede protection on cold misses under high concurrency, a Redisson distributed lock around the DB fetch prevents N threads from simultaneously querying the database for the same key.
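The probabilistic early expiry mentioned in the introduction (an XFetch-style refresh) can be sketched in a few lines. The `delta` and `beta` parameters are tuning inputs, and the values used below are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

// Probabilistic early expiry: treat an entry as expired early with a
// probability that rises sharply as the real TTL approaches. A population
// of readers then spreads its refreshes out instead of all missing at the
// same instant when a shared TTL lapses.
class EarlyExpiry {
    // deltaMillis: observed cost of recomputing the value
    // beta: aggressiveness; 1.0 is a common starting point
    static boolean shouldRefresh(long nowMillis, long expiryMillis,
                                 double deltaMillis, double beta) {
        // log of a uniform draw in (0,1) is negative, so subtracting it
        // pushes "now" forward and can fire the refresh before expiry.
        double xfetch = deltaMillis * beta * Math.log(ThreadLocalRandom.current().nextDouble());
        return nowMillis - xfetch >= expiryMillis;
    }
}
```

Each reader that gets a `true` recomputes and rewrites the entry; everyone else keeps serving the still-valid cached value, so the key is refreshed before it ever hard-expires under load.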
Caffeine (in-process W-TinyLFU cache, sub-microsecond latency) serves as an L1 layer in front of Redis when a single JVM handles repeated reads for the same hot key. Hazelcast replaces Redis when cache state must be consistent across cluster members without an external store.
For a full deep-dive on hot-key stampede protection, multi-layer cache topologies, and Redis Cluster consistent hashing, a dedicated follow-up post is planned.
Lessons Learned
- Start with actors and use cases before drawing any diagram.
- Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
- Convert NFRs into measurable SLO-style targets.
- Separate functional HLD from non-functional deep dive reasoning.
- Scale the first measured bottleneck, not the most visible component.
TLDR: Summary & Key Takeaways
- Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
- Good HLDs explain both request flow and state update flow.
- Non-functional architecture determines reliability under pressure.
- Phase-based evolution outperforms one-shot overengineering.
- Theory-linked reasoning improves consistency across different interview prompts.
Practice Quiz
- Why should system design answers begin with actors and use cases?
A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements
Correct Answer: B
- Which section should define p95 and p99 targets?
A) Non Functional Requirements
B) Only the quiz section
C) Only the lessons-learned section
Correct Answer: A
- What is the primary benefit of separating synchronous and asynchronous paths?
A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs
Correct Answer: B
- Open-ended challenge: for this design, which component would you scale first at 10x traffic and which metric would you use to justify that decision?
Written by
Abstract Algorithms
@abstractalgorithms