System Design HLD Example: Rate Limiter (Global API Protection)
Design a production-grade HLD for distributed rate limiting with fairness and low latency.
TLDR: A rate limiter controls request volume per identity and time window to protect backends from abuse and overload. This article follows a standard system design interview flow: use cases, requirements, estimations, design goals, HLD, and a design deep dive.
In 2016 a misconfiguration allowed a single GitHub OAuth application to inadvertently spike to 50,000 requests per minute, causing tenant degradation for unrelated users on the same API tier. Rate limiting was already present, but enforced per process: 20 gateway pods each allowed 2,500 req/min independently rather than coordinating a shared 2,500 req/min global budget. The fix required replacing per-process counters with a distributed atomic counter store visible to all gateway instances.
Designing a distributed rate limiter teaches you a precise lesson about shared mutable state at scale: the same race conditions that affect application business logic also affect the infrastructure protecting it, and solving them requires the same atomic-increment primitives.
By the end of this walkthrough you'll know why Redis INCR with a TTL key is the canonical implementation (atomic increment, O(1) per check, no distributed lock), why token bucket is preferred over fixed-window counting for burst-tolerant APIs, and why local estimation with gossip-based synchronization is the right trade-off when a centralized Redis approaches 1M req/s.
Use Cases
Actors
- End users consuming the primary product surface.
- Producer entities that create or update domain content.
- Platform services enforcing policy, routing, and reliability controls.
Use Cases
- Primary interview prompt: Design a distributed rate limiter for public APIs.
- Core user journeys: Support per-user or per-key policies, endpoint quotas, burst handling, and retry metadata.
- Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.
This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.
Functional Requirements
In Scope
- Support the core product flow end-to-end with clear API contracts.
- Preserve business correctness for critical operations.
- Expose reliable read and write interfaces with predictable behavior.
- Support an incremental scaling path instead of requiring a redesign.
Out of Scope (v1 boundary)
- Full global active-active writes across every region.
- Heavy analytical workloads mixed into latency-critical request paths.
- Complex personalization experiments in the first architecture version.
Functional Breakdown
- Prompt: Design a distributed rate limiter for public APIs.
- Focus: Support per-user or per-key policies, endpoint quotas, burst handling, and retry metadata.
- Initial building-block perspective: Policy service, token-bucket engine, distributed counter store, and edge enforcement integrated with gateway.
A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.
Non Functional Requirements
| Dimension | Target | Why it matters |
| --- | --- | --- |
| Scalability | Horizontal scale across services and workers | Handles growth without rewriting core flows |
| Availability | 99.9% baseline with path to 99.99% | Reduces user-visible downtime |
| Performance | Clear p95 and p99 latency SLOs | Avoids average-latency blind spots |
| Consistency | Explicit strong vs eventual boundaries | Prevents hidden correctness defects |
| Operability | Metrics, logs, traces, and runbooks | Speeds incident isolation and recovery |
Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.
Deep Dive: Estimations and Design Goals
The Internals
- Service boundaries should align with ownership and deployment isolation.
- Data model choices should follow access patterns, not default preferences.
- Retries, idempotency, and timeout budgets must be explicit before scale.
- Dependency failure behavior should be defined before incidents happen.
Estimations
Use structured rough-order numbers in interviews:
- Read and write throughput (steady and peak).
- Read/write ratio and burst amplification factor.
- Typical payload size and large-object edge cases.
- Daily storage growth and retention horizon.
- Cache memory for hot keys and frequently accessed entities.
| Estimation axis | Question to answer early |
| --- | --- |
| Read QPS | Which read path saturates first at 10x? |
| Write QPS | Which state mutation becomes the first bottleneck? |
| Storage growth | When does repartitioning become mandatory? |
| Memory envelope | What hot set must remain in memory? |
| Network profile | Which hops create the highest latency variance? |
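As a quick sketch of the estimation arithmetic, here is a dependency-free helper. Every number in it (1M active keys, 5 endpoints per key, ~120 bytes per Redis entry, 50k steady QPS, 4x burst factor) is an illustrative assumption, not a measurement:

```java
// Back-of-envelope sizing for the limiter's counter store.
// All inputs below are illustrative assumptions, not measured values.
class RateLimiterEstimates {

    // Memory to hold one counter entry per (apiKey, endpoint) pair.
    static long counterMemoryBytes(long activeKeys, int endpointsPerKey, int bytesPerEntry) {
        return activeKeys * endpointsPerKey * (long) bytesPerEntry;
    }

    // Peak limiter checks per second: every API request performs one check.
    static long peakChecksPerSecond(long steadyQps, double burstFactor) {
        return (long) (steadyQps * burstFactor);
    }

    public static void main(String[] args) {
        // Assumption: 1M active API keys, 5 rate-limited endpoints each,
        // ~120 bytes per Redis entry (key string + counter + expiry metadata).
        long mem = counterMemoryBytes(1_000_000L, 5, 120);
        System.out.println("Counter store memory ~ " + mem / (1024 * 1024) + " MiB"); // ~572 MiB

        // Assumption: 50k steady QPS with a 4x burst amplification factor.
        System.out.println("Peak limiter checks/s ~ " + peakChecksPerSecond(50_000, 4.0)); // 200000
    }
}
```

Under these assumptions the whole hot counter set fits comfortably in one Redis node's memory, so throughput, not storage, is the first scaling axis.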
Design Goals
- Keep synchronous user-facing paths short and deterministic.
- Shift heavy side effects and fan-out work to asynchronous channels.
- Minimize coupling between control-plane and data-plane components.
- Introduce complexity in phases tied to measurable bottlenecks.
Performance Analysis
| Pressure point | Symptom | First response | Second response |
| --- | --- | --- | --- |
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |
Metrics that should drive architecture evolution:
- p95 and p99 latency by operation.
- Error-budget burn by service and endpoint.
- Queue lag, retry volume, and dead-letter trends.
- Cache hit ratio by key family.
- Partition or shard utilization skew.
High Level Design - Architecture for Functional Requirements
Building Blocks
- Policy service, token-bucket engine, distributed counter store, and edge enforcement integrated with gateway.
- API edge layer for authentication, authorization, and policy checks.
- Domain services for read and write responsibilities.
- Durable storage plus cache for fast retrieval and controlled consistency.
- Async event path for secondary processing and integrations.
Design the APIs
- Keep contracts explicit and version-friendly.
- Use idempotency keys for retriable writes.
- Return actionable error metadata for clients and retries.
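The idempotency-key bullet above reduces to a dedupe map keyed by the client-supplied key. A minimal single-process sketch (class and method names are hypothetical; production would use a shared, TTL'd store so retries landing on any instance see the earlier result):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of idempotency-key handling for retriable writes.
class IdempotentWriteHandler {
    private final Map<String, String> completed = new ConcurrentHashMap<>();
    private int executions = 0;

    // Returns the cached result if this idempotency key was already processed;
    // otherwise performs the write exactly once and records the outcome.
    String handle(String idempotencyKey, String payload) {
        return completed.computeIfAbsent(idempotencyKey, k -> {
            executions++;                     // the real side effect runs only once
            return "created:" + payload;
        });
    }

    int executions() { return executions; }
}
```

A retried POST carrying the same key then returns the original result instead of creating a duplicate resource.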
Communication Between Components
- Synchronous path for user-visible confirmation.
- Asynchronous path for fan-out, indexing, notifications, and analytics.
Data Flow
- Request -> gateway policy check -> token budget -> allow or reject -> emit usage metrics.
```mermaid
flowchart TD
    A[Client or Producer] --> B[API and Policy Layer]
    B --> C[Core Domain Service]
    C --> D[Primary Data Store and Cache]
    C --> E[Async Event or Job Queue]
    D --> F[User-Facing Response]
    E --> G[Workers and Integrations]
    G --> H[State Update and Telemetry]
```
Real-World Applications and API Mapping
This architecture pattern appears in real production systems because traffic is bursty, dependencies fail partially, and correctness requirements vary by operation type.
Practical API mapping examples:
- POST /resources for write operations with idempotency support.
- GET /resources/{id} for low-latency object retrieval.
- GET /resources?cursor= for scalable pagination and stable traversal.
- Async event emissions for indexing, notifications, and reporting.
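On the rejection path, the "actionable error metadata" usually takes the form of rate-limit headers on the 429 response. A sketch using the common X-RateLimit-* naming convention (the exact header names are an assumption, not something this design mandates):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Headers a 429 rejection should carry so clients can back off intelligently.
class RateLimitHeaders {
    static Map<String, String> rejection(long limit, long remaining,
                                         long resetEpochSeconds, long nowEpochSeconds) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("X-RateLimit-Limit", Long.toString(limit));
        h.put("X-RateLimit-Remaining", Long.toString(Math.max(0, remaining)));
        h.put("X-RateLimit-Reset", Long.toString(resetEpochSeconds));
        // Retry-After tells well-behaved clients exactly how long to wait.
        h.put("Retry-After", Long.toString(Math.max(0, resetEpochSeconds - nowEpochSeconds)));
        return h;
    }
}
```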
Real-world system behavior is defined during failure, not normal operation. Good designs clearly specify what can be stale, what must be exact, and what should fail fast to preserve reliability.
Trade-offs & Failure Modes (Design Deep Dive for Non Functional Requirements)
Scaling Strategy
- Scale stateless services horizontally behind load balancing.
- Partition stateful data by access-pattern-aware keys.
- Add queue-based buffering where write bursts exceed synchronous capacity.
Availability and Resilience
- Multi-instance deployment across failure domains.
- Replication and failover planning for stateful systems.
- Circuit breakers, retries with backoff, and bounded timeouts.
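The "retries with backoff" bullet is typically implemented as capped exponential backoff with full jitter; a minimal sketch with illustrative constants:

```java
import java.util.concurrent.ThreadLocalRandom;

// Capped exponential backoff with full jitter for retrying failed calls.
class Backoff {
    // Deterministic cap schedule: baseMs * 2^attempt, bounded by maxDelayMs.
    static long capMs(long baseMs, int attempt, long maxDelayMs) {
        long exp = baseMs << Math.min(attempt, 20);   // clamp the shift to avoid overflow
        return Math.min(exp, maxDelayMs);
    }

    // Full jitter: sleep a uniform random duration in [0, cap] before retrying.
    static long nextDelayMs(long baseMs, int attempt, long maxDelayMs) {
        return ThreadLocalRandom.current().nextLong(capMs(baseMs, attempt, maxDelayMs) + 1);
    }
}
```

Jitter matters here because synchronized retries from many gateway instances would otherwise hit a recovering dependency in lockstep.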
Storage and Caching
- Cache-aside for read-heavy access paths.
- Explicit invalidation and refresh policy.
- Tiered storage for hot, warm, and cold access profiles.
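Cache-aside from the first bullet, reduced to its essential read and invalidation paths. This is a single-threaded sketch with hypothetical names; the store lambda stands in for the durable database:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside: read through the cache, fall back to the store on a miss,
// populate the cache, and invalidate explicitly after writes.
class CacheAside<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> store;   // stands in for the durable database read
    int storeReads = 0;                   // exposed so miss behavior is observable

    CacheAside(Function<K, V> store) { this.store = store; }

    V get(K key) {
        V v = cache.get(key);
        if (v == null) {                  // miss: load from the store, then populate
            storeReads++;
            v = store.apply(key);
            cache.put(key, v);
        }
        return v;
    }

    void invalidate(K key) { cache.remove(key); }  // call after every write to the store
}
```

The explicit `invalidate` is the "explicit invalidation and refresh policy" bullet in code form: forgetting it after a write is the classic stale-read bug in this pattern.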
Consistency, Security, and Monitoring
- Clear strong vs eventual consistency contracts per operation.
- Authentication, authorization, and encryption in transit and at rest.
- Monitoring stack with metrics, logs, traces, SLO dashboards, and alerting.
This section is the architecture-for-NFRs view from your template. It explains how the system remains stable under scale, failures, and incident pressure.
Decision Guide
| Situation | Recommendation |
| --- | --- |
| Early stage with moderate traffic | Keep architecture minimal and highly observable |
| Read-heavy workload dominates | Optimize cache and read model before complex rewrites |
| Write hotspots appear | Rework key strategy and partitioning plan |
| Incident frequency increases | Strengthen SLOs, runbooks, and fallback controls |
Practical Example for Interview Delivery
A repeatable way to deliver this design in interviews:
- Start with actors, use cases, and scope boundaries.
- State estimation assumptions (QPS, payload size, storage growth).
- Draw HLD and explain each component responsibility.
- Walk through one failure cascade and mitigation strategy.
- Describe phase-based evolution for 10x traffic.
Question-specific practical note:
- Keep decisions near ingress, use idempotent decrement logic, and apply multi-window quotas for fairness.
A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."
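The multi-window quota note above can be sketched as a set of fixed windows that must all have budget before a request is admitted. This is an in-memory illustration with hypothetical names; the production version keeps these counters in the shared Redis store:

```java
// A request passes only if every window (e.g. per-second AND per-minute)
// has remaining budget; this stops one client from draining a whole
// minute's quota in a single second.
class MultiWindowQuota {
    private final long[] limits;         // limits[i] applies to windowSeconds[i]
    private final long[] windowSeconds;
    private final long[] counts;
    private final long[] windowStart;

    MultiWindowQuota(long[] limits, long[] windowSeconds) {
        this.limits = limits;
        this.windowSeconds = windowSeconds;
        this.counts = new long[limits.length];
        this.windowStart = new long[limits.length];
    }

    synchronized boolean tryAcquire(long nowSeconds) {
        for (int i = 0; i < limits.length; i++) {
            if (nowSeconds - windowStart[i] >= windowSeconds[i]) {  // window rolled over
                windowStart[i] = nowSeconds;
                counts[i] = 0;
            }
            if (counts[i] >= limits[i]) return false;  // reject without consuming any budget
        }
        for (int i = 0; i < limits.length; i++) counts[i]++;       // consume all windows together
        return true;
    }
}
```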
Advanced Concepts for Production Evolution
When interviewers ask follow-up scaling questions, use a phased approach:
- Stabilize critical path dependencies with better observability.
- Increase throughput by isolating heavy side effects asynchronously.
- Reduce hotspot pressure through key redesign and repartitioning.
- Improve resilience using automated failover and tested runbooks.
- Expand to multi-region only when latency, compliance, or reliability targets require it.
This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.
Redis-Backed Token Bucket: The One Decision That Solves the Multi-Instance Problem
The core architectural decision in a distributed rate limiter is replacing per-process counters with a Redis-backed atomic token bucket shared across all gateway instances. Without this, 20 gateway pods each enforcing 5,000 req/min independently create a global effective limit of 100,000 req/min, not the intended 5,000.
```java
// Redis-backed distributed bucket: ALL gateway instances share the same counter in Redis.
// Each (apiKey, endpoint) pair gets its own bucket key: "ratelimit:{apiKey}:{endpoint}".
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.BucketConfiguration;
import java.time.Duration;

BucketConfiguration config = BucketConfiguration.builder()
        .addLimit(Bandwidth.builder()
                .capacity(100)
                .refillGreedy(100, Duration.ofMinutes(1)) // token bucket: smooth refill, not fixed window
                .initialTokens(20)                        // burst allowance: first 20 requests are instant
                .build())
        .build();

// `buckets` is a Bucket4j ProxyManager<String> backed by the shared Redis cluster
// (e.g. a LettuceBasedProxyManager from the bucket4j-redis module).
Bucket bucket = buckets.builder()
        .build("ratelimit:" + apiKey + ":" + endpoint, () -> config);

// bucket.tryConsume(1) runs an atomic compare-and-swap on Redis: O(1), no distributed lock.
// Returns false -> caller should respond HTTP 429 with Retry-After: 60.
boolean allowed = bucket.tryConsume(1);
```
bucket.tryConsume(1) executes a compare-and-swap Lua script on Redis: the token decrement is atomic across all gateway instances sharing the same Redis cluster. This is why token bucket beats fixed-window counting for this use case: a 10-second burst of 20 requests followed by silence is handled gracefully by the initialTokens parameter, whereas a fixed window would reject legitimate burst traffic at window boundaries.
A raw Redis Lua script (INCR + EXPIRE) is the minimal alternative; the trade-off is that smooth token-bucket refill semantics must be implemented manually in the script rather than delegated to Bucket4j.
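To make the token-bucket vs fixed-window contrast concrete, here is a dependency-free sketch of both admission policies. Class names and numbers are illustrative, and the Redis plumbing is stripped out so only the logic remains; the production path delegates this to Bucket4j on Redis:

```java
// Two admission policies side by side, with the distributed plumbing removed.
class BucketVsWindow {

    // Token bucket: tokens refill continuously; bursts up to the current
    // token count are admitted immediately.
    static class TokenBucket {
        final double capacity, ratePerSecond;
        double tokens, lastSec;
        TokenBucket(double capacity, double ratePerSecond) {
            this.capacity = capacity;
            this.ratePerSecond = ratePerSecond;
            this.tokens = capacity;
        }
        boolean tryConsume(double nowSec) {
            tokens = Math.min(capacity, tokens + (nowSec - lastSec) * ratePerSecond);
            lastSec = nowSec;
            if (tokens >= 1) { tokens -= 1; return true; }
            return false;
        }
    }

    // Fixed window: the counter resets at each window boundary, so legitimate
    // bursts inside a window are hard-rejected once the count is exhausted.
    static class FixedWindow {
        final long limit, windowSec;
        long count, windowStart;
        FixedWindow(long limit, long windowSec) { this.limit = limit; this.windowSec = windowSec; }
        boolean tryConsume(long nowSec) {
            if (nowSec - windowStart >= windowSec) { windowStart = nowSec; count = 0; }
            if (count >= limit) return false;
            count++;
            return true;
        }
    }
}
```

With a 20-token initial allowance, a 20-request burst at t=0 is admitted in full and later requests trickle in as tokens refill; the fixed window instead rejects everything past its count until the next boundary.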
For a full deep-dive on multi-tier quota policies, token bucket vs. sliding window trade-offs, and Redis cluster failover for rate limit state, a dedicated follow-up post is planned.
Lessons Learned
- Start with actors and use cases before drawing any diagram.
- Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
- Convert NFRs into measurable SLO-style targets.
- Separate functional HLD from non-functional deep dive reasoning.
- Scale the first measured bottleneck, not the most visible component.
TLDR: Summary & Key Takeaways
- Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
- Good HLDs explain both request flow and state update flow.
- Non-functional architecture determines reliability under pressure.
- Phase-based evolution outperforms one-shot overengineering.
- Theory-linked reasoning improves consistency across different interview prompts.
Practice Quiz
- Why should system design answers begin with actors and use cases?
A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements
Correct Answer: B
- Which section should define p95 and p99 targets?
A) Non Functional Requirements
B) Only the quiz section
C) Only the related posts section
Correct Answer: A
- What is the primary benefit of separating synchronous and asynchronous paths?
A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs
Correct Answer: B
- Open-ended challenge: for this design, which component would you scale first at 10x traffic and which metric would you use to justify that decision?
Written by
Abstract Algorithms