Series
System Design Interview Prep
A comprehensive series to help you design scalable, reliable, and fault-tolerant systems.
67
Articles
21h 19m
Estimated reading
Intermediate to Advanced
Knowledge level
2,185
Readers
Who is this for?
Software engineers and developers learning this topic.
Last Updated
May 30, 2026
Created by
Abstract Algorithms
All Articles
Article 1
NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data
TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node; virtual nod
24 min read
Article 2
Clock Skew and Causality Violations: Why Distributed Clocks Lie
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions, but under load, across datacenters, or aft
19 min read

Article 3
Stale Reads and Cascading Failures in Distributed Systems
TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redi
25 min read

Article 4
Split Brain Explained: When Two Nodes Both Think They Are Leader
TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader, each accepting writes the other never sees. Prevent it with quorum consensus (at lea
22 min read
Article 5
SQL Partitioning: Range, Hash, List, and Composite Strategies Explained
TLDR: SQL partitioning divides one logical table into smaller physical child tables, all accessed through the parent table name. The query optimizer skips irrelevant child tables entirely, a process
25 min read
Article 6
CosmosDB Partition Internals: Logical vs Physical Partitions Explained
When Your Database Bill Triples Overnight: A retail engineering team ships a flash-sale feature. Traffic spikes 10×. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now ta
16 min read

Article 7
Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning
TLDR: Partitioning splits one logical table into smaller physical pieces. The database skips irrelevant pieces entirely, turning a 30-second full-table scan into a sub-second single-partition read. S
12 min read

Article 8
Dirty Write Explained: When Uncommitted Data Gets Overwritten
TLDR: A dirty write occurs when Transaction B overwrites data that Transaction A has written but not yet committed. The result is not a rollback or an error; it is silently inconsistent committed dat
28 min read

Article 9
Read Skew Explained: Inconsistent Snapshots Across Multiple Objects
TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time (one before and one after a concurrent transaction commits), producing a view that never exis
34 min read

Article 10
Lost Update Explained: When Two Writes Become One
TLDR: A lost update occurs when two concurrent read-modify-write transactions both read the same committed value, both compute a new value from it, and both write back, with the second write silently
38 min read

Article 11
Phantom Read Explained: When New Rows Appear Mid-Transaction
TLDR: A phantom read occurs when a transaction runs the same range query twice and gets a different set of rows because a concurrent transaction inserted or deleted matching rows and committed in be
32 min read

Article 12
Write Skew Explained: The Anomaly That Requires Serializable Isolation
TLDR: Write skew is the hardest concurrency anomaly to reason about: two concurrent transactions each read a shared condition, decide they can safely proceed, and then write to different rows. No indi
23 min read
Article 13
Dirty Read Explained: How Uncommitted Data Corrupts Transactions
TLDR: A dirty read occurs when Transaction B reads data written by Transaction A before A has committed. If A rolls back, B has made decisions on data that, from the database's perspective, never ex
30 min read
Article 14
Non-Repeatable Read Explained: When the Same Query Returns Different Results
TLDR: A non-repeatable read happens when the same SELECT returns different results within a single transaction because a concurrent transaction committed an update between the two reads. Read Committe
26 min read

Article 15
Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More
TLDR: Distributed systems produce anomalies not because the code is buggy but because physics makes perfect consistency impossible across network boundaries. Split brain, stale reads, clock skew, ca
13 min read

Article 16
Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared
TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose - range, hash, consistent hashing, or di
29 min read

Article 17
Isolation Levels in Databases: Read Committed, Repeatable Read, Snapshot, and Serializable Explained
TLDR: Isolation levels control which concurrency anomalies a transaction can see. Read Committed (PostgreSQL and Oracle's default) prevents dirty reads but still silently allows non-repeatable reads,
28 min read

Article 18
Key Terms in Distributed Systems: The Definitive Glossary
TLDR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. T
51 min read
Article 19
Database Anomalies: How SQL and NoSQL Handle Dirty Reads, Phantom Reads, and Write Skew
TLDR: Database anomalies are the predictable side-effects of concurrent transactions: dirty reads, phantom reads, write skew, and lost updates. SQL databases use MVCC and isolation levels to prevent
31 min read

Article 20
Designing for High Availability: The Road to 99.99% Reliability
TLDR: High Availability (HA) is the art of eliminating Single Points of Failure (SPOFs). By using Active-Active redundancy, automated health checks, and global failover via GSLB, you can achieve "Four
9 min read

Article 21
Choosing the Right Database: CAP Theorem and Practical Use Cases
TLDR: Database selection is a trade-off between consistency, availability, and scalability. By using the CAP Theorem as a compass and matching your data access patterns to the right storage engine (Re
7 min read
Article 22
ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond
TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong
26 min read

Article 23
Write-Time vs Read-Time Fan-Out: How Social Feeds Scale
TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time: fast reads but catastrophic write amplification for celebrities.
18 min read
Article 24
Real-Time Communication: WebSockets, SSE, and Long Polling Explained
TLDR: WebSockets = bidirectional persistent channel; use for chat, gaming, collaborative editing. SSE = one-way server push over HTTP with built-in reconnect; use for AI streaming, live logs, not
23 min read
Article 25
Microservices Architecture: Decomposition, Communication, and Trade-offs
TLDR: Microservices let teams deploy and scale services independently, but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architectur
22 min read

Article 26
System Design HLD Example: Web Crawler
TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fet
18 min read
Article 27
System Design HLD Example: Video Streaming (YouTube/Netflix)
TLDR: A video streaming platform is a two-sided architectural beast: a batch-oriented transcoding pipeline that converts raw uploads into multi-resolution segments, and a real-time global delivery net
17 min read
Article 28
System Design HLD Example: Ride-Sharing (Uber/Lyft)
TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Se
16 min read
Article 29
System Design HLD Example: Proximity Service (Yelp/Google Places)
TLDR: A proximity service (Yelp/Google Places) solves the 2D search problem by encoding locations into Geohash strings, which are indexed in a standard B-tree. To guarantee results near grid boundarie
17 min read
Article 30
System Design HLD Example: Real-Time Leaderboard
TLDR: Real-time leaderboards for 10M+ active users require an in-memory ranking engine. Redis Sorted Sets (ZSET) are the industry standard, providing \(O(\log N)\) updates and rank lookups via an inte
16 min read
Article 31
System Design HLD Example: Distributed Job Scheduler
TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row
17 min read
Article 32
System Design HLD Example: Hotel Booking System (Airbnb)
TLDR: A robust hotel booking system must guarantee atomicity in inventory subtraction. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (Postgr
15 min read
Article 33
System Design HLD Example: E-Commerce Platform (Amazon)
TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent microservices. The core architectural challenge is Inventory Correctness during flash
13 min read
Article 34
System Design HLD Example: Collaborative Document Editing (Google Docs)
TLDR: Real-time collaborative editing relies on Operational Transformation (OT) or CRDTs to resolve concurrent edits without data loss. The core trade-off is Latency vs. Consistency: we use optimistic
14 min read
Article 35
Distributed Transactions: 2PC, Saga, and XA Explained
TLDR: Distributed transactions require you to choose a consistency model before choosing a protocol. 2PC and XA give atomic all-or-nothing commits but block all participants on coordinator failure. Sa
26 min read

Article 36
System Design HLD Example: URL Shortener (TinyURL and Bitly)
TLDR: A URL shortener is a read-heavy system (100:1 ratio) that maps long URLs to short, unique aliases. The core scaling challenge is generating unique IDs without database contention, solved using a
19 min read
Article 37
System Design HLD Example: Search Autocomplete (Google/Amazon)
TLDR: Search autocomplete must respond in sub-10ms to feel "instant." The core trade-off is Latency vs. Data Freshness: we use an offline pipeline (Spark) to pre-calculate prefix-to-suggestion mapping
15 min read
Article 38
System Design HLD Example: Distributed Rate Limiter
TLDR: A distributed rate limiter protects APIs from abuse and "noisy neighbors" by enforcing request quotas across a cluster of servers. The core technical challenge is Atomic State Management, solved
19 min read
Article 39
System Design HLD Example: Payment Processing Platform
TLDR: Payment systems optimize for correctness first, then throughput. This guide covers idempotency, double-entry ledgers, and reconciliation. Stripe processes over 250 million API requests per day,
18 min read
Article 40
System Design HLD Example: Notification Service (Email, SMS, Push)
TLDR: A notification platform routes events to per-channel Kafka queues, deduplicates with Redis, and tracks delivery via webhooks, ensuring that critical alerts like password resets never get blocke
19 min read
Article 41
System Design HLD Example: News Feed (Home Timeline)
TLDR: A news feed system builds personalized timelines by combining content publishing, graph relationships, and ranking. The scalability crux is the fan-out amplified write path: a single celebrity p
20 min read
Article 42
System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)
TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize
18 min read
Article 43
System Design HLD Example: Distributed Cache Platform
TLDR: Distributed caches trade strict consistency for sub-millisecond read latency, using consistent hashing to scale horizontally without causing database-shattering "cache stampedes" during cluster
15 min read
Article 44
System Design HLD Example: Chat and Messaging Platform
TLDR: A distributed chat system must balance low-latency delivery with strong per-conversation ordering. The architectural crux is the WebSocket Gateway for persistent stateful connections and Cassand
19 min read
Article 45
System Design HLD Example: API Gateway for Microservices
TLDR: An API Gateway centralizes "cross-cutting concerns" like authentication, rate limiting, and routing at the edge of your infrastructure. The architectural crux is the separation of the Control Pl
16 min read
Article 46
System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances
TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infr
13 min read
Article 47
System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust
TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together,
13 min read
Article 48
System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems
TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue; it is definin
14 min read
Article 49
System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely
TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choos
13 min read
Article 50
System Design Requirements and Constraints: Ask Better Questions Before You Draw
TLDR: In system design interviews, weak answers fail early because requirements are fuzzy. Strong answers start by turning vague prompts into explicit functional scope, measurable non-functional targe
11 min read
Article 51
System Design Replication and Failover: Keep Services Alive When a Primary Dies
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica,
15 min read
Article 52
System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions
TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no
13 min read
Article 53
System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers
TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple desi
13 min read
Article 54
System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change
TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includ
14 min read
Article 55
System Design API Design for Interviews: Contracts, Idempotency, and Pagination
TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error sema
12 min read
Article 56
The Role of Data in Precise Capacity Estimations for System Design
TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work
14 min read
Article 57
System Design: Complete Guide to Caching - Patterns, Eviction, and Distributed Strategies
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through
33 min read
Article 58
System Design Advanced: Security, Rate Limiting, and Reliability
TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate fai
16 min read

Article 59
Little's Law: The Secret Formula for System Performance
TLDR: Little's Law (\(L = \lambda W\)) connects three metrics every system designer measures: \(L\) = concurrent requests in flight, \(\lambda\) = throughput (RPS), \(W\) = average response time. If l
9 min read
Article 60
The 8 Fallacies of Distributed Systems
TLDR: In 1994, L. Peter Deutsch at Sun Microsystems listed 8 assumptions that developers make about distributed systems, all of which are false. Believing them leads to hard-to-reproduce bugs,
13 min read
Article 61
Elasticsearch vs Time-Series DB: Key Differences Explained
TLDR: Elasticsearch is built for search: full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics: numeric time series with a
14 min read

Article 62
API Gateway vs. Load Balancer vs. Reverse Proxy: What's the Difference?
TLDR: A Reverse Proxy hides your servers and handles caching/SSL. A Load Balancer spreads traffic across server instances. An API Gateway manages API concerns: auth, rate limiting, routing, and proto
14 min read

Article 63
System Design Databases: SQL vs NoSQL and Scaling
TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better"; it is "which trade-offs align w
15 min read

Article 64
System Design Protocols: REST, RPC, and TCP/UDP
TLDR: Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the h
17 min read

Article 65
System Design Networking: DNS, CDNs, and Load Balancers
TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a
15 min read

Article 66
System Design Core Concepts: Scalability, CAP, and Consistency
TLDR: Scalability, the CAP Theorem, and consistency models are the three concepts that determine whether a distributed system can grow, stay reliable, and deliver correct results. Get these three r
14 min read

Article 67
The Ultimate Guide to Acing the System Design Interview
TLDR: System Design interviews are collaborative whiteboard sessions, not trick-question coding tests. Follow the framework: Requirements → Estimations → API → Data Model → High-Level Architecture →
16 min read
System Design Interview Prep: Learning Roadmap
Most engineers don't fail system design interviews because they lack content; they fail because they read topics in the wrong order. Sharding before access patterns, consensus papers before requirements, tools before trade-offs. This roadmap fixes that with a dependency-first learning path organized into three tracks based on your timeline.
TLDR: This roadmap organizes 52 system design posts into three learning paths: a 2-week interview sprint, a 4-week backend depth plan, and full mastery, covering foundations, APIs, data, async/microservices, and 19 HLD worked examples.
What You'll Learn
Understand System Design Interview Prep through real published examples
Follow a sequence of 67 articles from fundamentals to deeper topics
Connect related concepts: System Design, Databases, NoSQL
Practice explaining trade-offs and implementation decisions
FAQs
How should I read this series?
Start from the first article if you are new, or use the article list to jump into the most relevant topic.
Is progress automatic?
Progress is based on articles opened from this browser using the local learning history.