Series

System Design Interview Prep

A comprehensive series to help you design scalable, reliable, and fault-tolerant systems.

Articles: 67

Estimated reading: 21h 19m

Knowledge level: Intermediate to Advanced

Readers: 2,185

About this series

Learn with real world examples
Connect articles into a structured path
Best practices and trade-offs
Interview focused insights
Continuously updated content

Who is this for?

Software engineers and developers learning this topic.

Last Updated

May 30, 2026

Created by

Abstract Algorithms

All Articles

Article 1

NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data

TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node; virtual nod

24 min read

Article 2

Clock Skew and Causality Violations: Why Distributed Clocks Lie

TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions, but under load, across datacenters, or aft

19 min read

Article 3

Stale Reads and Cascading Failures in Distributed Systems

TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redi

25 min read

Article 4

Split Brain Explained: When Two Nodes Both Think They Are Leader

TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader, each accepting writes the other never sees. Prevent it with quorum consensus (at lea

22 min read

Article 5

SQL Partitioning: Range, Hash, List, and Composite Strategies Explained

TLDR: SQL partitioning divides one logical table into smaller physical child tables, all accessed through the parent table name. The query optimizer skips irrelevant child tables entirely, a process

25 min read

Article 6

CosmosDB Partition Internals: Logical vs Physical Partitions Explained

🔥 When Your Database Bill Triples Overnight: A retail engineering team ships a flash-sale feature. Traffic spikes 10×. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now ta

16 min read

Article 7

Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning

TLDR: Partitioning splits one logical table into smaller physical pieces. The database skips irrelevant pieces entirely, turning a 30-second full-table scan into a sub-second single-partition read. S

12 min read

Article 8

Dirty Write Explained: When Uncommitted Data Gets Overwritten

TLDR: A dirty write occurs when Transaction B overwrites data that Transaction A has written but not yet committed. The result is not a rollback or an error; it is silently inconsistent committed dat

28 min read

Article 9

Read Skew Explained: Inconsistent Snapshots Across Multiple Objects

TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time (one before and one after a concurrent transaction commits), producing a view that never exis

34 min read

Article 10

Lost Update Explained: When Two Writes Become One

TLDR: A lost update occurs when two concurrent read-modify-write transactions both read the same committed value, both compute a new value from it, and both write back, with the second write silently

38 min read

Article 11

Phantom Read Explained: When New Rows Appear Mid-Transaction

TLDR: A phantom read occurs when a transaction runs the same range query twice and gets a different set of rows because a concurrent transaction inserted or deleted matching rows and committed in be

32 min read

Article 12

Write Skew Explained: The Anomaly That Requires Serializable Isolation

TLDR: Write skew is the hardest concurrency anomaly to reason about: two concurrent transactions each read a shared condition, decide they can safely proceed, and then write to different rows. No indi

23 min read

Article 13

Dirty Read Explained: How Uncommitted Data Corrupts Transactions

TLDR: A dirty read occurs when Transaction B reads data written by Transaction A before A has committed. If A rolls back, B has made decisions on data that, from the database's perspective, never ex

30 min read

Article 14

Non-Repeatable Read Explained: When the Same Query Returns Different Results

TLDR: A non-repeatable read happens when the same SELECT returns different results within a single transaction because a concurrent transaction committed an update between the two reads. Read Committe

26 min read

Article 15

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More

TLDR: Distributed systems produce anomalies not because the code is buggy but because physics makes perfect consistency impossible across network boundaries. Split brain, stale reads, clock skew, ca

13 min read

Article 16

Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared

TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose - range, hash, consistent hashing, or di

29 min read

Article 17

Isolation Levels in Databases: Read Committed, Repeatable Read, Snapshot, and Serializable Explained

TLDR: Isolation levels control which concurrency anomalies a transaction can see. Read Committed (PostgreSQL and Oracle's default) prevents dirty reads but still silently allows non-repeatable reads,

28 min read

Article 18

Key Terms in Distributed Systems: The Definitive Glossary

TLDR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. T

51 min read

Article 19

Database Anomalies: How SQL and NoSQL Handle Dirty Reads, Phantom Reads, and Write Skew

TLDR: Database anomalies are the predictable side-effects of concurrent transactions: dirty reads, phantom reads, write skew, and lost updates. SQL databases use MVCC and isolation levels to prevent

31 min read

Article 20

Designing for High Availability: The Road to 99.99% Reliability

TLDR: High Availability (HA) is the art of eliminating Single Points of Failure (SPOFs). By using Active-Active redundancy, automated health checks, and global failover via GSLB, you can achieve "Four

9 min read

Article 21

Choosing the Right Database: CAP Theorem and Practical Use Cases

TLDR: Database selection is a trade-off between consistency, availability, and scalability. By using the CAP Theorem as a compass and matching your data access patterns to the right storage engine (Re

7 min read

Article 22

ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond

TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong

26 min read

Article 23

Write-Time vs Read-Time Fan-Out: How Social Feeds Scale

TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time: fast reads but catastrophic write amplification for celebrities.

18 min read

Article 24

Real-Time Communication: WebSockets, SSE, and Long Polling Explained

TLDR: 🔌 WebSockets = bidirectional persistent channel; use for chat, gaming, collaborative editing. SSE = one-way server push over HTTP with built-in reconnect; use for AI streaming, live logs, not

23 min read

Article 25

Microservices Architecture: Decomposition, Communication, and Trade-offs

TLDR: Microservices let teams deploy and scale services independently, but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architectur

22 min read

Article 26

System Design HLD Example: Web Crawler

TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fet

18 min read

Article 27

System Design HLD Example: Video Streaming (YouTube/Netflix)

TLDR: A video streaming platform is a two-sided architectural beast: a batch-oriented transcoding pipeline that converts raw uploads into multi-resolution segments, and a real-time global delivery net

17 min read

Article 28

System Design HLD Example: Ride-Sharing (Uber/Lyft)

TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Se

16 min read

Article 29

System Design HLD Example: Proximity Service (Yelp/Google Places)

TLDR: A proximity service (Yelp/Google Places) solves the 2D search problem by encoding locations into Geohash strings, which are indexed in a standard B-tree. To guarantee results near grid boundarie

17 min read

Article 30

System Design HLD Example: Real-Time Leaderboard

TLDR: Real-time leaderboards for 10M+ active users require an in-memory ranking engine. Redis Sorted Sets (ZSET) are the industry standard, providing \(O(\log N)\) updates and rank lookups via an inte

16 min read

Article 31

System Design HLD Example: Distributed Job Scheduler

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row

17 min read

Article 32

System Design HLD Example: Hotel Booking System (Airbnb)

TLDR: A robust hotel booking system must guarantee atomicity in inventory subtraction. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (Postgr

15 min read

Article 33

System Design HLD Example: E-Commerce Platform (Amazon)

TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent microservices. The core architectural challenge is Inventory Correctness during flash

13 min read

Article 34

System Design HLD Example: Collaborative Document Editing (Google Docs)

TLDR: Real-time collaborative editing relies on Operational Transformation (OT) or CRDTs to resolve concurrent edits without data loss. The core trade-off is Latency vs. Consistency: we use optimistic

14 min read

Article 35

Distributed Transactions: 2PC, Saga, and XA Explained

TLDR: Distributed transactions require you to choose a consistency model before choosing a protocol. 2PC and XA give atomic all-or-nothing commits but block all participants on coordinator failure. Sa

26 min read

Article 36

System Design HLD Example: URL Shortener (TinyURL and Bitly)

TLDR: A URL shortener is a read-heavy system (100:1 ratio) that maps long URLs to short, unique aliases. The core scaling challenge is generating unique IDs without database contention, solved using a

19 min read

Article 37

System Design HLD Example: Search Autocomplete (Google/Amazon)

TLDR: Search autocomplete must respond in sub-10ms to feel "instant." The core trade-off is Latency vs. Data Freshness: we use an offline pipeline (Spark) to pre-calculate prefix-to-suggestion mapping

15 min read

Article 38

System Design HLD Example: Distributed Rate Limiter

TLDR: A distributed rate limiter protects APIs from abuse and "noisy neighbors" by enforcing request quotas across a cluster of servers. The core technical challenge is Atomic State Management, solved

19 min read

Article 39

System Design HLD Example: Payment Processing Platform

TLDR: Payment systems optimize for correctness first, then throughput. This guide covers idempotency, double-entry ledgers, and reconciliation. Stripe processes over 250 million API requests per day,

18 min read

Article 40

System Design HLD Example: Notification Service (Email, SMS, Push)

TLDR: A notification platform routes events to per-channel Kafka queues, deduplicates with Redis, and tracks delivery via webhooks, ensuring that critical alerts like password resets never get blocke

19 min read

Article 41

System Design HLD Example: News Feed (Home Timeline)

TLDR: A news feed system builds personalized timelines by combining content publishing, graph relationships, and ranking. The scalability crux is the fan-out amplified write path: a single celebrity p

20 min read

Article 42

System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)

TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize

18 min read

Article 43

System Design HLD Example: Distributed Cache Platform

TLDR: Distributed caches trade strict consistency for sub-millisecond read latency, using consistent hashing to scale horizontally without causing database-shattering "cache stampedes" during cluster

15 min read

Article 44

System Design HLD Example: Chat and Messaging Platform

TLDR: A distributed chat system must balance low-latency delivery with strong per-conversation ordering. The architectural crux is the WebSocket Gateway for persistent stateful connections and Cassand

19 min read

Article 45

System Design HLD Example: API Gateway for Microservices

TLDR: An API Gateway centralizes "cross-cutting concerns" like authentication, rate limiting, and routing at the edge of your infrastructure. The architectural crux is the separation of the Control Pl

16 min read

Article 46

System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances

TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infr

13 min read

Article 47

System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust

TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together,

13 min read

Article 48

System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems

TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue; it is definin

14 min read

Article 49

System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely

TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choos

13 min read

Article 50

System Design Requirements and Constraints: Ask Better Questions Before You Draw

TLDR: In system design interviews, weak answers fail early because requirements are fuzzy. Strong answers start by turning vague prompts into explicit functional scope, measurable non-functional targe

11 min read

Article 51

System Design Replication and Failover: Keep Services Alive When a Primary Dies

TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica,

15 min read

Article 52

System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions

TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no

13 min read

Article 53

System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers

TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple desi

13 min read

Article 54

System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change

TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includ

14 min read

Article 55

System Design API Design for Interviews: Contracts, Idempotency, and Pagination

TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error sema

12 min read

Article 56

The Role of Data in Precise Capacity Estimations for System Design

TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work

14 min read

Article 57

System Design: Complete Guide to Caching - Patterns, Eviction, and Distributed Strategies

TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through

33 min read

Article 58

System Design Advanced: Security, Rate Limiting, and Reliability

TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate fai

16 min read

Article 59

Little's Law: The Secret Formula for System Performance

TLDR: Little's Law (\(L = \lambda W\)) connects three metrics every system designer measures: \(L\) = concurrent requests in flight, \(\lambda\) = throughput (RPS), \(W\) = average response time. If l

9 min read

Article 60

The 8 Fallacies of Distributed Systems

TLDR: In 1994, L. Peter Deutsch at Sun Microsystems listed 8 assumptions that developers make about distributed systems, all of which are false. Believing them leads to hard-to-reproduce bugs,

13 min read

Article 61

Elasticsearch vs Time-Series DB: Key Differences Explained

TLDR: Elasticsearch is built for search: full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics: numeric time series with a

14 min read

Article 62

API Gateway vs. Load Balancer vs. Reverse Proxy: What's the Difference?

TLDR: A Reverse Proxy hides your servers and handles caching/SSL. A Load Balancer spreads traffic across server instances. An API Gateway manages API concerns: auth, rate limiting, routing, and proto

14 min read

Article 63

System Design Databases: SQL vs NoSQL and Scaling

TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better"; it is "which trade-offs align w

15 min read

Article 64

System Design Protocols: REST, RPC, and TCP/UDP

TLDR: 🎯 Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the h

17 min read

Article 65

System Design Networking: DNS, CDNs, and Load Balancers

TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a

15 min read

Article 66

System Design Core Concepts: Scalability, CAP, and Consistency

TLDR: 🚀 Scalability, the CAP Theorem, and consistency models are the three concepts that determine whether a distributed system can grow, stay reliable, and deliver correct results. Get these three r

14 min read

Article 67

The Ultimate Guide to Acing the System Design Interview

TLDR: System Design interviews are collaborative whiteboard sessions, not trick-question coding tests. Follow the framework: Requirements → Estimations → API → Data Model → High-Level Architecture →

16 min read

System Design Interview Prep: Learning Roadmap

Most engineers don't fail system design interviews because they lack content; they fail because they read topics in the wrong order: sharding before access patterns, consensus papers before requirements, tools before trade-offs. This roadmap fixes that with a dependency-first learning path organized into three tracks based on your timeline.

TLDR: This roadmap organizes 52 system design posts into three learning paths: a 2-week interview sprint, a 4-week backend depth plan, and full mastery, covering foundations, APIs, data, async/microservices, and 19 HLD worked examples.

What You'll Learn

Understand System Design Interview Prep through real published examples

Follow a sequence of 67 articles from fundamentals to deeper topics

Connect related concepts: System Design, Databases, NoSQL

Practice explaining trade-offs and implementation decisions

Prerequisites

Basic backend engineering knowledge
Familiarity with APIs, databases, and caching
Comfort reading architecture trade-offs

FAQs

How should I read this series?

Start from the first article if you are new, or use the article list to jump into the most relevant topic.

Is progress automatic?

Progress is based on articles opened from this browser using the local learning history.