System Design Replication and Failover: Keep Services Alive When a Primary Dies
A clear guide to replicas, failover detection, replica lag, and the trade-offs behind reliable database designs.
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupting data.
TLDR: If one database going down takes your product down, replication is the next design step. Failover is how you make that redundancy usable.
Why Replication Shows Up in Every Serious System Design Conversation
In a beginner system design interview, a single database is often a perfectly fine starting point. It is easy to explain, easy to operate, and often good enough for the first version of a product.
The trouble starts when that one database becomes both the storage layer and the single point of failure. If the node crashes, the application can still be healthy, the cache can still be warm, and the API can still be reachable, but users will experience an outage because the source of truth is gone.
That is why replication usually enters the conversation right after the simple design. If you came here from System Design Interview Basics, this is the deeper follow-up to the line "add replication for reliability."
Replication answers one question: how do we keep another copy of the same data available elsewhere? Failover answers the next question: how do we decide which copy becomes authoritative when the primary fails?
| Without replication | With replication |
| --- | --- |
| One node holds all writes and all reads | A primary handles writes while replicas provide redundancy |
| Any crash can cause a full outage | A crash can trigger promotion of a healthy replica |
| Reads compete with writes on one machine | Reads can be spread across replicas |
| Recovery depends on backups alone | Recovery can use already-running replicas |
The crucial interview insight is that replication is not just about uptime. It also affects latency, read scale, consistency, recovery complexity, and operational risk.
Primary, Replicas, and the Three Questions You Must Answer
Every replicated system has to answer three practical questions.
Question 1: Where do writes go? In the simplest model, all writes go to a single primary node. The primary serializes changes and ships them to replicas.
Question 2: Where do reads go? Some systems send all reads to the primary for strong consistency. Others send read traffic to replicas to reduce load. That improves scale, but it creates the risk of stale reads when replicas lag.
Question 3: Who decides that the primary is dead? This is the core of failover. If every node can promote itself independently, split-brain becomes possible. A good design needs a leader election rule, quorum, or external coordinator.
Here is the common vocabulary:
| Term | Meaning | Why it matters |
| --- | --- | --- |
| Primary | The authoritative node for writes | Keeps write ordering simple |
| Replica | A follower that replays changes from the primary | Adds redundancy and read scale |
| Synchronous replication | Primary waits for one or more replicas before acknowledging commit | Safer, but slower writes |
| Asynchronous replication | Primary acknowledges first and replicas catch up later | Faster, but can lose recent writes on failover |
| Failover | Promote a replica and redirect traffic | Restores service after primary loss |
This is why replication is always a trade-off discussion. The more aggressively you protect against data loss, the more latency and coordination you usually add.
How Writes, Replicas, and Promotion Actually Work
At a high level, replicated databases usually follow a log-driven pattern.
- A client sends a write to the primary.
- The primary appends the change to a durable log.
- The primary applies the change locally.
- Replicas receive the change log and replay it in order.
- Reads may go to the primary, to replicas, or to both.
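Before tracing the timing, it can help to see the pattern as code. The sketch below is a minimal in-memory model, not how any particular database implements replication: the primary appends to an ordered log, replicas replay entries in order, and a flag controls whether the primary waits for replicas before acknowledging.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of log-driven replication.
// Illustrative only: no durability, no networking, no crash handling.
class LogEntry {
    final long sequence;   // position in the ordered change log
    final String change;   // e.g. "INSERT order_123"

    LogEntry(long sequence, String change) {
        this.sequence = sequence;
        this.change = change;
    }
}

class Replica {
    private long appliedUpTo = 0;                          // highest sequence replayed so far
    private final List<String> state = new ArrayList<>();

    // Followers replay the change stream strictly in order.
    void replay(LogEntry entry) {
        if (entry.sequence != appliedUpTo + 1) {
            throw new IllegalStateException("gap in replication stream");
        }
        state.add(entry.change);
        appliedUpTo = entry.sequence;
    }
}

class Primary {
    private final List<LogEntry> log = new ArrayList<>();   // stands in for the WAL/binlog
    private final List<Replica> replicas = new ArrayList<>();
    private final boolean synchronous;                       // wait for replicas before acking?

    Primary(boolean synchronous) { this.synchronous = synchronous; }

    void attach(Replica replica) { replicas.add(replica); }

    // Mirrors steps t1-t3 from the timing table below.
    void write(String change) {
        LogEntry entry = new LogEntry(log.size() + 1, change);
        log.add(entry);                                   // t1: append to the ordered log
        if (synchronous) {
            replicas.forEach(r -> r.replay(entry));       // t2: wait for replicas to apply
        }
        // t3: acknowledge the client; in asynchronous mode the
        // replicas replay the entry later (t4), after this point.
    }
}
```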
That sounds simple until you trace the timing.
| Step | What happens | Operational consequence |
| --- | --- | --- |
| t0 | Client sends INSERT order_123 | User is waiting for confirmation |
| t1 | Primary writes to its local WAL/binlog | Durability starts here |
| t2 | Primary optionally waits for replica acknowledgment | Higher safety, higher latency |
| t3 | Primary replies success to client | User sees commit complete |
| t4 | Replica applies the change | Read replicas may still be behind |
If the primary crashes between t3 and t4, the answer depends on the replication mode. With asynchronous replication, the newest write may not exist on any surviving replica. With synchronous replication, the client waited longer, but the system is more likely to preserve that write during failover.
That single timing gap is the reason replication discussions matter in interviews. A strong answer does not just say "I would add replicas." It says whether those replicas are protecting availability, read throughput, or data durability.
Deep Dive: What Makes Failover Safe Instead of Chaotic
Failover is not merely switching traffic to another box. It is a correctness problem disguised as an availability feature.
The Internals: Write-Ahead Logs, Heartbeats, and Promotion
Most production databases replicate ordered change records rather than shipping whole tables after every write. PostgreSQL uses a write-ahead log. MySQL uses binlogs. The idea is the same: replicas apply the same ordered stream of changes to converge toward primary state.
On top of that replication stream, the system needs health signals. Common signals include:
- Heartbeats between nodes.
- A replication lag metric such as seconds behind primary.
- A quorum-based lease or vote so only one node can become leader.
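As a rough sketch of how those signals feed a failover decision (the class name and thresholds below are illustrative, not taken from any particular system), a controller typically combines a heartbeat timeout with a lag bound:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative health signals for a failover controller.
// Thresholds are placeholders, not recommendations.
class HealthSignals {
    private static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(5);
    private static final Duration MAX_LAG_FOR_PROMOTION = Duration.ofSeconds(30);

    // The primary is only suspected dead after heartbeats have been
    // missing for longer than the timeout, to avoid reacting to slowness.
    boolean primarySuspectedDead(Instant lastHeartbeat, Instant now) {
        return Duration.between(lastHeartbeat, now).compareTo(HEARTBEAT_TIMEOUT) > 0;
    }

    // Replicas that are too far behind are excluded from automatic promotion.
    boolean eligibleForPromotion(Duration replicationLag) {
        return replicationLag.compareTo(MAX_LAG_FOR_PROMOTION) <= 0;
    }
}
```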
When the primary appears unhealthy, a failover controller typically asks three questions:
- Is the primary really unreachable, or just slow?
- Which replica is most up to date?
- Can we promote exactly one replica without creating split-brain?
That third question is the dangerous one. If two replicas promote themselves in different network partitions, clients may write divergent states. Merging that later is painful or impossible depending on the workload.
This is why many systems use a consensus service or a strong election protocol. Even if you do not mention every technology by name in an interview, you should explain that leader promotion needs coordination, not guesswork.
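A hedged sketch of that promotion decision might look like the following; ReplicaStatus, the vote count, and the quorum rule are simplified stand-ins for what a consensus service or election protocol would provide:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Simplified promotion decision: require a majority vote before acting,
// then promote the most up-to-date reachable replica.
record ReplicaStatus(String nodeId, long lastReplayedPosition, boolean reachable) {}

class FailoverController {
    private final int clusterSize;

    FailoverController(int clusterSize) { this.clusterSize = clusterSize; }

    Optional<String> choosePrimary(List<ReplicaStatus> replicas, int votesForFailover) {
        // Question 3: without a majority, do nothing. Refusing to promote is
        // what prevents two partitions from each electing their own leader.
        if (votesForFailover <= clusterSize / 2) {
            return Optional.empty();
        }
        // Question 2: among reachable replicas, pick the one that has replayed
        // the most of the primary's log, so the fewest acknowledged writes are lost.
        return replicas.stream()
                .filter(ReplicaStatus::reachable)
                .max(Comparator.comparingLong(ReplicaStatus::lastReplayedPosition))
                .map(ReplicaStatus::nodeId);
    }
}
```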
Performance Analysis: Commit Latency, Replica Lag, and Read Scaling
Replication changes both write and read performance.
Write latency: synchronous replication increases the commit path because the primary waits for confirmation from one or more replicas. If the network is slow or one zone is unhealthy, p95 and p99 write latency rise.
Read scale: replicas are great for read-heavy workloads such as catalogs, dashboards, and reporting. They let you separate hot reads from critical writes.
Replica lag: asynchronous replicas can trail the primary by milliseconds or seconds. That means a user could create an object successfully and then not see it immediately if the next read goes to a lagging replica.
| Metric | What to watch | Why it matters |
| --- | --- | --- |
| Commit latency | p95 and p99 write time | Replication can slow the critical path |
| Replica lag | Seconds or log positions behind primary | Determines staleness of reads |
| Read QPS per replica | Distribution of traffic | Shows whether replicas are actually offloading the primary |
| Failover time | Detection + promotion + reroute | Directly affects user-visible downtime |
For interviews, the cleanest summary is this: replication improves resilience and read scale, but it forces you to choose how much stale data and write latency you can tolerate.
The Failover Journey From Normal Writes to Recovery
flowchart TD
A[Client sends write] --> B[Primary appends WAL or binlog]
B --> C[Primary applies change locally]
C --> D[Replica receives and replays log]
D --> E{Primary healthy?}
E -->|Yes| F[Keep serving writes from primary]
E -->|No| G[Failover controller checks replica freshness]
G --> H[Promote healthiest replica]
H --> I[Route new writes to promoted node]
This diagram hides a lot of detail, but it captures the correct story for an interview: writes are ordered, replicated, evaluated for freshness, and then rerouted through a controller that promotes one healthy node.
Leader Election After Primary Failure
sequenceDiagram
participant F1 as Follower 1
participant F2 as Follower 2
participant L as Leader
participant Q as Quorum
L->>F1: Heartbeat
L->>F2: Heartbeat
Note over L: Leader crashes
F1->>F1: Heartbeat timeout
F1->>Q: Request votes term+1
F2->>Q: Request votes term+1
Q-->>F1: Majority votes granted
F1->>F2: I am new leader
F1->>F2: Begin log replication
This sequence diagram shows how a Raft-style leader election proceeds after primary failure. Follower 1 and Follower 2 detect a heartbeat timeout from the crashed leader, then both request votes for term+1 from the quorum. Follower 1 wins the majority vote, declares itself the new leader, and immediately begins log replication to Follower 2. The key takeaway is that quorum voting prevents split-brain even when multiple followers detect failure simultaneously.
Replica State Transitions
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: Heartbeat timeout
Candidate --> Leader: Wins election
Candidate --> Follower: Higher term seen
Leader --> Follower: Higher term seen
Leader --> [*]: Graceful shutdown
This state diagram captures the complete lifecycle of a Raft replica. A new node enters as a Follower and transitions to Candidate on heartbeat timeout, then either wins election to become Leader or reverts to Follower upon encountering a higher term. An active Leader similarly steps down when a higher term is observed, and exits cleanly on graceful shutdown. Term numbers are the safety primitive that prevents two nodes from simultaneously believing they hold write authority.
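As a rough sketch (heavily simplified from Raft, with elections and log replication omitted), the transitions in the diagram reduce to a small state machine in which terms only increase and a higher observed term always forces a step-down:

```java
// Simplified Raft-style role transitions; election logic and
// log replication are omitted. Illustrative only.
enum Role { FOLLOWER, CANDIDATE, LEADER }

class ReplicaRoleMachine {
    private Role role = Role.FOLLOWER;
    private long currentTerm = 0;

    // Follower -> Candidate: no heartbeat arrived within the timeout.
    void onHeartbeatTimeout() {
        if (role == Role.FOLLOWER) {
            role = Role.CANDIDATE;
            currentTerm++;              // campaign for term + 1
        }
    }

    // Candidate -> Leader: a majority of the quorum granted its vote.
    void onElectionWon() {
        if (role == Role.CANDIDATE) {
            role = Role.LEADER;
        }
    }

    // Leader/Candidate -> Follower: the safety rule from the diagram.
    // Seeing a higher term means another node holds newer authority.
    void onHigherTermObserved(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
        }
    }

    Role currentRole() { return role; }
}
```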
Real-World Applications: Catalogs, Payment Metadata, and Control Planes
Replication appears in many different forms depending on the workload.
E-commerce catalog: product reads dominate writes, so read replicas help scale traffic while the primary handles inventory updates and editorial changes.
Payment metadata systems: even if money movement uses stronger guarantees elsewhere, supporting metadata such as transaction views, audit logs, or reconciliation dashboards often needs highly available replicas for reporting and operations.
SaaS control planes: tenants expect admin dashboards to keep working during partial failures. Replicas and automated failover keep configuration reads available while preserving a well-defined primary for updates.
GitHub (MySQL read replicas for pull request listing): GitHub serves hundreds of millions of PR list and diff queries from read replicas while keeping all writes on the MySQL primary. Their 2018 multi-hour incident involved metadata divergence between primary and replica during a failover: the replica chosen for promotion had accumulated roughly 35 seconds of replication lag, meaning recent binlog entries had not yet been replayed. The lesson: replica lag at the moment of failure determines your actual RPO, not average lag during quiet periods.
Shopify (automated MySQL failover with Orchestrator): Shopify's OLTP layer uses MySQL with Orchestrator for automated leader election. Orchestrator selects the most up-to-date replica by comparing GTID (Global Transaction ID) positions, promotes it, and redirects ProxySQL in under 30 seconds. Replicas lagging more than 30 seconds at promotion time are excluded from the automatic candidate pool, a hard safety boundary between automatic and manual promotion paths.
On the primary, streaming replication requires setting the WAL level to replica, configuring a maximum number of WAL sender processes for connected standbys, and declaring which replicas must acknowledge a write before the primary responds to the client. Retaining a generous buffer of WAL segments prevents slower replicas from falling too far behind during transient network hiccups.

On the monitoring side, the primary exposes per-replica replication lag as the byte difference between the log position it has sent and the position each replica has actually replayed. A lag above a few tens of megabytes warrants investigation; a lag exceeding several hundred megabytes at the moment of primary failure is a signal to exclude that replica from automatic promotion until it catches up, since its effective RPO is now measured in significant data loss rather than milliseconds.
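As a sketch of that monitoring side (the connection details, credentials, and thresholds below are placeholders), a periodic health check can read the primary's pg_stat_replication view over JDBC and compute per-replica replay lag in bytes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Reads per-replica replay lag from a PostgreSQL primary.
// Host, credentials, and thresholds are illustrative placeholders.
public class ReplicationLagCheck {
    public static void main(String[] args) throws SQLException {
        // pg_wal_lsn_diff gives the byte distance between what the primary
        // has written and what each replica has actually replayed.
        String lagQuery =
            "SELECT application_name, " +
            "       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes " +
            "FROM pg_stat_replication";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://primary.internal:5432/app", "monitor", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(lagQuery)) {
            while (rs.next()) {
                String replica = rs.getString("application_name");
                long lagBytes = rs.getLong("replay_lag_bytes");
                // Rough bands from the text: tens of MB is worth investigating,
                // hundreds of MB should exclude the replica from automatic promotion.
                if (lagBytes > 500L * 1024 * 1024) {
                    System.out.println(replica + ": too far behind for automatic promotion");
                } else if (lagBytes > 50L * 1024 * 1024) {
                    System.out.println(replica + ": lag worth investigating (" + lagBytes + " bytes)");
                }
            }
        }
    }
}
```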
The pattern stays the same: a workload that needs durability, moderate write coordination, and meaningful read traffic benefits from replication.
Trade-offs & Failure Modes: The Price of Reliability
| Trade-off or failure mode | What goes wrong | First mitigation |
| --- | --- | --- |
| Replica lag | Users read stale data | Read recent writes from primary or use read-after-write routing |
| Split-brain | Two nodes accept writes as leader | Use quorum or external consensus |
| Slow synchronous commit | Writes become slower at peak | Replicate to one synchronous follower, others async |
| Failover to stale replica | Recent writes disappear | Promote the most up-to-date replica only |
| Manual failover playbooks | Recovery takes too long | Automate promotion and traffic switch |
The mature interview answer is not "replication solves outages." It is "replication changes the outage shape from hard downtime into a coordination problem I can manage with failover rules."
Decision Guide: Which Replication Pattern Fits the Problem?
| Situation | Recommendation |
| --- | --- |
| Read-heavy app with moderate consistency needs | Single primary with read replicas |
| Money-critical workflow | Single primary with stronger synchronous confirmation |
| Global writes from many regions | Consider multi-primary only if conflicts are acceptable or carefully resolved |
| Small startup app | Start with backup + restore, add replication when availability becomes a real requirement |
The practical interview trick is sequencing. Start simple. Add read replicas when the primary is overloaded by reads. Add automated failover when downtime costs become meaningful.
Practical Example: Evolving a URL Shortener Beyond One Database
Suppose your first URL shortener stores everything in one relational database. That is a good first answer.
But traffic grows:
- Redirect reads climb into the tens of thousands per second.
- Marketing campaigns create bursty hot links.
- The business now cares about uptime because every outage breaks customer campaigns.
The first upgrade path looks like this:
- Keep one primary for new short-link writes.
- Add read replicas for redirect lookups that do not require the newest possible state.
- Put hot redirects behind a cache.
- Add automated failover so the primary is not a single point of outage.
That sequence is interview gold because it shows evolution rather than premature complexity. It also connects directly back to System Design Interview Basics: start simple, then justify the next layer only when the bottleneck appears.
Spring Boot and HikariCP: Read Replica Routing at the Configuration Level
Spring Boot supports multiple DataSource beans, and AbstractRoutingDataSource can transparently route database calls, sending @Transactional(readOnly = true) transactions to a replica and write transactions to the primary, without changing a single line of repository or service code.
How it solves the problem: The routing decision is expressed as a single annotation (readOnly = true) and resolved at the infrastructure layer. This mirrors the replication model described in this post: reads can go to any up-to-date replica, writes must go to the primary. The service layer expresses the intent; the datasource layer enforces it.
The datasource configuration declares two separate connection pools โ one pointing at the primary host with credentials that allow writes, and one pointing at the replica host with read-only credentials. The replica pool is typically sized larger than the primary pool because replicas exist precisely to absorb high read concurrency; a primary pool of 20 connections and a replica pool of 40 connections, for example, reflects the practical read scaling benefit that horizontal replicas provide. Spring Boot's AbstractRoutingDataSource then inspects the transaction context at runtime to decide which pool each database call should draw from.
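A minimal configuration sketch of that setup is shown below. Hostnames, credentials, and pool sizes are placeholders; the routing itself uses Spring's AbstractRoutingDataSource keyed on the current transaction's readOnly flag, wrapped in a LazyConnectionDataSourceProxy so the physical connection is not fetched before the transaction attributes are known.

```java
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

import javax.sql.DataSource;
import java.util.HashMap;
import java.util.Map;

// Two connection pools, one routing datasource. Hosts, credentials,
// and pool sizes are illustrative placeholders.
@Configuration
public class ReplicaRoutingConfig {

    @Bean
    public DataSource routingDataSource() {
        HikariDataSource primary = new HikariDataSource();
        primary.setJdbcUrl("jdbc:postgresql://primary.internal:5432/app");
        primary.setUsername("app_rw");
        primary.setPassword("change-me");
        primary.setMaximumPoolSize(20);      // smaller pool: write traffic only

        HikariDataSource replica = new HikariDataSource();
        replica.setJdbcUrl("jdbc:postgresql://replica.internal:5432/app");
        replica.setUsername("app_ro");
        replica.setPassword("change-me");
        replica.setReadOnly(true);
        replica.setMaximumPoolSize(40);      // larger pool: absorbs read concurrency

        AbstractRoutingDataSource router = new AbstractRoutingDataSource() {
            @Override
            protected Object determineCurrentLookupKey() {
                // Route by the readOnly flag of the transaction in progress.
                return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                        ? "replica" : "primary";
            }
        };
        Map<Object, Object> targets = new HashMap<>();
        targets.put("primary", primary);
        targets.put("replica", replica);
        router.setTargetDataSources(targets);
        router.setDefaultTargetDataSource(primary);
        router.afterPropertiesSet();

        // Defer fetching the physical connection until the transaction
        // (and its readOnly flag) has been established.
        return new LazyConnectionDataSourceProxy(router);
    }
}
```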
In the service layer, annotating a read method with @Transactional(readOnly = true) signals that the operation can tolerate replication lag and does not require write ordering guarantees; the routing datasource maps this annotation to the replica connection pool automatically. A write method with a plain @Transactional annotation routes to the primary, ensuring correct ordering and full durability. This means a query listing orders by customer ID draws from the replica pool and offloads the primary, while a call to place a new order always reaches the primary, without any conditional routing logic in the service or repository code itself. The readOnly flag is the only application-level change needed; it expresses the replication trade-off described throughout this post: reads accept possible lag in exchange for scale, writes demand the primary for correctness.
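On the service side, a hypothetical OrderService might look like this; the Order and OrderRepository types are illustrative stand-ins, and the only replication-aware detail is the readOnly flag.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

// Illustrative domain types standing in for real entities and repositories.
record Order(long id, long customerId) {}

interface OrderRepository {
    List<Order> findByCustomerId(long customerId);
    Order save(Order order);
}

@Service
class OrderService {

    private final OrderRepository orderRepository;

    OrderService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    // Read path: tolerates replica lag, so the routing datasource
    // resolves it against the replica pool.
    @Transactional(readOnly = true)
    public List<Order> listOrdersForCustomer(long customerId) {
        return orderRepository.findByCustomerId(customerId);
    }

    // Write path: needs ordering and durability, so it always
    // resolves against the primary pool.
    @Transactional
    public Order placeOrder(Order order) {
        return orderRepository.save(order);
    }
}
```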
For a full deep-dive on AbstractRoutingDataSource, AOP-based context switching, and pgpool connection pooling for transparent replica failover, a dedicated follow-up post is planned.
Lessons Learned
- Replication and failover are separate ideas: one copies data, the other restores authority.
- Read replicas help scale reads, but they do not guarantee fresh data.
- Asynchronous replication reduces write latency but can lose the newest acknowledged data during a crash.
- Safe failover depends on leader election and freshness checks, not just health probes.
- A strong interview answer explains both availability gains and correctness risks.
TLDR: Summary & Key Takeaways
- Replication keeps multiple copies of data so the system can survive node loss.
- Failover promotes a healthy replica and reroutes traffic after primary failure.
- Synchronous replication favors safety; asynchronous replication favors speed.
- Replica lag, split-brain, and stale failover targets are the main risks to discuss.
- The best interview answers introduce replication only when reliability or read scale actually require it.
Written by Abstract Algorithms (@abstractalgorithms)