System Design Replication and Failover: Keep Services Alive When a Primary Dies
A clear guide to replicas, failover detection, replica lag, and the trade-offs behind reliable database designs.
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupting data.
TLDR: If one database going down takes your product down, replication is the next design step. Failover is how you make that redundancy usable.
Why Replication Shows Up in Every Serious System Design Conversation
In a beginner system design interview, a single database is often a perfectly fine starting point. It is easy to explain, easy to operate, and often good enough for the first version of a product.
The trouble starts when that one database becomes both the storage layer and the single point of failure. If the node crashes, the application can still be healthy, the cache can still be warm, and the API can still be reachable, but users will experience an outage because the source of truth is gone.
That is why replication usually enters the conversation right after the simple design. If you came here from System Design Interview Basics, this is the deeper follow-up to the line "add replication for reliability."
Replication answers one question: how do we keep another copy of the same data available elsewhere? Failover answers the next question: how do we decide which copy becomes authoritative when the primary fails?
| Without replication | With replication |
| --- | --- |
| One node holds all writes and all reads | A primary handles writes while replicas provide redundancy |
| Any crash can cause a full outage | A crash can trigger promotion of a healthy replica |
| Reads compete with writes on one machine | Reads can be spread across replicas |
| Recovery depends on backups alone | Recovery can use already-running replicas |
The crucial interview insight is that replication is not just about uptime. It also affects latency, read scale, consistency, recovery complexity, and operational risk.
Primary, Replicas, and the Three Questions You Must Answer
Every replicated system has to answer three practical questions.
Question 1: Where do writes go? In the simplest model, all writes go to a single primary node. The primary serializes changes and ships them to replicas.
Question 2: Where do reads go? Some systems send all reads to the primary for strong consistency. Others send read traffic to replicas to reduce load. That improves scale, but it creates the risk of stale reads when replicas lag.
Question 3: Who decides that the primary is dead? This is the core of failover. If every node can promote itself independently, split-brain becomes possible. A good design needs a leader election rule, quorum, or external coordinator.
Here is the common vocabulary:
| Term | Meaning | Why it matters |
| --- | --- | --- |
| Primary | The authoritative node for writes | Keeps write ordering simple |
| Replica | A follower that replays changes from the primary | Adds redundancy and read scale |
| Synchronous replication | Primary waits for one or more replicas before acknowledging commit | Safer, but slower writes |
| Asynchronous replication | Primary acknowledges first and replicas catch up later | Faster, but can lose recent writes on failover |
| Failover | Promote a replica and redirect traffic | Restores service after primary loss |
This is why replication is always a trade-off discussion. The more aggressively you protect against data loss, the more latency and coordination you usually add.
How Writes, Replicas, and Promotion Actually Work
At a high level, replicated databases usually follow a log-driven pattern.
- A client sends a write to the primary.
- The primary appends the change to a durable log.
- The primary applies the change locally.
- Replicas receive the change log and replay it in order.
- Reads may go to the primary, to replicas, or to both.
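Before tracing the timing, it can help to see the pattern as code. The sketch below is a minimal in-memory model, not how any particular database implements replication: the primary appends to an ordered log, replicas replay entries in order, and a flag controls whether the primary waits for replicas before acknowledging.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of log-driven replication.
// Illustrative only: no durability, no networking, no crash handling.
class LogEntry {
    final long sequence;   // position in the ordered change log
    final String change;   // e.g. "INSERT order_123"

    LogEntry(long sequence, String change) {
        this.sequence = sequence;
        this.change = change;
    }
}

class Replica {
    private long appliedUpTo = 0;                          // highest sequence replayed so far
    private final List<String> state = new ArrayList<>();

    // Followers replay the change stream strictly in order.
    void replay(LogEntry entry) {
        if (entry.sequence != appliedUpTo + 1) {
            throw new IllegalStateException("gap in replication stream");
        }
        state.add(entry.change);
        appliedUpTo = entry.sequence;
    }
}

class Primary {
    private final List<LogEntry> log = new ArrayList<>();   // stands in for the WAL/binlog
    private final List<Replica> replicas = new ArrayList<>();
    private final boolean synchronous;                       // wait for replicas before acking?

    Primary(boolean synchronous) { this.synchronous = synchronous; }

    void attach(Replica replica) { replicas.add(replica); }

    // Mirrors steps t1-t3 from the timing table below.
    void write(String change) {
        LogEntry entry = new LogEntry(log.size() + 1, change);
        log.add(entry);                                   // t1: append to the ordered log
        if (synchronous) {
            replicas.forEach(r -> r.replay(entry));       // t2: wait for replicas to apply
        }
        // t3: acknowledge the client; in asynchronous mode the
        // replicas replay the entry later (t4), after this point.
    }
}
```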
That sounds simple until you trace the timing.
| Step | What happens | Operational consequence |
| --- | --- | --- |
| t0 | Client sends INSERT order_123 | User is waiting for confirmation |
| t1 | Primary writes to its local WAL/binlog | Durability starts here |
| t2 | Primary optionally waits for replica acknowledgment | Higher safety, higher latency |
| t3 | Primary replies success to client | User sees commit complete |
| t4 | Replica applies the change | Read replicas may still be behind |
If the primary crashes between t3 and t4, the answer depends on the replication mode. With asynchronous replication, the newest write may not exist on any surviving replica. With synchronous replication, the client waited longer, but the system is more likely to preserve that write during failover.
That single timing gap is the reason replication discussions matter in interviews. A strong answer does not just say "I would add replicas." It says whether those replicas are protecting availability, read throughput, or data durability.
Deep Dive: What Makes Failover Safe Instead of Chaotic
Failover is not merely switching traffic to another box. It is a correctness problem disguised as an availability feature.
The Internals: Write-Ahead Logs, Heartbeats, and Promotion
Most production databases replicate ordered change records rather than shipping whole tables after every write. PostgreSQL uses a write-ahead log. MySQL uses binlogs. The idea is the same: replicas apply the same ordered stream of changes to converge toward primary state.
On top of that replication stream, the system needs health signals. Common signals include:
- Heartbeats between nodes.
- A replication lag metric such as seconds behind primary.
- A quorum-based lease or vote so only one node can become leader.
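As a rough sketch of how those signals feed a failover decision (the class name and thresholds below are illustrative, not taken from any particular system), a controller typically combines a heartbeat timeout with a lag bound:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative health signals for a failover controller.
// Thresholds are placeholders, not recommendations.
class HealthSignals {
    private static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(5);
    private static final Duration MAX_LAG_FOR_PROMOTION = Duration.ofSeconds(30);

    // The primary is only suspected dead after heartbeats have been
    // missing for longer than the timeout, to avoid reacting to slowness.
    boolean primarySuspectedDead(Instant lastHeartbeat, Instant now) {
        return Duration.between(lastHeartbeat, now).compareTo(HEARTBEAT_TIMEOUT) > 0;
    }

    // Replicas that are too far behind are excluded from automatic promotion.
    boolean eligibleForPromotion(Duration replicationLag) {
        return replicationLag.compareTo(MAX_LAG_FOR_PROMOTION) <= 0;
    }
}
```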
When the primary appears unhealthy, a failover controller typically asks three questions:
- Is the primary really unreachable, or just slow?
- Which replica is most up to date?
- Can we promote exactly one replica without creating split-brain?
That third question is the dangerous one. If two replicas promote themselves in different network partitions, clients may write divergent states. Merging that later is painful or impossible depending on the workload.
This is why many systems use a consensus service or a strong election protocol. Even if you do not mention every technology by name in an interview, you should explain that leader promotion needs coordination, not guesswork.
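A hedged sketch of that promotion decision might look like the following; ReplicaStatus, the vote count, and the quorum rule are simplified stand-ins for what a consensus service or election protocol would provide:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Simplified promotion decision: require a majority vote before acting,
// then promote the most up-to-date reachable replica.
record ReplicaStatus(String nodeId, long lastReplayedPosition, boolean reachable) {}

class FailoverController {
    private final int clusterSize;

    FailoverController(int clusterSize) { this.clusterSize = clusterSize; }

    Optional<String> choosePrimary(List<ReplicaStatus> replicas, int votesForFailover) {
        // Question 3: without a majority, do nothing. Refusing to promote is
        // what prevents two partitions from each electing their own leader.
        if (votesForFailover <= clusterSize / 2) {
            return Optional.empty();
        }
        // Question 2: among reachable replicas, pick the one that has replayed
        // the most of the primary's log, so the fewest acknowledged writes are lost.
        return replicas.stream()
                .filter(ReplicaStatus::reachable)
                .max(Comparator.comparingLong(ReplicaStatus::lastReplayedPosition))
                .map(ReplicaStatus::nodeId);
    }
}
```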
Performance Analysis: Commit Latency, Replica Lag, and Read Scaling
Replication changes both write and read performance.
Write latency: synchronous replication increases the commit path because the primary waits for confirmation from one or more replicas. If the network is slow or one zone is unhealthy, p95 and p99 write latency rise.
Read scale: replicas are great for read-heavy workloads such as catalogs, dashboards, and reporting. They let you separate hot reads from critical writes.
Replica lag: asynchronous replicas can trail the primary by milliseconds or seconds. That means a user could create an object successfully and then not see it immediately if the next read goes to a lagging replica.
| Metric | What to watch | Why it matters |
| --- | --- | --- |
| Commit latency | p95 and p99 write time | Replication can slow the critical path |
| Replica lag | Seconds or log positions behind primary | Determines staleness of reads |
| Read QPS per replica | Distribution of traffic | Shows whether replicas are actually offloading the primary |
| Failover time | Detection + promotion + reroute | Directly affects user-visible downtime |
For interviews, the cleanest summary is this: replication improves resilience and read scale, but it forces you to choose how much stale data and write latency you can tolerate.
The Failover Journey From Normal Writes to Recovery
flowchart TD
A[Client sends write] --> B[Primary appends WAL or binlog]
B --> C[Primary applies change locally]
C --> D[Replica receives and replays log]
D --> E{Primary healthy?}
E -->|Yes| F[Keep serving writes from primary]
E -->|No| G[Failover controller checks replica freshness]
G --> H[Promote healthiest replica]
H --> I[Route new writes to promoted node]
This diagram hides a lot of detail, but it captures the correct story for an interview: writes are ordered, replicated, evaluated for freshness, and then rerouted through a controller that promotes one healthy node.
Leader Election After Primary Failure
sequenceDiagram
participant F1 as Follower 1
participant F2 as Follower 2
participant L as Leader
participant Q as Quorum
L->>F1: Heartbeat
L->>F2: Heartbeat
Note over L: Leader crashes
F1->>F1: Heartbeat timeout
F1->>Q: Request votes term+1
F2->>Q: Request votes term+1
Q-->>F1: Majority votes granted
F1->>F2: I am new leader
F1->>F2: Begin log replication
This sequence diagram shows how a Raft-style leader election proceeds after primary failure. Follower 1 and Follower 2 detect a heartbeat timeout from the crashed leader, then both request votes for term+1 from the quorum. Follower 1 wins the majority vote, declares itself the new leader, and immediately begins log replication to Follower 2. The key takeaway is that quorum voting prevents split-brain even when multiple followers detect failure simultaneously.
Replica State Transitions
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: Heartbeat timeout
Candidate --> Leader: Wins election
Candidate --> Follower: Higher term seen
Leader --> Follower: Higher term seen
Leader --> [*]: Graceful shutdown
This state diagram captures the complete lifecycle of a Raft replica. A new node enters as a Follower and transitions to Candidate on heartbeat timeout, then either wins election to become Leader or reverts to Follower upon encountering a higher term. An active Leader similarly steps down when a higher term is observed, and exits cleanly on graceful shutdown. Term numbers are the safety primitive that prevents two nodes from simultaneously believing they hold write authority.
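As a rough sketch (heavily simplified from Raft, with elections and log replication omitted), the transitions in the diagram reduce to a small state machine in which terms only increase and a higher observed term always forces a step-down:

```java
// Simplified Raft-style role transitions; election logic and
// log replication are omitted. Illustrative only.
enum Role { FOLLOWER, CANDIDATE, LEADER }

class ReplicaRoleMachine {
    private Role role = Role.FOLLOWER;
    private long currentTerm = 0;

    // Follower -> Candidate: no heartbeat arrived within the timeout.
    void onHeartbeatTimeout() {
        if (role == Role.FOLLOWER) {
            role = Role.CANDIDATE;
            currentTerm++;              // campaign for term + 1
        }
    }

    // Candidate -> Leader: a majority of the quorum granted its vote.
    void onElectionWon() {
        if (role == Role.CANDIDATE) {
            role = Role.LEADER;
        }
    }

    // Leader/Candidate -> Follower: the safety rule from the diagram.
    // Seeing a higher term means another node holds newer authority.
    void onHigherTermObserved(long term) {
        if (term > currentTerm) {
            currentTerm = term;
            role = Role.FOLLOWER;
        }
    }

    Role currentRole() { return role; }
}
```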
Real-World Applications: Catalogs, Payment Metadata, and Control Planes
Replication appears in many different forms depending on the workload.
E-commerce catalog: product reads dominate writes, so read replicas help scale traffic while the primary handles inventory updates and editorial changes.
Payment metadata systems: even if money movement uses stronger guarantees elsewhere, supporting metadata such as transaction views, audit logs, or reconciliation dashboards often needs highly available replicas for reporting and operations.
SaaS control planes: tenants expect admin dashboards to keep working during partial failures. Replicas and automated failover keep configuration reads available while preserving a well-defined primary for updates.
GitHub (MySQL read replicas for pull request listing): GitHub serves hundreds of millions of PR list and diff queries from read replicas while keeping all writes on the MySQL primary. Their 2018 multi-hour incident involved metadata divergence between primary and replica during a failover: the replica chosen for promotion had accumulated roughly 35 seconds of replication lag, meaning recent binlog entries had not yet been replayed. The lesson: replica lag at the moment of failure determines your actual RPO, not average lag during quiet periods.
Shopify (automated MySQL failover with Orchestrator): Shopify's OLTP layer uses MySQL with Orchestrator for automated leader election. Orchestrator selects the most up-to-date replica by comparing GTID (Global Transaction ID) positions, promotes it, and redirects ProxySQL in under 30 seconds. Replicas lagging more than 30 seconds at promotion time are excluded from the automatic candidate pool, a hard safety boundary between automatic and manual promotion paths.
On the primary, streaming replication requires setting the WAL level to replica, configuring a maximum number of WAL sender processes for connected standbys, and declaring which replicas must acknowledge a write before the primary responds to the client. Retaining a generous buffer of WAL segments prevents slower replicas from falling too far behind during transient network hiccups.

On the monitoring side, the primary exposes per-replica replication lag as the byte difference between the log position it has sent and the position each replica has actually replayed. A lag above a few tens of megabytes warrants investigation; a lag exceeding several hundred megabytes at the moment of primary failure is a signal to exclude that replica from automatic promotion until it catches up, since its effective RPO is now measured in significant data loss rather than milliseconds.
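As a sketch of that monitoring side (the connection details, credentials, and thresholds below are placeholders), a periodic health check can read the primary's pg_stat_replication view over JDBC and compute per-replica replay lag in bytes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Reads per-replica replay lag from a PostgreSQL primary.
// Host, credentials, and thresholds are illustrative placeholders.
public class ReplicationLagCheck {
    public static void main(String[] args) throws SQLException {
        // pg_wal_lsn_diff gives the byte distance between what the primary
        // has written and what each replica has actually replayed.
        String lagQuery =
            "SELECT application_name, " +
            "       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes " +
            "FROM pg_stat_replication";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://primary.internal:5432/app", "monitor", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(lagQuery)) {
            while (rs.next()) {
                String replica = rs.getString("application_name");
                long lagBytes = rs.getLong("replay_lag_bytes");
                // Rough bands from the text: tens of MB is worth investigating,
                // hundreds of MB should exclude the replica from automatic promotion.
                if (lagBytes > 500L * 1024 * 1024) {
                    System.out.println(replica + ": too far behind for automatic promotion");
                } else if (lagBytes > 50L * 1024 * 1024) {
                    System.out.println(replica + ": lag worth investigating (" + lagBytes + " bytes)");
                }
            }
        }
    }
}
```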
The pattern stays the same: a workload that needs durability, moderate write coordination, and meaningful read traffic benefits from replication.
Trade-offs & Failure Modes: The Price of Reliability
| Trade-off or failure mode | What goes wrong | First mitigation |
| --- | --- | --- |
| Replica lag | Users read stale data | Read recent writes from primary or use read-after-write routing |
| Split-brain | Two nodes accept writes as leader | Use quorum or external consensus |
| Slow synchronous commit | Writes become slower at peak | Replicate to one synchronous follower, others async |
| Failover to stale replica | Recent writes disappear | Promote the most up-to-date replica only |
| Manual failover playbooks | Recovery takes too long | Automate promotion and traffic switch |
The mature interview answer is not "replication solves outages." It is "replication changes the outage shape from hard downtime into a coordination problem I can manage with failover rules."
Decision Guide: Which Replication Pattern Fits the Problem?
| Situation | Recommendation |
| --- | --- |
| Read-heavy app with moderate consistency needs | Single primary with read replicas |
| Money-critical workflow | Single primary with stronger synchronous confirmation |
| Global writes from many regions | Consider multi-primary only if conflicts are acceptable or carefully resolved |
| Small startup app | Start with backup + restore, add replication when availability becomes a real requirement |
The practical interview trick is sequencing. Start simple. Add read replicas when the primary is overloaded by reads. Add automated failover when downtime costs become meaningful.
Practical Example: Evolving a URL Shortener Beyond One Database
Suppose your first URL shortener stores everything in one relational database. That is a good first answer.
But traffic grows:
- Redirect reads climb into the tens of thousands per second.
- Marketing campaigns create bursty hot links.
- The business now cares about uptime because every outage breaks customer campaigns.
The first upgrade path looks like this:
- Keep one primary for new short-link writes.
- Add read replicas for redirect lookups that do not require the newest possible state.
- Put hot redirects behind a cache.
- Add automated failover so the primary is not a single point of outage.
That sequence is interview gold because it shows evolution rather than premature complexity. It also connects directly back to System Design Interview Basics: start simple, then justify the next layer only when the bottleneck appears.
Spring Boot and HikariCP: Read Replica Routing at the Configuration Level
Spring Boot supports multiple DataSource beans, and AbstractRoutingDataSource can transparently route database calls, sending @Transactional(readOnly = true) transactions to a replica and write transactions to the primary, without changing a single line of repository or service code.
How it solves the problem: The routing decision is expressed as a single annotation (readOnly = true) and resolved at the infrastructure layer. This mirrors the replication model described in this post: reads can go to any up-to-date replica, writes must go to the primary. The service layer expresses the intent; the datasource layer enforces it.
The datasource configuration declares two separate connection pools โ one pointing at the primary host with credentials that allow writes, and one pointing at the replica host with read-only credentials. The replica pool is typically sized larger than the primary pool because replicas exist precisely to absorb high read concurrency; a primary pool of 20 connections and a replica pool of 40 connections, for example, reflects the practical read scaling benefit that horizontal replicas provide. Spring Boot's AbstractRoutingDataSource then inspects the transaction context at runtime to decide which pool each database call should draw from.
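A minimal configuration sketch of that setup is shown below. Hostnames, credentials, and pool sizes are placeholders; the routing itself uses Spring's AbstractRoutingDataSource keyed on the current transaction's readOnly flag, wrapped in a LazyConnectionDataSourceProxy so the physical connection is not fetched before the transaction attributes are known.

```java
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

import javax.sql.DataSource;
import java.util.HashMap;
import java.util.Map;

// Two connection pools, one routing datasource. Hosts, credentials,
// and pool sizes are illustrative placeholders.
@Configuration
public class ReplicaRoutingConfig {

    @Bean
    public DataSource routingDataSource() {
        HikariDataSource primary = new HikariDataSource();
        primary.setJdbcUrl("jdbc:postgresql://primary.internal:5432/app");
        primary.setUsername("app_rw");
        primary.setPassword("change-me");
        primary.setMaximumPoolSize(20);      // smaller pool: write traffic only

        HikariDataSource replica = new HikariDataSource();
        replica.setJdbcUrl("jdbc:postgresql://replica.internal:5432/app");
        replica.setUsername("app_ro");
        replica.setPassword("change-me");
        replica.setReadOnly(true);
        replica.setMaximumPoolSize(40);      // larger pool: absorbs read concurrency

        AbstractRoutingDataSource router = new AbstractRoutingDataSource() {
            @Override
            protected Object determineCurrentLookupKey() {
                // Route by the readOnly flag of the transaction in progress.
                return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                        ? "replica" : "primary";
            }
        };
        Map<Object, Object> targets = new HashMap<>();
        targets.put("primary", primary);
        targets.put("replica", replica);
        router.setTargetDataSources(targets);
        router.setDefaultTargetDataSource(primary);
        router.afterPropertiesSet();

        // Defer fetching the physical connection until the transaction
        // (and its readOnly flag) has been established.
        return new LazyConnectionDataSourceProxy(router);
    }
}
```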
In the service layer, annotating a read method with @Transactional(readOnly = true) signals that the operation can tolerate replication lag and does not require write ordering guarantees; the routing datasource maps this annotation to the replica connection pool automatically. A write method with a plain @Transactional annotation routes to the primary, ensuring correct ordering and full durability. This means a query listing orders by customer ID draws from the replica pool and offloads the primary, while a call to place a new order always reaches the primary, without any conditional routing logic in the service or repository code itself. The readOnly flag is the only application-level change needed; it expresses the replication trade-off described throughout this post: reads accept possible lag in exchange for scale, writes demand the primary for correctness.
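On the service side, a hypothetical OrderService might look like this; the Order and OrderRepository types are illustrative stand-ins, and the only replication-aware detail is the readOnly flag.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;

// Illustrative domain types standing in for real entities and repositories.
record Order(long id, long customerId) {}

interface OrderRepository {
    List<Order> findByCustomerId(long customerId);
    Order save(Order order);
}

@Service
class OrderService {

    private final OrderRepository orderRepository;

    OrderService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    // Read path: tolerates replica lag, so the routing datasource
    // resolves it against the replica pool.
    @Transactional(readOnly = true)
    public List<Order> listOrdersForCustomer(long customerId) {
        return orderRepository.findByCustomerId(customerId);
    }

    // Write path: needs ordering and durability, so it always
    // resolves against the primary pool.
    @Transactional
    public Order placeOrder(Order order) {
        return orderRepository.save(order);
    }
}
```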
For a full deep-dive on AbstractRoutingDataSource, AOP-based context switching, and pgpool connection pooling for transparent replica failover, a dedicated follow-up post is planned.
Lessons Learned
- Replication and failover are separate ideas: one copies data, the other restores authority.
- Read replicas help scale reads, but they do not guarantee fresh data.
- Asynchronous replication reduces write latency but can lose the newest acknowledged data during a crash.
- Safe failover depends on leader election and freshness checks, not just health probes.
- A strong interview answer explains both availability gains and correctness risks.
TLDR: Summary & Key Takeaways
- Replication keeps multiple copies of data so the system can survive node loss.
- Failover promotes a healthy replica and reroutes traffic after primary failure.
- Synchronous replication favors safety; asynchronous replication favors speed.
- Replica lag, split-brain, and stale failover targets are the main risks to discuss.
- The best interview answers introduce replication only when reliability or read scale actually require it.
Written by Abstract Algorithms (@abstractalgorithms)