A Guide to Raft, Paxos, and Consensus Algorithms
How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanics behind Raft and Paxos.
TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard: correct but notoriously hard to understand. Raft is the practical standard, designed for understandability and used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.
📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard
Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.
This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.
Consensus algorithms solve this. They guarantee:
- Safety: no two nodes commit different values for the same slot
- Liveness: the cluster eventually makes progress as long as a majority of nodes are alive
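The safety guarantee rests on simple quorum arithmetic: any two majorities of the same cluster must share at least one node, so two conflicting values can never both gather a quorum. A minimal Python sketch (the `quorum` helper is illustrative, not from any Raft library):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

# Any two quorums overlap in at least one node (2q - n >= 1), so a
# node that already acknowledged one value for a slot sits in every
# other quorum and blocks a conflicting commit.
for n in (3, 5, 7):
    q = quorum(n)
    assert 2 * q - n >= 1

print(quorum(5))  # 3 votes needed in a 5-node cluster
```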
⚙️ Raft's Three Roles: Follower, Candidate, and Leader
Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:
```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : election timeout (no heartbeat)
    Candidate --> Leader : majority votes received
    Candidate --> Follower : higher term seen
    Leader --> Follower : higher term seen
```
| Role | Responsibility |
| --- | --- |
| Follower | Passive; accepts log entries and heartbeats from the leader |
| Candidate | Temporary state during an election; requests votes from peers |
| Leader | Handles all writes; sends periodic heartbeats at an interval well below the election timeout |
Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about.
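The role transitions above can be written down as a tiny state machine. This is an illustrative sketch, not a real Raft implementation; the event strings are made up for the example:

```python
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

# Allowed transitions, mirroring the state diagram above.
TRANSITIONS = {
    (Role.FOLLOWER, "election timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "majority votes"): Role.LEADER,
    (Role.CANDIDATE, "higher term seen"): Role.FOLLOWER,
    (Role.LEADER, "higher term seen"): Role.FOLLOWER,
}

def step(role: Role, event: str) -> Role:
    # Events with no listed transition leave the node where it is.
    return TRANSITIONS.get((role, event), role)

print(step(Role.FOLLOWER, "election timeout"))  # Role.CANDIDATE
```

Note there is no transition out of Leader except "higher term seen": a leader never voluntarily steps down, which is why a single strong leader makes the protocol easy to reason about.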
🔢 Leader Election: Terms, Votes, and Quorum
Raft divides time into Terms — monotonically increasing integers. Think of them as electoral cycles.
Election flow:
- A follower's election timer expires (no heartbeat from the leader).
- It increments its term, transitions to candidate, and votes for itself.
- It sends RequestVote RPCs to all other nodes.
- A node grants its vote if the candidate's term is at least as high as its own and the candidate's log is at least as up-to-date.
- The first candidate to collect a majority (floor(N/2) + 1 votes) wins and becomes leader.
```
Cluster: 5 nodes → quorum = 3 votes required
Term 7, Candidate A asks nodes B, C, D, E
B votes yes, C votes yes → A wins with 3/5 (including itself)
A sends heartbeats to all → everyone transitions to follower
```
Split votes (two candidates tie) are resolved by randomised timeouts — each follower waits a different random delay before starting an election.
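The vote-granting rule can be sketched as a pure function. All parameter names here are illustrative, and a real implementation would also persist the vote and adopt the higher term; this only shows the decision logic:

```python
def grant_vote(my_term, my_voted_for, my_last_term, my_last_index,
               cand_term, cand_id, cand_last_term, cand_last_index):
    """Should this node vote for the candidate? (illustrative sketch)"""
    if cand_term < my_term:
        return False  # stale candidate from an old term
    if cand_term == my_term and my_voted_for not in (None, cand_id):
        return False  # already voted for someone else this term
    # The candidate's log must be at least as up-to-date as ours:
    # compare last log term first, then last log index.
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)

# Higher term, equally up-to-date log: vote granted.
print(grant_vote(5, None, 3, 10, 6, "A", 3, 10))  # True
```

The log comparison is what protects committed entries: a candidate missing committed entries cannot out-rank a majority of voters, so it can never win.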
🧠 Log Replication: How Raft Keeps Data Consistent
Once a leader is elected, writes flow like this:
- Client sends the write to the leader.
- Leader appends the entry to its local log (uncommitted).
- Leader sends AppendEntries RPCs to all followers.
- Once a majority acknowledge the entry, the leader marks it committed.
- Leader applies the entry and replies to the client; followers learn the commit index in subsequent AppendEntries and apply it too.
```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    C->>L: Write X=5
    L->>F1: AppendEntries (X=5)
    L->>F2: AppendEntries (X=5)
    F1-->>L: ACK
    F2-->>L: ACK
    L->>L: Commit X=5
    L-->>C: OK
```
If a follower crashes before acknowledging, the leader retries. If the leader crashes, the new leader catches followers up using the log — ensuring no committed entry is ever lost.
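The commit decision can be sketched as follows: the leader tracks, per follower, the highest log index known to be replicated, and commits the highest index a majority has stored. The `match_index` name mirrors the Raft paper's terminology; the rest is an illustrative sketch:

```python
def committed(match_index: dict, n: int, log_len: int) -> int:
    """Highest log index replicated on a majority of an n-node cluster.
    match_index maps follower id -> highest replicated index there;
    the leader counts its own copy as one acknowledgement."""
    quorum = n // 2 + 1
    for idx in range(log_len, 0, -1):
        acks = 1 + sum(1 for m in match_index.values() if m >= idx)
        if acks >= quorum:
            return idx
    return 0  # nothing committed yet

# 5-node cluster, leader log of length 3; followers B..E have
# replicated up to indexes 3, 3, 1, 0 respectively.
print(committed({"B": 3, "C": 3, "D": 1, "E": 0}, 5, 3))  # 3
```

Notice that slow followers D and E do not block the commit: the leader plus B and C already form a quorum, which is exactly why a crashed follower only delays its own copy, never the cluster.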
⚖️ Raft vs Paxos: Simplicity vs Formal Rigor
| Dimension | Raft | Paxos |
| --- | --- | --- |
| Designed for | Understandability | Formal proof of correctness |
| Leader model | Single strong leader | Flexible, multi-proposer |
| Complexity | Lower | Higher |
| Variants | Raft, Multi-Raft | Multi-Paxos, Fast Paxos, Flexible Paxos |
| Used in production | etcd, CockroachDB, TiKV | Chubby (Google); ZooKeeper uses Zab, a similar protocol |
Raft is the answer to "I need a consensus library I can actually implement and debug." Paxos is the answer to "I need to prove this protocol is correct."
🌍 Where You'll Find Raft in the Wild
- Kubernetes: etcd (cluster state store) uses Raft. Every kubectl command that changes cluster state goes through etcd's Raft log.
- Kafka: KRaft mode (Kafka 3.x) replaces ZooKeeper with a built-in Raft-based metadata log.
- CockroachDB / TiKV: Each shard (range) runs its own Raft group to replicate data.
- Consul: Distributed service registry uses Raft for consistent key-value state.
Operational consequence: in a Raft cluster, writes stall until a quorum of nodes is reachable. If you lose more than floor(N/2) nodes simultaneously, the cluster stops accepting writes (and cannot serve linearizable reads) until you restore quorum.
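This quorum arithmetic is worth internalizing when sizing clusters. A short sketch of the availability math (the helper name is illustrative):

```python
def tolerated_failures(n: int) -> int:
    """An n-node Raft cluster keeps accepting writes while a majority
    (floor(n/2) + 1 nodes) is reachable, so it survives the loss of
    floor((n - 1) / 2) nodes."""
    return (n - 1) // 2

# A 4th node buys no extra fault tolerance over 3, which is why
# clusters are almost always sized with an odd number of nodes.
print(tolerated_failures(3))  # 1
print(tolerated_failures(4))  # 1
print(tolerated_failures(5))  # 2
```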
📌 Key Takeaways
- Consensus algorithms let a cluster agree on a value even when nodes fail.
- Raft uses a single leader per term; all writes go through it.
- Leader election is driven by term numbers, randomised timeouts, and majority voting.
- Log replication requires acknowledgement from a quorum before committing.
- Raft is the production standard (etcd, Kafka KRaft, CockroachDB); Paxos is the academic foundation.
🧩 Test Your Understanding
- What triggers a follower to start an election in Raft?
- A 5-node Raft cluster has 2 nodes unreachable. Can it still accept writes?
- Why do randomised election timeouts prevent split votes from looping forever?
- What is a "term" in Raft and why does it always increase?
Written by
Abstract Algorithms
@abstractalgorithms