A Guide to Raft, Paxos, and Consensus Algorithms
How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanisms.
TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard — correct but notoriously hard to understand. Raft is the practical standard — designed for understandability, used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.
📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard
Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.
This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.
Consensus algorithms solve this. They provide two fundamental guarantees:
- Safety: no two nodes ever commit different values for the same slot — even during a crash or network partition.
- Liveness: the cluster eventually makes progress as long as a majority of nodes are alive and can communicate.
Both Raft and Paxos satisfy these properties. Where they differ is in how readable and debuggable the protocol is for practising engineers.
🔍 Safety, Liveness, and Quorums: What Consensus Must Guarantee
The key rule that makes consensus possible is quorum — a simple majority. The cluster only commits a value once a quorum acknowledges it. You can lose floor((N-1)/2) nodes and still write, but lose one more and writes stall.
| Cluster Size | Quorum (majority) | Max node failures tolerated |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
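The quorum arithmetic above fits in a few lines of Java. This is an illustrative helper of our own (the `Quorum` class is not part of any library) — note how an even-sized cluster buys no extra fault tolerance:

```java
/** Quorum arithmetic for a cluster of n nodes (hypothetical helper, not a library API). */
public class Quorum {
    // Majority needed to commit a write
    public static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // Failures the cluster can absorb while still committing writes
    public static int maxFailures(int n) {
        return (n - 1) / 2;
    }

    public static void main(String[] args) {
        // n=4 tolerates the same single failure as n=3, but needs one more ACK
        for (int n : new int[] {3, 4, 5, 7}) {
            System.out.printf("n=%d quorum=%d tolerates=%d%n",
                    n, quorumSize(n), maxFailures(n));
        }
    }
}
```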
Both Raft and Paxos rely on quorums. Neither sacrifices safety; they differ in liveness and protocol readability.
⚙️ Raft's Three Roles: Follower, Candidate, and Leader
Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:
stateDiagram-v2
[*] --> Follower
Follower --> Candidate : election timeout (no heartbeat)
Candidate --> Leader : majority votes received
Candidate --> Follower : higher term seen
Leader --> Follower : higher term seen
| Role | Responsibility |
|---|---|
| Follower | Passive; accepts log entries and heartbeats from the leader |
| Candidate | Transient role during elections; requests votes from peers |
| Leader | Handles all writes; sends periodic heartbeats at an interval well below the election timeout |
Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about — at any given term there is at most one authoritative source of truth.
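The role transitions in the diagram above can be modeled as a tiny state machine. This is an illustrative sketch under our own naming (`RaftRoles` and its methods are not from any library):

```java
/** Minimal sketch of Raft's role transitions — illustrative only, not a real implementation. */
public class RaftRoles {
    public enum Role { FOLLOWER, CANDIDATE, LEADER }

    // Any node that sees a higher term immediately reverts to follower
    public static Role onHigherTermSeen(Role current) {
        return Role.FOLLOWER;
    }

    // An election timeout turns a follower into a candidate
    // (a candidate restarts its election; a leader ignores it)
    public static Role onElectionTimeout(Role current) {
        return current == Role.LEADER ? current : Role.CANDIDATE;
    }

    // Only a candidate that collects a majority becomes leader
    public static Role onMajorityVotes(Role current) {
        return current == Role.CANDIDATE ? Role.LEADER : current;
    }
}
```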
🔢 Leader Election: Terms, Votes, and the Role of Randomness
Raft divides time into Terms — monotonically increasing integers. Think of them as electoral cycles. A new term begins whenever a follower suspects the leader has failed.
Election flow:
- A follower's election timer expires (no heartbeat received from a leader).
- It increments its term, transitions to candidate, and votes for itself.
- It sends `RequestVote` RPCs to all other nodes.
- A node grants its vote if the candidate's term is at least as high as its own and the candidate's log is at least as up-to-date.
- The first candidate to collect `N/2 + 1` votes (including its own) wins and becomes leader for that term.
Cluster: 5 nodes → quorum = 3 votes required
Term 7, Candidate A asks nodes B, C, D, E
B votes yes, C votes yes → A wins with 3/5 (including itself)
A sends heartbeats to all → everyone transitions to follower
Split votes (two candidates tie) are resolved by randomised timeouts — each follower waits a different random delay (e.g., 150–300 ms) before starting an election. The first to time out usually wins before others wake up, breaking any tie without a coordinator.
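A minimal sketch of the randomised timer, assuming the 150–300 ms window from the text (the `ElectionTimer` class name is ours, not a library API):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of Raft's randomised election timeout — each node draws its own delay. */
public class ElectionTimer {
    static final long MIN_MS = 150;
    static final long MAX_MS = 300;

    // Independent random draws make a tie between two candidates unlikely:
    // the node with the smallest draw usually starts (and wins) the election first
    public static long nextTimeoutMs() {
        return ThreadLocalRandom.current().nextLong(MIN_MS, MAX_MS + 1);
    }

    public static void main(String[] args) {
        for (int node = 1; node <= 5; node++) {
            System.out.println("node " + node + " timeout = " + nextTimeoutMs() + " ms");
        }
    }
}
```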
📊 Raft Leader Election Sequence
sequenceDiagram
participant F1 as Follower 1
participant F2 as Follower 2
participant F3 as Follower 3
Note over F1: Election timeout fires
F1->>F1: become Candidate, term++
F1->>F2: RequestVote (term=8)
F1->>F3: RequestVote (term=8)
F2-->>F1: vote granted
F3-->>F1: vote granted
Note over F1: majority reached (3/3), becomes Leader
F1->>F2: AppendEntries heartbeat
F1->>F3: AppendEntries heartbeat
Note over F2,F3: revert to Follower
🧠 Deep Dive: How Raft Replicates Logs to Followers
Once a leader is elected, client writes flow through the AppendEntries RPC. The sequence below is the core of Raft's safety guarantee: the leader only acknowledges a write to the client after a majority of nodes have persisted it.
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: Write X=5
L->>F1: AppendEntries (X=5)
L->>F2: AppendEntries (X=5)
F1-->>L: ACK
F2-->>L: ACK
L->>L: Commit X=5
L-->>C: OK
The leader only confirms to the client after receiving ACKs from a quorum (F1 + F2). The write is durable even if the leader crashes immediately after replying.
The five-step write path:
- Client sends the write to the leader.
- Leader appends the entry to its local log (uncommitted).
- Leader broadcasts `AppendEntries` RPCs to all followers in parallel.
- Once a majority ACKs the entry, the leader marks it committed and applies it to the state machine.
- Leader replies OK to the client; followers learn of the commit on the next heartbeat.
If a follower crashes before ACKing, the leader retries indefinitely. If the leader crashes after committing, the new leader will still carry the entry — it won election precisely because it had the most up-to-date log.
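The commit rule at the heart of this path — an entry commits once a majority holds it — can be sketched as a toy ACK counter. `CommitGuard` is a hypothetical name of ours, not a class from any Raft library:

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the leader's commit guard for a single log entry (illustrative only). */
public class CommitGuard {
    private final int clusterSize;
    private final Set<String> acks = new HashSet<>();

    public CommitGuard(int clusterSize, String leaderId) {
        this.clusterSize = clusterSize;
        // The leader's own local append counts toward the quorum
        acks.add(leaderId);
    }

    /** Record a follower's ACK; returns true once the entry has reached a majority. */
    public boolean onAck(String nodeId) {
        acks.add(nodeId);
        return acks.size() >= clusterSize / 2 + 1;
    }
}
```

With a 5-node cluster, the leader plus two follower ACKs (3 of 5) is enough — the remaining followers can lag without blocking the client.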
📊 The Full Raft Write Path: From Client Request to Durable Commit
The sequence diagram shows the happy path. This flowchart adds the commit guard and the retry loop for lagging followers:
flowchart TD
A[Client sends write to Leader] --> B[Leader appends to local log uncommitted]
B --> C[Leader broadcasts AppendEntries to all followers in parallel]
C --> D{Quorum ACKs received?}
D -- No --> E[Wait and retry lagging followers]
E --> D
D -- Yes --> F[Leader marks entry committed applies to state machine]
F --> G[Leader replies OK to client]
G --> H[Followers apply on next AppendEntries heartbeat]
Writes are blocked at the commit guard until a majority respond — this is what makes Raft strongly consistent. A client that gets OK can be certain the value survived any single-node failure.
⚖️ Trade-offs & Failure Modes: Raft vs Paxos
| Dimension | Raft | Paxos |
|---|---|---|
| Designed for | Understandability | Formal proof of correctness |
| Leader model | Single strong leader | Flexible, multi-proposer |
| Complexity | Lower — one spec, one paper | Higher — many competing variants |
| Common variants | Multi-Raft (per-shard groups) | Multi-Paxos, Fast Paxos, Flexible Paxos |
| Used in production | etcd, CockroachDB, TiKV, Consul | Chubby (Google), ZooKeeper (Zab) |
Failure modes to know before you operate a Raft cluster:
- Leader isolation: A leader cut off from the majority keeps receiving client writes but cannot commit them (quorum ACKs will never arrive). The majority partition elects a new leader in a higher term. When the old leader reconnects, it sees the higher term and steps down, discarding all its uncommitted log entries. No committed data is ever lost.
- Follower lag: A slow or restarting follower doesn't block commits — the leader only needs a quorum. The lagging follower replays the log from its last known index on reconnect.
- Split-brain prevention: Raft prevents two leaders in the same term by requiring a majority vote — each node votes at most once per term, and two disjoint majorities cannot exist. An old leader in a stale term may linger briefly after a partition, but it can never commit, because commits require a quorum that has already moved to a higher term.
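The step-down rule that enforces this is small enough to sketch. `TermRule` is an illustrative model of ours, not a library API:

```java
/** Sketch of the Raft term rule: a node that observes a higher term
 *  adopts it and immediately reverts to follower (illustrative only). */
public class TermRule {
    long currentTerm;
    boolean isLeader;

    public TermRule(long term, boolean leader) {
        this.currentTerm = term;
        this.isLeader = leader;
    }

    // Invoked on every RPC or RPC response carrying a peer's term
    public void observeTerm(long peerTerm) {
        if (peerTerm > currentTerm) {
            currentTerm = peerTerm; // adopt the newer term
            isLeader = false;       // step down — at most one leader per term
        }
    }
}
```

This is exactly what happens when an isolated leader reconnects: the first message from the majority partition carries a higher term, and the stale leader demotes itself.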
🧭 Decision Guide: When to Use Raft vs Paxos vs ZooKeeper
| Situation | Recommendation |
|---|---|
| Building a new distributed store or metadata service | Use Raft — mature libraries exist (etcd/raft, hashicorp/raft); easy to debug |
| Need a formal correctness proof for a safety-critical system | Consider Paxos — richer academic tooling for proving safety properties |
| Already running ZooKeeper for coordination | Evaluate etcd as a migration target; stay if the system is stable and the team knows ZooKeeper |
| Multi-partition database (per-shard replication) | Use Multi-Raft (CockroachDB / TiKV model) — one Raft group per 64 MB range |
| System tolerates eventual consistency | Skip consensus entirely — use last-write-wins or CRDTs to avoid the per-write latency cost |
The rule of thumb: reach for Raft when you need strong consistency and want a protocol you can actually read, trace, and debug in a postmortem.
🌍 Real-World Applications: etcd, Kafka KRaft, and CockroachDB
- Kubernetes / etcd: Every `kubectl` command that changes cluster state is linearised through etcd's Raft log. A 3-node etcd quorum tolerates one node failure; production deployments typically run 5 nodes for resilience during rolling upgrades.
- Kafka KRaft mode (3.x+): Replaces ZooKeeper with a built-in Raft-based metadata log. The controller quorum (typically 3 nodes) runs Raft; brokers replicate the metadata log as non-voting observers. This eliminates the separate ZooKeeper cluster and cuts operational complexity.
- CockroachDB / TiKV: Each 64 MB data range runs its own Raft group across 3 replicas. Node failures trigger elections only for affected ranges — the rest of the cluster keeps serving normally.
- Consul: Distributed service registry uses Raft for consistent key-value state across server nodes.
Operational consequence: Writes stall until a quorum of nodes is reachable. Lose enough nodes to break quorum and the cluster stops committing writes — serving at best stale reads — until quorum is restored.
🧪 Walking Through a Kubernetes Control-Plane Write via etcd
When you run kubectl apply -f deployment.yaml, here is what happens at the consensus layer:
- The Kubernetes API server receives the request, validates it, and calls `etcd.Put("/registry/deployments/default/my-app", serializedSpec)`.
- etcd's Raft leader appends the key-value write to its log (uncommitted).
- The leader broadcasts `AppendEntries` to the other two etcd nodes in the quorum.
- Once a quorum holds the entry (the leader plus at least one follower ACK, i.e. 2 of 3), the leader commits it and applies it to the in-memory key-value state.
- etcd returns `OK` to the API server, which confirms to `kubectl`. Kubernetes controllers watching the key prefix see the change and begin reconciling: scheduling pods, updating replica sets.
If the etcd leader crashes between steps 3 and 5, a new leader is elected from the nodes with the most up-to-date logs; if the entry reached a quorum, the new leader still carries it and commits it. The API server retries and gets OK once the new leader commits.
🛠️ Apache ZooKeeper, etcd, and Hazelcast: Raft Consensus in Java Production Systems
Apache ZooKeeper is the battle-tested coordination service behind Kafka (pre-3.x), Hadoop, and HBase — it implements Zab (ZooKeeper Atomic Broadcast), a Paxos variant. etcd is the Raft-based key-value store that powers Kubernetes. Hazelcast is a Java-native in-process distributed data grid whose CP subsystem implements Raft — the lowest-friction path to adding consensus to a Spring Boot application.
Hazelcast CP subsystem — distributed leader election and linearizable counters in Java:
// Maven: com.hazelcast:hazelcast:5.3.6
// Note: with setCPMemberCount(3), the CP subsystem waits until three members
// have joined before serving requests — run three instances to form the group.
import com.hazelcast.config.*;
import com.hazelcast.core.*;
import com.hazelcast.cp.CPSubsystem;
import com.hazelcast.cp.IAtomicLong;
import com.hazelcast.cp.lock.FencedLock;

public class ConsensusDemo {
    public static void main(String[] args) {
        // Configure CP subsystem — minimum 3 members for a Raft quorum
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3); // 3-node Raft group
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        CPSubsystem cp = hz.getCPSubsystem();

        // 1. Distributed atomic counter — linearizable across all 3 members.
        //    Every incrementAndGet() goes through a full Raft round trip.
        IAtomicLong requestCounter = cp.getAtomicLong("request-counter");
        long requestId = requestCounter.incrementAndGet();
        System.out.println("Assigned ID (globally unique): " + requestId);

        // 2. Distributed leader lock — only one node holds this lock at a time.
        //    FencedLock uses Raft to ensure mutual exclusion even across network partitions.
        FencedLock leaderLock = cp.getLock("scheduler-leader-lock");
        leaderLock.lock();
        try {
            System.out.println("I am the scheduler leader — running distributed cron job");
            // Only one node in the cluster runs this block at a time
        } finally {
            leaderLock.unlock();
        }
    }
}
etcd Java client (jetcd) — reading Kubernetes cluster state written via Raft:
// Maven: io.etcd:jetcd-core:0.7.7
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import java.nio.charset.StandardCharsets;

public class DeploymentReader {
    public static void main(String[] args) {
        Client etcd = Client.builder().endpoints("http://etcd:2379").build();
        // Read the Kubernetes Deployment spec stored via Raft by the API server
        ByteSequence key = ByteSequence.from(
                "/registry/deployments/default/my-app", StandardCharsets.UTF_8);
        etcd.getKVClient().get(key).thenAccept(response ->
                response.getKvs().forEach(kv ->
                        System.out.println("Deployment spec: "
                                + kv.getValue().toString(StandardCharsets.UTF_8))
                )
        ).join(); // join() blocks for this demo; omit it to stay fully async
        etcd.close();
    }
}
Choosing between ZooKeeper, etcd, and Hazelcast:
| System | Protocol | Java integration | Best for |
|---|---|---|---|
| Apache ZooKeeper | Zab (Paxos variant) | Apache Curator client | Kafka metadata, HBase, legacy Hadoop |
| etcd | Raft | jetcd Java client | Kubernetes-native services, metadata stores |
| Hazelcast CP | Raft | In-process (same JVM) | Spring Boot services needing distributed locks/counters |
For new Spring Boot services, Hazelcast CP requires no external cluster to operate — the Raft group runs inside your application nodes. For Kubernetes-native services, jetcd reaches the existing etcd cluster directly.
For a full deep-dive on Hazelcast CP subsystem configuration, Apache Curator for ZooKeeper, and operating etcd clusters in production, a dedicated follow-up post is planned.
📚 Lessons from Running Consensus Clusters in Production
- Always run an odd number of nodes. A 4-node cluster tolerates the same 1 failure as a 3-node cluster but requires 3 ACKs instead of 2 — higher write latency with no additional fault tolerance. Use 3 or 5.
- Raft does not fix slow disks. Followers persist entries to disk before ACKing. An NVMe SSD versus a spinning disk can be the difference between 1 ms and 12 ms consensus round trips.
- Span nodes across failure domains, not just machines. A 3-node cluster with 2 nodes in AZ-A and 1 in AZ-B loses quorum if AZ-A fails. Distribute 1 node per AZ across 3 zones to survive a full AZ outage.
- Watch out for large Raft logs. etcd recommends keeping the total key-value space under 8 GB. Beyond that, snapshotting and log compaction slow recovery. Compact aggressively and monitor `etcd_mvcc_db_total_size_in_bytes`.
📌 TLDR: Summary & Key Takeaways
- Consensus algorithms let a cluster agree on a single value even when nodes fail — the primitive behind distributed databases, service registries, and metadata stores.
- Raft uses a single leader per term; all writes flow through it, making the protocol straightforward to reason about and debug.
- Leader election uses term numbers, randomised timeouts, and majority votes to elect exactly one leader per term without a coordinator.
- Log replication commits only after a quorum ACKs — clients never observe uncommitted writes.
- Paxos is the academic foundation; Raft is the production standard. For new systems, start with an existing Raft library rather than implementing from scratch.
- Always run an odd number of nodes (3 or 5) and spread them across independent failure domains.
Written by
Abstract Algorithms
@abstractalgorithms