Distributed Consensus: Paxos, Raft & Leader Election Explained
Distributed consensus algorithms enable multiple nodes in a network to agree on a single value despite failures — the foundation of fault-tolerant coordination in systems like ZooKeeper, etcd, and Chubby.
Why Consensus Algorithms Matter
Without consensus, distributed systems can’t agree on anything: who the leader is, what data was committed, or whether a transaction succeeded. Google’s Chubby lock service (based on Paxos) underpins the entire Google indexing pipeline. Kubernetes stores all cluster state in etcd (based on Raft). When consensus fails, you get split-brain — two nodes both believing they’re the leader, corrupting data. At DodaTech, similar consensus patterns coordinate replication in DodaZIP’s cloud sync infrastructure.
Plain-Language Explanation
Imagine three friends trying to decide where to eat dinner. They can’t meet in person, so they text each other. If two say “pizza” and one says “burgers,” the majority decides pizza. But what if one friend’s phone dies mid-conversation? The remaining two should still reach a decision. And what if all three send their choice at the exact same moment? They need a protocol to break the tie — someone goes first, others respond, and eventually they agree. That’s consensus.
graph TD
subgraph "Consensus Spectrum"
CP[Classic Paxos
Single value
2 phases]
MP[Multi-Paxos
Log sequence
Stable leader]
FP[Fast Paxos
1 round trip
Faster commit]
R[Raft
Leader-centric
Understandable]
Z[Zab
ZooKeeper
Atomic broadcast]
end
CP --> MP
MP --> FP
CP --> R
CP --> Z
Use1[etcd, Consul] --> R
Use2[ZooKeeper] --> Z
Use3[Google Chubby] --> CP
style R fill:#27ae60,color:#fff
style Z fill:#3498db,color:#fff
style CP fill:#e67e22,color:#fff
Paxos: The Foundation
Paxos was the first practical consensus algorithm, published by Leslie Lamport in 1989. It guarantees safety (no two nodes decide different values) under any number of failures, as long as fewer than half the nodes fail.
Basic Paxos — Two Phases
The algorithm has two phases, each with a prepare and accept step:
Phase 1 (Prepare): A proposer chooses a proposal number N and sends a Prepare(N) request to a majority of acceptors. Each acceptor promises to reject any proposal with number < N.
Phase 2 (Accept): If the proposer receives responses from a majority, it sends an Accept(N, value) request. Acceptors accept unless they’ve already promised a higher-numbered proposal.
# Simplified Paxos acceptor
class Acceptor:
def __init__(self):
self.min_proposal = 0
self.accepted_proposal = None
self.accepted_value = None
def prepare(self, proposal_id: int):
if proposal_id <= self.min_proposal:
return ("reject", self.min_proposal)
self.min_proposal = proposal_id
return ("promise", self.accepted_proposal, self.accepted_value)
def accept(self, proposal_id: int, value: str):
if proposal_id < self.min_proposal:
return ("reject", self.min_proposal)
self.min_proposal = proposal_id
self.accepted_proposal = proposal_id
self.accepted_value = value
return ("accepted",)
# Test with 3 acceptors
accs = [Acceptor() for _ in range(3)]
# Proposer sends prepare(1) to all
results = [a.prepare(1) for a in accs]
print(f"Prepare results: {results}")
# All 3 promised — send accept
results = [a.accept(1, "x=5") for a in accs]
print(f"Accept results: {results}")Output:
Prepare results: [('promise', None, None), ('promise', None, None), ('promise', None, None)]
Accept results: [('accepted',), ('accepted',), ('accepted',)]Multi-Paxos
Basic Paxos agrees on a single value. Multi-Paxos extends it to a log of values by running repeated instances with a stable leader. Once a leader is elected, it skips Phase 1 for subsequent log entries and goes straight to Accept — significantly reducing latency.
Fast Paxos
Fast Paxos reduces the common case to a single round trip by letting proposers send Accept directly, without Prepare. This works when there’s no conflicting proposal — but if a collision occurs, recovery requires an extra round.
Raft: Understandable Consensus
Raft was designed specifically to be teachable and implementable. It decomposes consensus into three subproblems: leader election, log replication, and safety.
Leader Election
Nodes are in one of three states: follower, candidate, or leader. Followers expect periodic heartbeats from the leader. If a follower receives no heartbeat within an election timeout (150–300ms randomly), it becomes a candidate, increments its term, votes for itself, and requests votes from peers.
import random, threading, time
class RaftNode:
def __init__(self, node_id, peers):
self.node_id = node_id
self.peers = peers
self.state = "follower"
self.current_term = 0
self.voted_for = None
self.log = []
self.commit_index = 0
self.last_applied = 0
self.leader_id = None
def start_election(self):
self.state = "candidate"
self.current_term += 1
self.voted_for = self.node_id
votes = 1
for peer in self.peers:
if self.request_vote(peer):
votes += 1
if votes > len(self.peers) // 2:
self.state = "leader"
self.leader_id = self.node_id
print(f"Node {self.node_id} becomes leader for term {self.current_term}")
else:
self.state = "follower"
def request_vote(self, peer_id):
# Simulated — in real Raft, this sends an RPC
return True # Assume peer grants vote
# Simulate 5-node cluster
nodes = [RaftNode(i, [j for j in range(5) if j != i]) for i in range(5)]
# Node 2 starts election
nodes[2].start_election()
print(f"Node 2 state: {nodes[2].state}")
print(f"Node 2 term: {nodes[2].current_term}")Output:
Node 2 becomes leader for term 1
Node 2 state: leader
Node 2 term: 1Log Replication
The leader appends client commands to its log, then sends AppendEntries RPCs to followers. When a majority acknowledges, the entry is committed and applied to the state machine.
Term 1: [cmd: "set x=3"]
Term 1: [cmd: "set y=5"]
Term 2: [cmd: "del x"]
↑ committedSafety
Raft guarantees that if an entry is committed in a given term, it will never be overwritten by future leaders. A candidate can only become leader if it has all committed entries — it checks this during the election by comparing log indices.
Zab: ZooKeeper Atomic Broadcast
ZooKeeper uses Zab, which is similar to Raft but designed specifically for atomic broadcast (totally ordered messages). Key differences:
- Zab guarantees that messages are delivered in order to all nodes
- Uses a “primary” (leader) that sequences all write requests
- Reads can be served by any follower for lower latency
ZooKeeper is used by Kafka, HBase, and Apache Solr for coordination.
Gossip Protocol
Gossip is a peer-to-peer communication protocol where each node periodically exchanges state with a random subset of peers. Information spreads exponentially — after O(log N) rounds, every node knows the information.
import random
class GossipNode:
def __init__(self, node_id, peers):
self.node_id = node_id
self.peers = peers
self.seen_messages = set()
def gossip(self, message: str, fanout: int = 2):
if message in self.seen_messages:
return
self.seen_messages.add(message)
targets = random.sample(
[p for p in self.peers if p != self.node_id],
min(fanout, len(self.peers) - 1)
)
for target in targets:
print(f"Node {self.node_id} → Node {target}: \"{message}\"")
nodes = [GossipNode(i, range(5)) for i in range(5)]
nodes[0].gossip("new config v2.1")Output:
Node 0 → Node 3: "new config v2.1"
Node 0 → Node 1: "new config v2.1"Quorum
A quorum is the minimum number of nodes that must participate in a decision for it to be valid. In a 5-node cluster, the quorum is 3 (simple majority). Some systems use different quorum types:
| Quorum Type | Size | Use Case |
|---|---|---|
| Simple majority | N/2 + 1 | Raft, Paxos |
| Read quorum | R nodes | Dynamo-style read repair |
| Write quorum | W nodes | Dynamo-style writes |
| Hierarchical quorum | Variable | Geo-distributed clusters |
Fault Tolerance
Fail-Stop
A node either works correctly or stops completely. This is the model Paxos and Raft assume. Easy to handle — just wait for majority.
Byzantine Fault
A node can behave arbitrarily — sending contradictory messages, lying about its state, or colluding with other faulty nodes. Byzantine Fault Tolerance (BFT) requires 3f+1 nodes to tolerate f faulty nodes. Used in blockchain consensus (PBFT, Tendermint).
CAP Theorem in Practice
Consensus algorithms operate in the CP side of the CAP theorem:
| Property | Consensus Behavior |
|---|---|
| Consistency | All nodes agree on committed values |
| Availability | Halted if quorum is lost |
| Partition Tolerance | Continue if majority partition exists |
During a network partition that splits a 5-node cluster into 3 and 2, the side with 3 nodes keeps operating (majority). The side with 2 nodes stops — it can’t form a quorum. This is CP by design.
ZooKeeper and etcd Use Cases
| System | Consensus | Use Case |
|---|---|---|
| ZooKeeper | Zab | Coordination service for Kafka, HBase, Solr |
| etcd | Raft | Kubernetes cluster state, service discovery |
| Consul | Raft | Service mesh, health checking, KV store |
| Chubby | Paxos | Google’s distributed lock service |
Common Errors
Quorum loss during rolling upgrade: Upgrading nodes one-by-one in a 3-node cluster means only 2 are operational at a time — one failure kills quorum. Use 5-node clusters for rolling upgrades.
Election timeout misconfiguration: Setting all election timeouts to the same value causes repeated split votes. Randomize timeouts (150–300ms range).
Ignoring disk I/O latency: Raft requires fsync before acknowledging writes. Slow disks increase commit latency. Use SSDs or separate WAL disks.
Lease expiration without grace: Leader leases that expire prematurely cause unnecessary elections. Add a grace period of 2× heartbeat interval.
Clock skew assumptions: Leader election relies on timeouts. If clocks drift by seconds, nodes may trigger elections incorrectly. Use NTP with tight bounds.
Mini Project
Build a consensus simulator that demonstrates leader election:
import random, threading, time
class RaftSimulator:
def __init__(self, num_nodes=5):
self.nodes = {
i: {
"state": "follower",
"term": 0,
"voted_for": None,
"leader": None,
"timeout": random.uniform(0.15, 0.5),
}
for i in range(num_nodes)
}
self.running = True
def run_election(self, node_id):
node = self.nodes[node_id]
node["state"] = "candidate"
node["term"] += 1
node["voted_for"] = node_id
votes = 1
for peer_id in self.nodes:
if peer_id != node_id:
peer = self.nodes[peer_id]
if peer["term"] < node["term"] and peer["state"] != "candidate":
peer["voted_for"] = node_id
votes += 1
if votes > len(self.nodes) // 2:
node["state"] = "leader"
node["leader"] = node_id
for peer_id in self.nodes:
self.nodes[peer_id]["leader"] = node_id
print(f"Node {node_id} elected leader for term {node['term']}")
return True
node["state"] = "follower"
return False
sim = RaftSimulator(5)
sim.run_election(2)Cross-References
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro