Skip to content
Distributed Consensus: Paxos, Raft & Leader Election Explained

Distributed Consensus: Paxos, Raft & Leader Election Explained

DodaTech Updated Jun 20, 2026 9 min read

Distributed consensus algorithms enable multiple nodes in a network to agree on a single value despite failures — the foundation of fault-tolerant coordination in systems like ZooKeeper, etcd, and Chubby.

Why Consensus Algorithms Matter

Without consensus, distributed systems can’t agree on anything: who the leader is, what data was committed, or whether a transaction succeeded. Google’s Chubby lock service (based on Paxos) underpins the entire Google indexing pipeline. Kubernetes stores all cluster state in etcd (based on Raft). When consensus fails, you get split-brain — two nodes both believing they’re the leader, corrupting data. At DodaTech, similar consensus patterns coordinate replication in DodaZIP’s cloud sync infrastructure.

Plain-Language Explanation

Imagine three friends trying to decide where to eat dinner. They can’t meet in person, so they text each other. If two say “pizza” and one says “burgers,” the majority decides pizza. But what if one friend’s phone dies mid-conversation? The remaining two should still reach a decision. And what if all three send their choice at the exact same moment? They need a protocol to break the tie — someone goes first, others respond, and eventually they agree. That’s consensus.


graph TD
    subgraph "Consensus Spectrum"
        CP[Classic Paxos
Single value
2 phases] MP[Multi-Paxos
Log sequence
Stable leader] FP[Fast Paxos
1 round trip
Faster commit] R[Raft
Leader-centric
Understandable] Z[Zab
ZooKeeper
Atomic broadcast] end CP --> MP MP --> FP CP --> R CP --> Z Use1[etcd, Consul] --> R Use2[ZooKeeper] --> Z Use3[Google Chubby] --> CP style R fill:#27ae60,color:#fff style Z fill:#3498db,color:#fff style CP fill:#e67e22,color:#fff

Paxos: The Foundation

Paxos was the first practical consensus algorithm, published by Leslie Lamport in 1989. It guarantees safety (no two nodes decide different values) under any number of failures, as long as fewer than half the nodes fail.

Basic Paxos — Two Phases

The algorithm has two phases, each with a prepare and accept step:

Phase 1 (Prepare): A proposer chooses a proposal number N and sends a Prepare(N) request to a majority of acceptors. Each acceptor promises to reject any proposal with number < N.

Phase 2 (Accept): If the proposer receives responses from a majority, it sends an Accept(N, value) request. Acceptors accept unless they’ve already promised a higher-numbered proposal.

# Simplified Paxos acceptor
class Acceptor:
    def __init__(self):
        self.min_proposal = 0
        self.accepted_proposal = None
        self.accepted_value = None

    def prepare(self, proposal_id: int):
        if proposal_id <= self.min_proposal:
            return ("reject", self.min_proposal)
        self.min_proposal = proposal_id
        return ("promise", self.accepted_proposal, self.accepted_value)

    def accept(self, proposal_id: int, value: str):
        if proposal_id < self.min_proposal:
            return ("reject", self.min_proposal)
        self.min_proposal = proposal_id
        self.accepted_proposal = proposal_id
        self.accepted_value = value
        return ("accepted",)

# Test with 3 acceptors
accs = [Acceptor() for _ in range(3)]
# Proposer sends prepare(1) to all
results = [a.prepare(1) for a in accs]
print(f"Prepare results: {results}")
# All 3 promised — send accept
results = [a.accept(1, "x=5") for a in accs]
print(f"Accept results: {results}")

Output:

Prepare results: [('promise', None, None), ('promise', None, None), ('promise', None, None)]
Accept results: [('accepted',), ('accepted',), ('accepted',)]

Multi-Paxos

Basic Paxos agrees on a single value. Multi-Paxos extends it to a log of values by running repeated instances with a stable leader. Once a leader is elected, it skips Phase 1 for subsequent log entries and goes straight to Accept — significantly reducing latency.

Fast Paxos

Fast Paxos reduces the common case to a single round trip by letting proposers send Accept directly, without Prepare. This works when there’s no conflicting proposal — but if a collision occurs, recovery requires an extra round.

Raft: Understandable Consensus

Raft was designed specifically to be teachable and implementable. It decomposes consensus into three subproblems: leader election, log replication, and safety.

Leader Election

Nodes are in one of three states: follower, candidate, or leader. Followers expect periodic heartbeats from the leader. If a follower receives no heartbeat within an election timeout (150–300ms randomly), it becomes a candidate, increments its term, votes for itself, and requests votes from peers.

import random, threading, time

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0
        self.last_applied = 0
        self.leader_id = None

    def start_election(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1
        for peer in self.peers:
            if self.request_vote(peer):
                votes += 1
        if votes > len(self.peers) // 2:
            self.state = "leader"
            self.leader_id = self.node_id
            print(f"Node {self.node_id} becomes leader for term {self.current_term}")
        else:
            self.state = "follower"

    def request_vote(self, peer_id):
        # Simulated — in real Raft, this sends an RPC
        return True  # Assume peer grants vote

# Simulate 5-node cluster
nodes = [RaftNode(i, [j for j in range(5) if j != i]) for i in range(5)]
# Node 2 starts election
nodes[2].start_election()
print(f"Node 2 state: {nodes[2].state}")
print(f"Node 2 term: {nodes[2].current_term}")

Output:

Node 2 becomes leader for term 1
Node 2 state: leader
Node 2 term: 1

Log Replication

The leader appends client commands to its log, then sends AppendEntries RPCs to followers. When a majority acknowledges, the entry is committed and applied to the state machine.

Term 1:  [cmd: "set x=3"]
Term 1:  [cmd: "set y=5"]
Term 2:  [cmd: "del x"]
          ↑ committed

Safety

Raft guarantees that if an entry is committed in a given term, it will never be overwritten by future leaders. A candidate can only become leader if it has all committed entries — it checks this during the election by comparing log indices.

Zab: ZooKeeper Atomic Broadcast

ZooKeeper uses Zab, which is similar to Raft but designed specifically for atomic broadcast (totally ordered messages). Key differences:

  • Zab guarantees that messages are delivered in order to all nodes
  • Uses a “primary” (leader) that sequences all write requests
  • Reads can be served by any follower for lower latency

ZooKeeper is used by Kafka, HBase, and Apache Solr for coordination.

Gossip Protocol

Gossip is a peer-to-peer communication protocol where each node periodically exchanges state with a random subset of peers. Information spreads exponentially — after O(log N) rounds, every node knows the information.

import random

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.seen_messages = set()

    def gossip(self, message: str, fanout: int = 2):
        if message in self.seen_messages:
            return
        self.seen_messages.add(message)
        targets = random.sample(
            [p for p in self.peers if p != self.node_id],
            min(fanout, len(self.peers) - 1)
        )
        for target in targets:
            print(f"Node {self.node_id} → Node {target}: \"{message}\"")

nodes = [GossipNode(i, range(5)) for i in range(5)]
nodes[0].gossip("new config v2.1")

Output:

Node 0 → Node 3: "new config v2.1"
Node 0 → Node 1: "new config v2.1"

Quorum

A quorum is the minimum number of nodes that must participate in a decision for it to be valid. In a 5-node cluster, the quorum is 3 (simple majority). Some systems use different quorum types:

Quorum TypeSizeUse Case
Simple majorityN/2 + 1Raft, Paxos
Read quorumR nodesDynamo-style read repair
Write quorumW nodesDynamo-style writes
Hierarchical quorumVariableGeo-distributed clusters

Fault Tolerance

Fail-Stop

A node either works correctly or stops completely. This is the model Paxos and Raft assume. Easy to handle — just wait for majority.

Byzantine Fault

A node can behave arbitrarily — sending contradictory messages, lying about its state, or colluding with other faulty nodes. Byzantine Fault Tolerance (BFT) requires 3f+1 nodes to tolerate f faulty nodes. Used in blockchain consensus (PBFT, Tendermint).

CAP Theorem in Practice

Consensus algorithms operate in the CP side of the CAP theorem:

PropertyConsensus Behavior
ConsistencyAll nodes agree on committed values
AvailabilityHalted if quorum is lost
Partition ToleranceContinue if majority partition exists

During a network partition that splits a 5-node cluster into 3 and 2, the side with 3 nodes keeps operating (majority). The side with 2 nodes stops — it can’t form a quorum. This is CP by design.

ZooKeeper and etcd Use Cases

SystemConsensusUse Case
ZooKeeperZabCoordination service for Kafka, HBase, Solr
etcdRaftKubernetes cluster state, service discovery
ConsulRaftService mesh, health checking, KV store
ChubbyPaxosGoogle’s distributed lock service

Common Errors

  1. Quorum loss during rolling upgrade: Upgrading nodes one-by-one in a 3-node cluster means only 2 are operational at a time — one failure kills quorum. Use 5-node clusters for rolling upgrades.

  2. Election timeout misconfiguration: Setting all election timeouts to the same value causes repeated split votes. Randomize timeouts (150–300ms range).

  3. Ignoring disk I/O latency: Raft requires fsync before acknowledging writes. Slow disks increase commit latency. Use SSDs or separate WAL disks.

  4. Lease expiration without grace: Leader leases that expire prematurely cause unnecessary elections. Add a grace period of 2× heartbeat interval.

  5. Clock skew assumptions: Leader election relies on timeouts. If clocks drift by seconds, nodes may trigger elections incorrectly. Use NTP with tight bounds.

Practice Questions

1. What problem does the consensus algorithm solve?
It ensures multiple nodes agree on a single value despite network delays, message loss, and node failures. Without consensus, distributed systems can’t safely coordinate.
2. How does Raft handle a network partition?
The majority partition elects a new leader (if needed) and continues. The minority partition stops processing writes because it can’t form a quorum. When the partition heals, the minority node catches up.
3. What’s the difference between Paxos and Raft?
Raft is designed for understandability — it decomposes consensus into leader election, log replication, and safety. Paxos was designed first but is notoriously hard to implement correctly. Raft is used by etcd and Consul; Paxos (via Chubby) by Google.
4. Challenge: Implement a Raft leader election in Python.
Build a simulation with 5 nodes where each node has a randomized election timeout (150–500ms). When a candidate wins an election, it sends heartbeats every 50ms. If a follower misses 3 heartbeats, it starts a new election.

Mini Project

Build a consensus simulator that demonstrates leader election:

import random, threading, time

class RaftSimulator:
    def __init__(self, num_nodes=5):
        self.nodes = {
            i: {
                "state": "follower",
                "term": 0,
                "voted_for": None,
                "leader": None,
                "timeout": random.uniform(0.15, 0.5),
            }
            for i in range(num_nodes)
        }
        self.running = True

    def run_election(self, node_id):
        node = self.nodes[node_id]
        node["state"] = "candidate"
        node["term"] += 1
        node["voted_for"] = node_id
        votes = 1
        for peer_id in self.nodes:
            if peer_id != node_id:
                peer = self.nodes[peer_id]
                if peer["term"] < node["term"] and peer["state"] != "candidate":
                    peer["voted_for"] = node_id
                    votes += 1
        if votes > len(self.nodes) // 2:
            node["state"] = "leader"
            node["leader"] = node_id
            for peer_id in self.nodes:
                self.nodes[peer_id]["leader"] = node_id
            print(f"Node {node_id} elected leader for term {node['term']}")
            return True
        node["state"] = "follower"
        return False

sim = RaftSimulator(5)
sim.run_election(2)

Cross-References

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro