Distributed Systems — CAP Theorem, Consistency Models, Consensus Algorithms Explained
A distributed system is a collection of independent computers that appear to users as a single coherent system, coordinating through message passing to achieve a common goal.
Why Distributed Systems Matter
Every major service you use is a distributed system. Google Search runs across millions of servers. WhatsApp delivers billions of messages daily through geographically distributed clusters. Netflix streams to 200+ million devices using a global CDN, multiple data centers, and microservices. Understanding distributed systems is essential because almost every modern application is distributed — even a simple web app typically involves a browser, CDN, load balancer, application servers, and a database cluster.
Plain-Language Explanation
Imagine you’re writing a shared document with five colleagues. Each person has a copy on their laptop. When someone makes a change, the document must sync. If two people edit the same paragraph simultaneously, whose version wins? If the network goes down, can everyone keep editing locally and merge later?
This is the fundamental challenge of distributed systems: multiple independent computers, no shared memory, unreliable networks, and different clocks — yet the system must appear consistent and available. The solutions to these problems form the core of distributed systems theory.
graph TD
subgraph "Distributed System"
N1[Node 1] <--> N2[Node 2]
N2 <--> N3[Node 3]
N1 <--> N3
end
N1 --> T1[Transaction
Write x=5]
N2 --> T2[Transaction
Read x=?]
N3 --> T3[Transaction
Write x=10]
subgraph "Challenges"
CP[Network Partitions]
CS[Clock Skew]
CO[Consensus]
end
style N1 fill:#3498db,color:#fff
style N2 fill:#e67e22,color:#fff
style N3 fill:#27ae60,color:#fff
style CP fill:#e74c3c,color:#fff
style CS fill:#e74c3c,color:#fff
style CO fill:#e74c3c,color:#fff
CAP Theorem
The CAP theorem states that a distributed data store can only guarantee two of three properties simultaneously:
Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
Availability: Every request receives a non-error response, without guarantee it contains the most recent write.
Partition Tolerance: The system continues operating despite network partitions (messages lost or delayed between nodes).
In practice, network partitions are inevitable, so you choose between CP (consistency + partition tolerance) and AP (availability + partition tolerance). Traditional databases (MongoDB, HBase) are CP — they prefer consistency over availability during partitions. DNS and CDNs are AP — they serve potentially stale data rather than returning errors.
Consistency Models
Strong consistency: After a write completes, all subsequent reads return that value. Like a single-node database. Expensive in distributed systems because all nodes must agree before acknowledging.
Eventual consistency: Given enough time without updates, all replicas converge to the same value. DNS is eventually consistent — it can take up to 48 hours for changes to propagate worldwide.
Causal consistency: Operations that are causally related are seen in the same order by all nodes. Concurrent operations can be seen in different orders. Consistency Models are explored in depth on the dedicated page.
Consensus Algorithms
How do multiple nodes agree on a value? This is the consensus problem.
Paxos
The original consensus algorithm. A proposer suggests a value, acceptors vote, and if a majority accepts, the value is chosen. Paxos is correct but notoriously hard to understand and implement.
Raft
Designed to be understandable. Raft divides consensus into subproblems:
Leader election: Nodes vote for a leader. The leader handles all client requests.
Log replication: The leader appends entries to its log and replicates to followers. Once a majority confirms, the entry is committed.
Safety: If a leader crashes, the new leader is guaranteed to have all committed entries from previous terms.
# Simplified Raft consensus simulation
import random, time, threading
class RaftNode:
def __init__(self, node_id: int, peers: list[int]):
self.node_id = node_id
self.peers = peers
self.state = "follower"
self.current_term = 0
self.voted_for = None
self.log = []
self.commit_index = 0
def start_election(self):
self.state = "candidate"
self.current_term += 1
self.voted_for = self.node_id
votes = 1 # Vote for self
for peer in self.peers:
# Simulate requesting vote
if random.random() < 0.8: # 80% chance of yes
votes += 1
majority = len(self.peers) // 2 + 1
if votes >= majority:
self.state = "leader"
return True
self.state = "follower"
return False
# Simulate 5-node cluster
nodes = [RaftNode(i, [j for j in range(5) if j != i]) for i in range(5)]
# One node becomes leader
for node in nodes:
if node.start_election():
print(f"Node {node.node_id} elected leader for term {node.current_term}")
breakExpected output (randomized):
Node 2 elected leader for term 1Clock Synchronization
Distributed systems rely on time, but computer clocks drift. NTP (Network Time Protocol) synchronizes clocks to within milliseconds. Logical clocks (Lamport timestamps, vector clocks) track causality without wall clock time.
# Lamport logical clock
class LamportClock:
def __init__(self):
self.time = 0
def tick(self):
"""Increment on internal event"""
self.time += 1
return self.time
def send_event(self):
"""Include timestamp in outgoing message"""
self.time += 1
return self.time
def receive_event(self, remote_time: int):
"""Update on receiving message"""
self.time = max(self.time, remote_time) + 1
return self.time
clock_a = LamportClock()
clock_b = LamportClock()
msg_time = clock_a.send_event() # A sends message
print(f"A sends at time: {msg_time}")
clock_b.receive_event(msg_time) # B receives
print(f"B after receiving: {clock_b.time}")
clock_b.tick() # B has an internal event
print(f"B after internal event: {clock_b.time}")Expected output:
A sends at time: 1
B after receiving: 2
B after internal event: 3Common Mistakes
Ignoring network partitions: Assuming the network is reliable. In practice, packets are lost, delayed, and reordered. Design for partition tolerance from day one.
Assuming all nodes have the same time: Clock drift is real. A 1-second drift per day is normal for commodity hardware. Always use logical clocks for ordering.
Not handling partial failures: A remote call can fail slowly (hang until timeout). Always set timeouts and use circuit breakers.
Believing CAP is a one-time choice: Different operations can have different CAP tradeoffs. A social media platform might use strong consistency for payments and eventual consistency for likes.
Ignoring the cost of coordination: Strong consistency requires every node to communicate before responding. This adds latency that may be unacceptable for global applications.
Practice Questions
What does the CAP theorem state? A distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Since partitions are inevitable, you choose between CP and AP.
How does Raft differ from Paxos? Raft is designed for understandability, with clear roles (leader, follower, candidate) and subproblems (leader election, log replication, safety). Paxos is mathematically correct but notoriously hard to implement.
What problem does a logical clock solve? Physical clocks drift. Logical clocks (Lamport timestamps) provide event ordering without relying on synchronized wall clocks.
What is the difference between strong and eventual consistency? Strong consistency guarantees all reads return the latest write. Eventual consistency guarantees convergence over time but allows stale reads temporarily.
Why is consensus expensive in distributed systems? Consensus requires multiple rounds of communication between nodes (at least 2 round trips for Paxos/Raft) and a majority of nodes to agree. This adds latency proportional to network distance.
Mini Project
Simulate a leader election in a distributed cluster:
import random, time
class Cluster:
def __init__(self, size: int):
self.nodes = {i: {"alive": True, "term": 0} for i in range(size)}
self.leader = None
def simulate(self, rounds: int = 5):
for r in range(rounds):
print(f"\n--- Round {r+1} ---")
# Randomly fail a node
failed = random.choice(list(self.nodes.keys()))
self.nodes[failed]["alive"] = False
print(f"Node {failed} failed")
# Leader election on alive nodes
alive = [n for n, s in self.nodes.items() if s["alive"]]
if not alive:
print("All nodes dead")
continue
# Each alive node votes
votes = {}
for node in alive:
vote = random.choice(alive)
votes[vote] = votes.get(vote, 0) + 1
winner = max(votes, key=votes.get)
majority = len(alive) // 2 + 1
if votes[winner] >= majority:
self.leader = winner
print(f"Node {winner} elected with {votes[winner]}/{len(alive)} votes")
else:
print("No leader elected — split vote")
# Recover the failed node
self.nodes[failed]["alive"] = True
cluster = Cluster(5)
cluster.simulate(3)Cross-References
- Consistency Models
- System Design Overview
- Event-Driven Architecture
- Microservices Patterns
- Database Sharding
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro