Database Sharding — Horizontal Partitioning Strategies with Examples
Database sharding is a horizontal partitioning technique that splits a large database into smaller, independent databases called shards, each holding a subset of the data, to achieve near-linear scalability.
Why Sharding Matters
A single database server has hard limits — CPU, memory, disk IOPS, and storage capacity. Instagram stores billions of photos. GitHub has millions of repositories. These datasets don’t fit on one machine. Sharding distributes data across hundreds of database servers, each handling a fraction of the load. Without sharding, the database becomes the bottleneck for any growing application. Facebook’s TAO shard serves billions of objects across thousands of servers.
Plain-Language Explanation
Imagine a library with one bookshelf that has 100 shelves. As the library grows to a million books, that bookshelf can’t hold them all. Sharding is like building separate rooms, each with its own bookshelf. Room A holds books where the title starts with A-F, Room B holds G-L, and so on. When someone asks for a book, the librarian (your application) computes which room it’s in and goes directly there. No room has all the books, so searching within a room is fast.
But what happens when Room A fills up? You need to split it into A-C and D-F, and move books around. That’s rebalancing — the hard part of sharding.
graph TD
App[Application] --> Router[Shard Router]
Router --> Shard0[Shard 0
users 1-1000]
Router --> Shard1[Shard 1
users 1001-2000]
Router --> Shard2[Shard 2
users 2001-3000]
Router --> ShardN[Shard N
users N*1000+1...]
App --> Config[Shard Map]
Config --> Router
style Shard0 fill:#3498db,color:#fff
style Shard1 fill:#e67e22,color:#fff
style Shard2 fill:#27ae60,color:#fff
style ShardN fill:#9b59b6,color:#fff
style Config fill:#e74c3c,color:#fff
Sharding Strategies
Hash-Based Sharding
The most common approach. Hash the shard key (e.g., user_id) and take modulo N (number of shards) to determine which shard stores the data.
def get_shard(user_id: int, num_shards: int = 16) -> int:
return hash(user_id) % num_shards
print(f"User 42 -> shard {get_shard(42)}")
print(f"User 999 -> shard {get_shard(999)}")
# Output:
# User 42 -> shard 6
# User 999 -> shard 11Pros: Even distribution of data. Simple to implement.
Cons: When N changes (add/remove shards), every record’s shard assignment changes — massive data migration.
Range-Based Sharding
Divide data by value ranges, e.g., user_id 1-1000 → Shard 0, 1001-2000 → Shard 1.
SHARD_RANGES = [
(0, 0, 1000),
(1, 1001, 2000),
(2, 2001, 3000),
]
def get_shard(user_id: int) -> int:
for shard_id, start, end in SHARD_RANGES:
if start <= user_id <= end:
return shard_id
raise ValueError(f"No shard for user {user_id}")Pros: Easy to add new shards (increase the range). Good for ordered data like time-series.
Cons: Hot spots — the most recent shard gets all the writes. Data distribution is uneven if the key isn’t uniform.
Directory-Based Sharding
Maintain a lookup table that maps keys to shards. The mapping can be changed independently of the data.
# shard_map = {"user:42": 3, "user:999": 7, ...}
def get_shard_from_directory(key: str) -> int:
# In production: query a distributed config store (etcd, ZooKeeper)
return shard_map.get(key, 0)Pros: Full flexibility. Can move data between shards without changing the algorithm.
Cons: The lookup table is a single point of failure and can become a bottleneck at scale.
Rebalancing Challenges
When you add new shards, existing data must be redistributed. For hash-based sharding, this means rehashing everything. Approaches to handle this:
Virtual sharding (consistent hashing): Each physical shard manages many virtual shards. When adding a new shard, only a subset of virtual shards move. Minimizes data movement.
Live migration: Move data shard-by-shard while serving traffic. Expensive but necessary for high-availability systems.
Maintenance window: Take the database offline, redistribute data, bring back online. Simplest but causes downtime.
# Consistent hashing example
import hashlib
class ConsistentHash:
def __init__(self, nodes: list[str], replicas: int = 3):
self.replicas = replicas
self.ring = {}
for node in nodes:
for i in range(replicas):
key = self._hash(f"{node}:{i}")
self.ring[key] = node
self.sorted_keys = sorted(self.ring.keys())
def _hash(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def get_node(self, key: str) -> str:
if not self.ring:
return None
hash_val = self._hash(key)
for ring_key in self.sorted_keys:
if hash_val <= ring_key:
return self.ring[ring_key]
return self.ring[self.sorted_keys[0]]
# 3 physical nodes, each with 3 virtual replicas
ring = ConsistentHash(["db1", "db2", "db3"])
for user_id in [1, 42, 100, 999, 5000]:
node = ring.get_node(f"user:{user_id}")
print(f"User {user_id} -> {node}")Shard Key Selection
Choosing the right shard key is the most important decision. A bad shard key creates hot spots (one shard gets most traffic) or makes cross-shard queries expensive.
Good shard keys: user_id, customer_id, tenant_id — natural high-cardinality fields that distribute evenly.
Bad shard keys: status field (few values), date (hot spot on today’s data), country (if 80% of users are in one country).
Common Mistakes
Choosing the wrong shard key: Once data is sharded, changing the shard key requires massive migration. Choose a high-cardinality, evenly distributed key.
Cross-shard queries: Joining data across shards is slow (query all shards + merge). Design your schema so queries only need one shard.
Sharding too early: A single PostgreSQL instance handles millions of records. Sharding adds complexity — only shard when you’ve exhausted vertical scaling and read replicas.
Ignoring shard imbalance: Over time, some shards grow faster than others. Monitor shard sizes and have a rebalancing plan.
No backup strategy: If one shard fails, 1/N of your data is unavailable. Each shard needs its own backup and replication.
Practice Questions
What is the main tradeoff of hash-based sharding? Even distribution is excellent, but adding/removing shards requires rehashing all data. Consistent hashing mitigates this.
How does consistent hashing reduce data movement during rebalancing? Each physical node maps to multiple virtual nodes on the ring. Adding a node transfers only the virtual nodes that fall between the new node and its predecessor on the ring.
Why is cross-shard joining expensive? The query must run against every shard and merge results. This gets slower as shards increase. Design to keep related data on the same shard.
What happens if a shard’s storage fills up? The shard becomes read-only or fails on writes. You must either split the shard, add more storage, or redistribute data.
When should you use directory-based sharding? When you need fine-grained control over data placement, or when the shard key space is sparse (many possible keys, few active).
Mini Project
Simulate a sharded user database with hash-based routing:
import hashlib, json
SHARDS = 4
users_db = {i: {} for i in range(SHARDS)}
def shard_for(user_id: int) -> int:
return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % SHARDS
def add_user(user_id: int, name: str, email: str):
shard = shard_for(user_id)
users_db[shard][user_id] = {"name": name, "email": email}
print(f"User {user_id} stored in shard {shard}")
def get_user(user_id: int):
shard = shard_for(user_id)
user = users_db[shard].get(user_id)
return user, shard
# Add users
add_user(1, "Alice", "alice@example.com")
add_user(2, "Bob", "bob@example.com")
add_user(100, "Charlie", "charlie@example.com")
add_user(999, "Diana", "diana@example.com")
# Retrieve and show which shard each is on
for uid in [1, 2, 100, 999]:
user, shard = get_user(uid)
print(f"User {uid} ({user['name']}) is on shard {shard}")Expected output:
User 1 stored in shard 1
User 2 stored in shard 2
User 100 stored in shard 0
User 999 stored in shard 2
User 1 (Alice) is on shard 1
User 2 (Bob) is on shard 2
User 100 (Charlie) is on shard 0
User 999 (Diana) is on shard 2Cross-References
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro