Learn System: Database Sharding — Horizontal Partitioning Strategies with Examples

Database Sharding — Horizontal Partitioning Strategies with Examples

DodaTech Updated Jun 15, 2026 6 min read

Database sharding is a horizontal partitioning technique that splits a large database into smaller, independent databases called shards, each holding a subset of the data, to achieve near-linear scalability.

Why Sharding Matters

A single database server has hard limits — CPU, memory, disk IOPS, and storage capacity. Instagram stores billions of photos. GitHub has millions of repositories. These datasets don’t fit on one machine. Sharding distributes data across hundreds of database servers, each handling a fraction of the load. Without sharding, the database becomes the bottleneck for any growing application. Facebook’s TAO shard serves billions of objects across thousands of servers.

Plain-Language Explanation

Imagine a library with one bookshelf that has 100 shelves. As the library grows to a million books, that bookshelf can’t hold them all. Sharding is like building separate rooms, each with its own bookshelf. Room A holds books where the title starts with A-F, Room B holds G-L, and so on. When someone asks for a book, the librarian (your application) computes which room it’s in and goes directly there. No room has all the books, so searching within a room is fast.

But what happens when Room A fills up? You need to split it into A-C and D-F, and move books around. That’s rebalancing — the hard part of sharding.


graph TD
    App[Application] --> Router[Shard Router]
    Router --> Shard0[Shard 0
users 1-1000]
    Router --> Shard1[Shard 1
users 1001-2000]
    Router --> Shard2[Shard 2
users 2001-3000]
    Router --> ShardN[Shard N
users N*1000+1...]
    App --> Config[Shard Map]
    Config --> Router
    style Shard0 fill:#3498db,color:#fff
    style Shard1 fill:#e67e22,color:#fff
    style Shard2 fill:#27ae60,color:#fff
    style ShardN fill:#9b59b6,color:#fff
    style Config fill:#e74c3c,color:#fff

Sharding Strategies

Hash-Based Sharding

The most common approach. Hash the shard key (e.g., user_id) and take modulo N (number of shards) to determine which shard stores the data.

def get_shard(user_id: int, num_shards: int = 16) -> int:
    return hash(user_id) % num_shards

print(f"User 42 -> shard {get_shard(42)}")
print(f"User 999 -> shard {get_shard(999)}")
# Output:
# User 42 -> shard 6
# User 999 -> shard 11

Pros: Even distribution of data. Simple to implement.

Cons: When N changes (add/remove shards), every record’s shard assignment changes — massive data migration.

Range-Based Sharding

Divide data by value ranges, e.g., user_id 1-1000 → Shard 0, 1001-2000 → Shard 1.

SHARD_RANGES = [
    (0, 0, 1000),
    (1, 1001, 2000),
    (2, 2001, 3000),
]

def get_shard(user_id: int) -> int:
    for shard_id, start, end in SHARD_RANGES:
        if start <= user_id <= end:
            return shard_id
    raise ValueError(f"No shard for user {user_id}")

Pros: Easy to add new shards (increase the range). Good for ordered data like time-series.

Cons: Hot spots — the most recent shard gets all the writes. Data distribution is uneven if the key isn’t uniform.

Directory-Based Sharding

Maintain a lookup table that maps keys to shards. The mapping can be changed independently of the data.

# shard_map = {"user:42": 3, "user:999": 7, ...}
def get_shard_from_directory(key: str) -> int:
    # In production: query a distributed config store (etcd, ZooKeeper)
    return shard_map.get(key, 0)

Pros: Full flexibility. Can move data between shards without changing the algorithm.

Cons: The lookup table is a single point of failure and can become a bottleneck at scale.

Rebalancing Challenges

When you add new shards, existing data must be redistributed. For hash-based sharding, this means rehashing everything. Approaches to handle this:

Virtual sharding (consistent hashing): Each physical shard manages many virtual shards. When adding a new shard, only a subset of virtual shards move. Minimizes data movement.

Live migration: Move data shard-by-shard while serving traffic. Expensive but necessary for high-availability systems.

Maintenance window: Take the database offline, redistribute data, bring back online. Simplest but causes downtime.

# Consistent hashing example
import hashlib

class ConsistentHash:
    def __init__(self, nodes: list[str], replicas: int = 3):
        self.replicas = replicas
        self.ring = {}
        for node in nodes:
            for i in range(replicas):
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        if not self.ring:
            return None
        hash_val = self._hash(key)
        for ring_key in self.sorted_keys:
            if hash_val <= ring_key:
                return self.ring[ring_key]
        return self.ring[self.sorted_keys[0]]

# 3 physical nodes, each with 3 virtual replicas
ring = ConsistentHash(["db1", "db2", "db3"])
for user_id in [1, 42, 100, 999, 5000]:
    node = ring.get_node(f"user:{user_id}")
    print(f"User {user_id} -> {node}")

Shard Key Selection

Choosing the right shard key is the most important decision. A bad shard key creates hot spots (one shard gets most traffic) or makes cross-shard queries expensive.

Good shard keys: user_id, customer_id, tenant_id — natural high-cardinality fields that distribute evenly.

Bad shard keys: status field (few values), date (hot spot on today’s data), country (if 80% of users are in one country).

Common Mistakes

Choosing the wrong shard key: Once data is sharded, changing the shard key requires massive migration. Choose a high-cardinality, evenly distributed key.
Cross-shard queries: Joining data across shards is slow (query all shards + merge). Design your schema so queries only need one shard.
Sharding too early: A single PostgreSQL instance handles millions of records. Sharding adds complexity — only shard when you’ve exhausted vertical scaling and read replicas.
Ignoring shard imbalance: Over time, some shards grow faster than others. Monitor shard sizes and have a rebalancing plan.
No backup strategy: If one shard fails, 1/N of your data is unavailable. Each shard needs its own backup and replication.

Practice Questions

What is the main tradeoff of hash-based sharding? Even distribution is excellent, but adding/removing shards requires rehashing all data. Consistent hashing mitigates this.
How does consistent hashing reduce data movement during rebalancing? Each physical node maps to multiple virtual nodes on the ring. Adding a node transfers only the virtual nodes that fall between the new node and its predecessor on the ring.
Why is cross-shard joining expensive? The query must run against every shard and merge results. This gets slower as shards increase. Design to keep related data on the same shard.
What happens if a shard’s storage fills up? The shard becomes read-only or fails on writes. You must either split the shard, add more storage, or redistribute data.
When should you use directory-based sharding? When you need fine-grained control over data placement, or when the shard key space is sparse (many possible keys, few active).

Mini Project

Simulate a sharded user database with hash-based routing:

import hashlib, json

SHARDS = 4
users_db = {i: {} for i in range(SHARDS)}

def shard_for(user_id: int) -> int:
    return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % SHARDS

def add_user(user_id: int, name: str, email: str):
    shard = shard_for(user_id)
    users_db[shard][user_id] = {"name": name, "email": email}
    print(f"User {user_id} stored in shard {shard}")

def get_user(user_id: int):
    shard = shard_for(user_id)
    user = users_db[shard].get(user_id)
    return user, shard

# Add users
add_user(1, "Alice", "alice@example.com")
add_user(2, "Bob", "bob@example.com")
add_user(100, "Charlie", "charlie@example.com")
add_user(999, "Diana", "diana@example.com")

# Retrieve and show which shard each is on
for uid in [1, 2, 100, 999]:
    user, shard = get_user(uid)
    print(f"User {uid} ({user['name']}) is on shard {shard}")

Expected output:

User 1 stored in shard 1
User 2 stored in shard 2
User 100 stored in shard 0
User 999 stored in shard 2
User 1 (Alice) is on shard 1
User 2 (Bob) is on shard 2
User 100 (Charlie) is on shard 0
User 999 (Diana) is on shard 2

Cross-References

Previous Consistency Models — Strong, Eventual, Causal Consistency and CRDTs Explained Next CDN Deep Dive — Edge Servers, Cache Invalidation, and Geo-Routing Explained

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse System Design & Architecture