Design Pastebin: Architecture for Text Sharing
Designing Pastebin means building a service where users paste text, get a shareable URL, and optionally set expiration — while handling millions of pastes and reads per day with low latency.
What You’ll Learn
You’ll master write-heavy vs read-heavy tradeoffs, blob storage architecture, URL generation strategies, expiration policies, and rate limiting. You’ll build a complete Pastebin architecture.
Why This Problem Matters
Pastebin and similar services (GitHub Gist, hastebin) handle billions of pastes. The core challenge is balancing durability (don’t lose pastes) with cost (storage is expensive) and performance (pastes must load fast). At DodaTech, similar blob storage patterns handle encrypted data in DodaZIP and signature caches in Durga Antivirus Pro.
System Design Learning Path
flowchart LR
A[System Design Overview] --> B[URL Shortener]
B --> C[Pastebin]
C --> D{You Are Here}
D --> E[Video Streaming]
D --> F[Ride Sharing]
style D fill:#f90,color:#fff
Requirements
Functional:
- Create a paste with text content, optional expiration, optional password
- Read a paste by its unique URL
- List recent pastes for a user (optional auth)
- Report/flag inappropriate content (optional)
Non-functional:
- High durability (no lost pastes)
- Fast reads (< 100ms for cached pastes)
- Support pastes up to 10MB
- Scale: 1M pastes/day, 10M reads/day
System Architecture
flowchart TB
User[User] -->|Create/Read| LB[Load Balancer]
LB --> Web[Web Servers]
Web --> Cache[(Redis Cache)]
Web --> MetaDB[(Metadata DB - SQL)]
Web --> Blob[(Blob Store - S3)]
Web --> Analytics[(Analytics Pipeline)]
Web --> RateLimiter[Rate Limiter]
Cache --> MetaDB
RateLimiter --> Redis2[(Redis - Counter)]
Web -->|Generate URL| URLGen[URL Generator]
Read-Heavy vs Write-Heavy
| Aspect | Read-Heavy (Posts) | Write-Heavy (Analytics) |
|---|---|---|
| Ratio | 10:1 reads to writes | 1:10 reads to writes |
| Cache Strategy | Aggressive caching (CDN + Redis) | Minimal caching, batch writes |
| DB Pattern | Read replicas, denormalized | Append-only, columnar store |
| Optimization | Fast reads, tolerate slow writes | Fast ingestion, tolerate slow reads |
Pastebin is read-heavy — pastes are shared and viewed many times, but written once.
URL Generation
import hashlib, base64, time
def generate_url(content: str, salt: str = "") -> str:
# Hash content + timestamp + salt for uniqueness
unique = f"{content}{time.time_ns()}{salt}"
hash_bytes = hashlib.sha256(unique.encode()).digest()
# Base62 encode first 8 bytes for 11-character URL
encoded = base64.b64encode(hash_bytes[:8]).decode().rstrip("=")
# Replace + and / with URL-safe chars
return encoded.replace("+", "A").replace("/", "B")[:8]
def validate_url(short: str) -> bool:
return len(short) == 8 and all(c.isalnum() for c in short)
# Examples
for content in ["Hello World", "Code snippet", "Long document"]:
url = generate_url(content)
print(f"Content: '{content}' → URL: {url}")Output:
Content: 'Hello World' → URL: wT7XpB2k
Content: 'Code snippet' → URL: mR9sK4nL
Content: 'Long document' → URL: fG5hJ8qWCollision Handling
Even with SHA-256, 8-character Base62 URLs have 62^8 ≈ 218 trillion combinations — collisions are astronomically unlikely. But to be safe:
- Check if generated URL exists in DB
- If collision detected, append a counter and re-hash
- Use a distributed ID (Snowflake) as the seed instead
Storage Architecture
| Data Type | Storage | Reason |
|---|---|---|
| Paste content | Blob store (S3, GCS) | Cheaper than DB for large text |
| Metadata | SQL (PostgreSQL) | Relational queries, indexes |
| Cache | Redis | Fast reads for hot pastes |
| Analytics | ClickHouse or timescale | High write throughput for views |
Metadata Schema
CREATE TABLE pastes (
id BIGSERIAL PRIMARY KEY,
short_url VARCHAR(8) UNIQUE NOT NULL,
user_id BIGINT REFERENCES users(id),
title VARCHAR(255),
content_url TEXT NOT NULL, -- S3 key
content_length INT NOT NULL,
content_type VARCHAR(50) DEFAULT 'text/plain',
password_hash VARCHAR(255),
created_at TIMESTAMP DEFAULT NOW(),
expires_at TIMESTAMP,
view_count INT DEFAULT 0,
is_flagged BOOLEAN DEFAULT FALSE
);
CREATE INDEX idx_short_url ON pastes(short_url);
CREATE INDEX idx_user_id ON pastes(user_id);
CREATE INDEX idx_expires_at ON pastes(expires_at);Expiration Policies
| Policy | Implementation | Storage Saving |
|---|---|---|
| TTL-based | Set expires_at column, periodic cleanup | High (cold data removed) |
| LRU eviction | Evict least recently viewed pastes | Medium |
| Size-based | Pastes > 1MB expire in 24h | Low |
| Never expire | Pro accounts, permanent pastes | None |
from datetime import datetime, timedelta
class ExpirationManager:
def __init__(self, db_connection):
self.db = db_connection
def cleanup_expired(self):
query = """
DELETE FROM pastes
WHERE expires_at IS NOT NULL
AND expires_at < NOW()
RETURNING short_url, content_url
"""
expired = self.db.execute(query)
for paste in expired:
self._delete_from_blob(paste['content_url'])
self._evict_from_cache(paste['short_url'])
return len(expired)
def _delete_from_blob(self, url: str):
print(f"Deleting blob: {url}")
def _evict_from_cache(self, short_url: str):
print(f"Evicting cache: paste:{short_url}")
manager = ExpirationManager(None)
count = manager.cleanup_expired()
print(f"Cleaned up {count} expired pastes")Rate Limiting
Pastebin needs rate limiting to prevent abuse (spam, DDoS):
import time
from collections import defaultdict
class SlidingWindowRateLimiter:
def __init__(self, max_requests: int = 10, window_seconds: int = 60):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.requests = defaultdict(list)
def allow(self, client_id: str) -> bool:
now = time.time()
window_start = now - self.window_seconds
client_requests = self.requests[client_id]
# Remove old requests
self.requests[client_id] = [t for t in client_requests if t > window_start]
if len(self.requests[client_id]) >= self.max_requests:
return False
self.requests[client_id].append(now)
return True
limiter = SlidingWindowRateLimiter(max_requests=5, window_seconds=60)
client = "user-42"
for i in range(7):
allowed = limiter.allow(client)
print(f"Request {i+1}: {'✓ allowed' if allowed else '✗ rate limited'}")CDN for Content Delivery
Pastebin benefits from CDN for cached paste content:
| CDN Use | Cache Hit | Benefit |
|---|---|---|
| Paste HTML pages | High (pastes shared widely) | Reduced origin load |
| Static assets (CSS, JS) | Very high | Fast page load |
| API responses | Medium (depends on popularity) | Lower latency |
Hot pastes (millions of views) should be cached at the CDN edge. Expired pastes invalidate cache via CDN purge API.
Common Errors
Storing content in SQL directly: Large text blobs in relational databases kill query performance. Always store content in a blob store, with metadata in SQL.
No content deduplication: If the same text is pasted 1000 times, you store it 1000 times. Hash content and store once with a reference counter.
URL collision at scale: With 1M pastes/day, after 5 years you have ~1.8B URLs — 8-char Base62 may have collisions. Monitor and expand to 10 chars proactively.
Forgetting expiration cleanup: Without a background job deleting expired pastes, storage grows unbounded. Run cleanup every 15 minutes.
No abuse prevention: Without rate limiting, a single script can create millions of pastes (spam, malware hosting). Always rate limit per IP and per user.
Mini Project
Build a simple Pastebin API:
import hashlib, json, time
from datetime import datetime, timedelta
class SimplePastebin:
def __init__(self):
self.store = {}
self.metadata = {}
def create(self, content: str, expires_in_hours: int = 24) -> str:
content_hash = hashlib.md5(content.encode()).hexdigest()[:8]
timestamp = str(time.time_ns())
url = hashlib.sha256(f"{content_hash}{timestamp}".encode()).hexdigest()[:8]
self.store[url] = content
self.metadata[url] = {
"created_at": datetime.now().isoformat(),
"expires_at": (datetime.now() + timedelta(hours=expires_in_hours)).isoformat(),
"size": len(content),
}
return url
def get(self, url: str) -> dict:
if url not in self.store:
return {"error": "Not found"}
meta = self.metadata[url]
if datetime.fromisoformat(meta["expires_at"]) < datetime.now():
del self.store[url]
del self.metadata[url]
return {"error": "Expired"}
return {"content": self.store[url], "metadata": meta}
api = SimplePastebin()
url = api.create("def hello():\n print('Hello, Pastebin!')", expires_in_hours=1)
print(f"Paste URL: {url}")
paste = api.get(url)
print(f"Content: {paste['content'][:50]}...")
print(f"Expires: {paste['metadata']['expires_at']}")Cross-References
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro