Rate Limiting — Token Bucket, Leaky Bucket, and Sliding Window Algorithms Explained
Rate limiting is a technique that controls the number of requests a client can make to an API within a specific time window, preventing abuse, ensuring fair usage, and protecting backend resources from overload.
Why Rate Limiting Matters
Without rate limiting, a single malicious or buggy client can saturate your API servers and deny service to everyone else. Twitter’s API rate limits ensure that no single app consumes more than its share. GitHub’s rate limit prevents scrapers from crawling all public repositories in minutes. Rate limiting is the first line of defense against DDoS attacks, brute-force login attempts, and runaway billing from misconfigured clients. API gateways like Kong, NGINX, and AWS API Gateway all include built-in rate limiting.
Plain-Language Explanation
Imagine a water park with a popular waterslide. Without any control, everyone rushes the slide at once — long lines, safety hazards, unhappy guests. The park installs a gate that lets one person through every 10 seconds. The line still forms, but the slide operates smoothly at its designed capacity.
In API terms, the waterslide is your server. Each rider is an API request. The gate is your rate limiter. It ensures your server processes requests at a sustainable rate, regardless of how many arrive. Clients that exceed the limit get a 429 Too Many Requests response.
graph LR
Client[Client] --> RL[Rate Limiter]
RL -->|Under limit| API[API Server]
RL -->|Over limit| Blocked[429 Too Many Requests]
RL --> Redis[(Redis
Counter Storage)]
subgraph Rate Limiting Algorithms
TB[Token Bucket]
LB[Leaky Bucket]
SW[Sliding Window]
end
RL --> TB
RL --> LB
RL --> SW
style RL fill:#e74c3c,color:#fff
style API fill:#27ae60,color:#fff
style Blocked fill:#95a5a6,color:#fff
Token Bucket Algorithm
A bucket holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is denied. The bucket has a maximum capacity (burst size).
import time
class TokenBucket:
def __init__(self, rate: float, capacity: int):
self.rate = rate # Tokens added per second
self.capacity = capacity # Maximum tokens
self.tokens = capacity # Start full
self.last_refill = time.time()
def allow(self) -> bool:
now = time.time()
# Refill tokens based on elapsed time
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
self.last_refill = now
if self.tokens >= 1:
self.tokens -= 1
return True
return False
bucket = TokenBucket(rate=10, capacity=20) # 10 req/s, burst up to 20
for i in range(25):
allowed = bucket.allow()
print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
time.sleep(0.05) # 50ms between requestsExpected output:
Request 1: ✓ Allowed
...
Request 20: ✓ Allowed (burst exhausted)
Request 21: ✓ Allowed (refilled during delay)
...Leaky Bucket Algorithm
Requests enter a queue (bucket) and leak out at a fixed rate for processing. If the bucket overflows, excess requests are rejected.
import time
from collections import deque
class LeakyBucket:
def __init__(self, capacity: int, leak_rate: float):
self.capacity = capacity # Max queue size
self.leak_rate = leak_rate # Requests processed per second
self.queue = deque()
self.last_leak = time.time()
def allow(self) -> bool:
now = time.time()
# Process (leak) requests at the fixed rate
elapsed = now - self.last_leak
processed = int(elapsed * self.leak_rate)
for _ in range(min(processed, len(self.queue))):
self.queue.popleft()
self.last_leak = now
if len(self.queue) < self.capacity:
self.queue.append(time.time())
return True
return False
bucket = LeakyBucket(capacity=5, leak_rate=2)
for i in range(10):
allowed = bucket.allow()
print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
time.sleep(0.1)Expected output:
Request 1: ✓ Allowed
Request 2: ✓ Allowed
Request 3: ✓ Allowed
Request 4: ✓ Allowed
Request 5: ✓ Allowed
Request 6: ✗ Blocked
Request 7: ✗ BlockedSliding Window Log
Track a timestamp for each request in a sorted set. When a new request arrives, remove timestamps older than the window, then check if the count exceeds the limit.
import time
class SlidingWindowLog:
def __init__(self, limit: int, window: int):
self.limit = limit # Max requests
self.window = window # Window size in seconds
self.log = [] # Sorted request timestamps
def allow(self) -> bool:
now = time.time()
cutoff = now - self.window
# Remove timestamps outside the window
self.log = [t for t in self.log if t > cutoff]
self.log.append(now)
return len(self.log) <= self.limit
limiter = SlidingWindowLog(limit=5, window=10) # 5 requests per 10 seconds
for i in range(8):
allowed = limiter.allow()
print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
time.sleep(0.5)Expected output:
Request 1: ✓ Allowed
Request 2: ✓ Allowed
Request 3: ✓ Allowed
Request 4: ✓ Allowed
Request 5: ✓ Allowed
Request 6: ✗ Blocked
Request 7: ✗ Blocked
Request 8: ✗ BlockedRate Limiting in API Gateways
NGINX rate limiting configuration:
http {
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
server {
location /api/ {
limit_req zone=api_limit burst=20 nodelay;
proxy_pass http://backend;
}
}
}This limits each client IP to 10 requests per second with a burst of 20. Beyond that, requests are queued (burst) or rejected.
Common Mistakes
Rate limiting based on IP only: NAT and shared IPs cause false positives. Use API keys or authenticated user IDs as the rate limit key.
No burst allowance: Pure rate limiting that blocks immediately creates bad UX. Allow short bursts with token bucket’s capacity.
Inconsistent rate limit headers: Always return
X-RateLimit-Limit,X-RateLimit-Remaining, andX-RateLimit-Resetso clients can back off intelligently.Global rate limit without per-endpoint limits: A search endpoint and a login endpoint have different cost profiles. Different limits per endpoint.
Rate limiting in only one layer: Apply rate limiting at both the API gateway and application layer for defense in depth.
Practice Questions
What is the difference between token bucket and leaky bucket? Token bucket controls the rate by replenishing tokens — bursts are allowed up to bucket capacity. Leaky bucket controls the rate by processing at a fixed rate — the queue smooths bursts but adds latency.
Why return 429 instead of dropping connections silently? 429 includes a
Retry-Afterheader, letting clients know when to retry. Dropping silently causes clients to retry immediately, making the problem worse.How does sliding window improve over fixed window? Fixed window can allow 2x traffic at window boundaries (all requests at the end of one window and start of the next). Sliding window smooths this by checking a rolling time frame.
What is a distributed rate limiter? A rate limiter that uses a shared store (Redis) so all API server instances enforce the same limits. Prevents a client from hitting 10 requests per second by rotating across 10 server instances.
How do you handle rate limiting for authenticated vs unauthenticated users? Authenticated users get higher limits tied to their API key. Unauthenticated users share a pool of capacity or get strict IP-based limits to prevent abuse.
Mini Project
Build a Redis-backed distributed rate limiter:
import redis, time, hashlib
r = redis.Redis(decode_responses=True)
def check_rate_limit(user_id: str, limit: int = 10, window: int = 60) -> bool:
key = f"ratelimit:{user_id}:{int(time.time() / window)}"
count = r.incr(key)
if count == 1:
r.expire(key, window)
return count <= limit
# Simulate requests
for i in range(15):
allowed = check_rate_limit("user_42", limit=10, window=60)
print(f"Request {i+1}: {'✓' if allowed else '✗'}")
time.sleep(0.01)Expected output:
Request 1: ✓
...
Request 10: ✓
Request 11: ✗
...Cross-References
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro