Learn System: Rate Limiting — Token Bucket, Leaky Bucket, and Sliding Window Algorithms Explained

Rate Limiting — Token Bucket, Leaky Bucket, and Sliding Window Algorithms Explained

DodaTech Updated Jun 15, 2026 6 min read

Rate limiting is a technique that controls the number of requests a client can make to an API within a specific time window, preventing abuse, ensuring fair usage, and protecting backend resources from overload.

Why Rate Limiting Matters

Without rate limiting, a single malicious or buggy client can saturate your API servers and deny service to everyone else. Twitter’s API rate limits ensure that no single app consumes more than its share. GitHub’s rate limit prevents scrapers from crawling all public repositories in minutes. Rate limiting is the first line of defense against DDoS attacks, brute-force login attempts, and runaway billing from misconfigured clients. API gateways like Kong, NGINX, and AWS API Gateway all include built-in rate limiting.

Plain-Language Explanation

Imagine a water park with a popular waterslide. Without any control, everyone rushes the slide at once — long lines, safety hazards, unhappy guests. The park installs a gate that lets one person through every 10 seconds. The line still forms, but the slide operates smoothly at its designed capacity.

In API terms, the waterslide is your server. Each rider is an API request. The gate is your rate limiter. It ensures your server processes requests at a sustainable rate, regardless of how many arrive. Clients that exceed the limit get a 429 Too Many Requests response.


graph LR
    Client[Client] --> RL[Rate Limiter]
    RL -->|Under limit| API[API Server]
    RL -->|Over limit| Blocked[429 Too Many Requests]
    RL --> Redis[(Redis
Counter Storage)]
    subgraph Rate Limiting Algorithms
        TB[Token Bucket]
        LB[Leaky Bucket]
        SW[Sliding Window]
    end
    RL --> TB
    RL --> LB
    RL --> SW
    style RL fill:#e74c3c,color:#fff
    style API fill:#27ae60,color:#fff
    style Blocked fill:#95a5a6,color:#fff

Token Bucket Algorithm

A bucket holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is denied. The bucket has a maximum capacity (burst size).

import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # Tokens added per second
        self.capacity = capacity  # Maximum tokens
        self.tokens = capacity    # Start full
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Refill tokens based on elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s, burst up to 20

for i in range(25):
    allowed = bucket.allow()
    print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
    time.sleep(0.05)  # 50ms between requests

Expected output:

Request 1: ✓ Allowed
...
Request 20: ✓ Allowed (burst exhausted)
Request 21: ✓ Allowed (refilled during delay)
...

Leaky Bucket Algorithm

Requests enter a queue (bucket) and leak out at a fixed rate for processing. If the bucket overflows, excess requests are rejected.

import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity      # Max queue size
        self.leak_rate = leak_rate    # Requests processed per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Process (leak) requests at the fixed rate
        elapsed = now - self.last_leak
        processed = int(elapsed * self.leak_rate)
        for _ in range(min(processed, len(self.queue))):
            self.queue.popleft()
        self.last_leak = now

        if len(self.queue) < self.capacity:
            self.queue.append(time.time())
            return True
        return False

bucket = LeakyBucket(capacity=5, leak_rate=2)

for i in range(10):
    allowed = bucket.allow()
    print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
    time.sleep(0.1)

Expected output:

Request 1: ✓ Allowed
Request 2: ✓ Allowed
Request 3: ✓ Allowed
Request 4: ✓ Allowed
Request 5: ✓ Allowed
Request 6: ✗ Blocked
Request 7: ✗ Blocked

Sliding Window Log

Track a timestamp for each request in a sorted set. When a new request arrives, remove timestamps older than the window, then check if the count exceeds the limit.

import time

class SlidingWindowLog:
    def __init__(self, limit: int, window: int):
        self.limit = limit           # Max requests
        self.window = window         # Window size in seconds
        self.log = []                # Sorted request timestamps

    def allow(self) -> bool:
        now = time.time()
        cutoff = now - self.window
        # Remove timestamps outside the window
        self.log = [t for t in self.log if t > cutoff]
        self.log.append(now)
        return len(self.log) <= self.limit

limiter = SlidingWindowLog(limit=5, window=10)  # 5 requests per 10 seconds

for i in range(8):
    allowed = limiter.allow()
    print(f"Request {i+1}: {'✓ Allowed' if allowed else '✗ Blocked'}")
    time.sleep(0.5)

Expected output:

Request 1: ✓ Allowed
Request 2: ✓ Allowed
Request 3: ✓ Allowed
Request 4: ✓ Allowed
Request 5: ✓ Allowed
Request 6: ✗ Blocked
Request 7: ✗ Blocked
Request 8: ✗ Blocked

Rate Limiting in API Gateways

NGINX rate limiting configuration:

http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        location /api/ {
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://backend;
        }
    }
}

This limits each client IP to 10 requests per second with a burst of 20. Beyond that, requests are queued (burst) or rejected.

Common Mistakes

Rate limiting based on IP only: NAT and shared IPs cause false positives. Use API keys or authenticated user IDs as the rate limit key.
No burst allowance: Pure rate limiting that blocks immediately creates bad UX. Allow short bursts with token bucket’s capacity.
Inconsistent rate limit headers: Always return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset so clients can back off intelligently.
Global rate limit without per-endpoint limits: A search endpoint and a login endpoint have different cost profiles. Different limits per endpoint.
Rate limiting in only one layer: Apply rate limiting at both the API gateway and application layer for defense in depth.

Practice Questions

What is the difference between token bucket and leaky bucket? Token bucket controls the rate by replenishing tokens — bursts are allowed up to bucket capacity. Leaky bucket controls the rate by processing at a fixed rate — the queue smooths bursts but adds latency.
Why return 429 instead of dropping connections silently? 429 includes a Retry-After header, letting clients know when to retry. Dropping silently causes clients to retry immediately, making the problem worse.
How does sliding window improve over fixed window? Fixed window can allow 2x traffic at window boundaries (all requests at the end of one window and start of the next). Sliding window smooths this by checking a rolling time frame.
What is a distributed rate limiter? A rate limiter that uses a shared store (Redis) so all API server instances enforce the same limits. Prevents a client from hitting 10 requests per second by rotating across 10 server instances.
How do you handle rate limiting for authenticated vs unauthenticated users? Authenticated users get higher limits tied to their API key. Unauthenticated users share a pool of capacity or get strict IP-based limits to prevent abuse.

Mini Project

Build a Redis-backed distributed rate limiter:

import redis, time, hashlib

r = redis.Redis(decode_responses=True)

def check_rate_limit(user_id: str, limit: int = 10, window: int = 60) -> bool:
    key = f"ratelimit:{user_id}:{int(time.time() / window)}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window)
    return count <= limit

# Simulate requests
for i in range(15):
    allowed = check_rate_limit("user_42", limit=10, window=60)
    print(f"Request {i+1}: {'✓' if allowed else '✗'}")
    time.sleep(0.01)

Expected output:

Request 1: ✓
...
Request 10: ✓
Request 11: ✗
...

Cross-References

Previous Message Queues — Pub/Sub Architecture with RabbitMQ, Kafka, and SQS Explained Next Microservices Design Patterns — Service Discovery, Circuit Breaker, Saga, CQRS Explained

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse System Design & Architecture