
Rate Limiting and Throttling for APIs — Complete Guide

DodaTech Updated 2026-06-23 7 min read

Rate limiting is a technique that controls the number of requests a client can make to an API within a specific time window. It prevents abuse, ensures fair usage, and protects backend resources from overload.

What You'll Learn

You will learn four rate limiting algorithms (token bucket, leaky bucket, fixed window, and sliding window), along with implementation examples, header conventions, and distributed rate limiting strategies.

Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can consume all API resources and degrade performance for every other user. Rate limiting helps protect against application-layer DDoS attacks, brute-force login attempts, and runaway scripts. It also enables tiered pricing models in which premium customers get higher limits.

Real-World Use

DodaTech implements rate limiting across all products. The Doda Browser sync API allows 100 requests per minute for free users and 1000 for premium users, the DodaZIP update service uses sliding window limits for fair bandwidth distribution, and Durga Antivirus Pro threat reporting applies strict limits on submission endpoints.

Rate Limiting Learning Path

flowchart LR
  A[API Design Basics] --> B[Why Rate Limit?]
  B --> C[Token Bucket]
  B --> D[Leaky Bucket]
  B --> E[Fixed Window]
  B --> F[Sliding Window]
  C --> G[Implementation]
  D --> G
  E --> G
  F --> G
  B:::current

  classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px

Prerequisites

You should understand RESTful API Design Best Practices and API Security Best Practices. Familiarity with Authentication Patterns (JWT, OAuth2, API Keys) is helpful. Basic knowledge of Redis or in-memory data stores is recommended.

Algorithm 1: Token Bucket

The token bucket algorithm maintains a bucket of tokens that refills at a fixed rate.

How It Works

  • A bucket holds a maximum number of tokens (burst limit)
  • Tokens are added at a steady rate (refill rate)
  • Each request consumes one token
  • If the bucket is empty, the request is rejected

import time
import threading

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity  # Max tokens (burst)
        self.tokens = capacity   # Current tokens
        self.refill_rate = refill_rate  # Tokens per second
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last_refill = now

            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage: 10 requests per second, burst up to 20
bucket = TokenBucket(capacity=20, refill_rate=10)

for i in range(25):
    if bucket.allow_request():
        print(f"Request {i+1}: Allowed")
    else:
        print(f"Request {i+1}: Rate limited")

Expected output:

Request 1: Allowed
Request 2: Allowed
...
Request 20: Allowed
Request 21: Rate limited
Request 22: Rate limited
...
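To watch the refill behaviour without waiting in real time, the sketch below restates the same idea with an injectable clock. The `clock` parameter, the `FakeClock` helper, and the `seconds_until_next_token` method are assumptions added for illustration; they are not part of the class above.

```python
class FakeClock:
    """Hypothetical, manually advanced clock used only for this demo."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ClockedTokenBucket:
    """Token bucket that reads the time from an injected callable."""
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def seconds_until_next_token(self):
        """How long a rejected client would have to wait for one token."""
        self._refill()
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.refill_rate


clock = FakeClock()
bucket = ClockedTokenBucket(capacity=20, refill_rate=10, clock=clock)

# A burst of 25 requests at the same instant: only the 20 stored tokens are spent
allowed_in_burst = sum(bucket.allow_request() for _ in range(25))
wait = bucket.seconds_until_next_token()

# Half a second later, 0.5 s * 10 tokens/s = 5 tokens have been refilled
clock.advance(0.5)
allowed_after_wait = sum(bucket.allow_request() for _ in range(10))

print(allowed_in_burst, wait, allowed_after_wait)  # 20 0.1 5
```

A value like the one returned by `seconds_until_next_token` is one possible source for a Retry-After hint, although how it is surfaced to clients is a design choice.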

Algorithm 2: Leaky Bucket

The leaky bucket algorithm processes requests at a constant rate, smoothing out traffic bursts.

How It Works

  • Requests enter a queue (bucket)
  • Requests are processed at a fixed rate (leak rate)
  • If the queue is full, new requests are rejected

from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.queue = deque()
        self.leak_rate = leak_rate  # Requests processed per second
        self.last_leak = time.time()

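    def allow_request(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaks = int(elapsed * self.leak_rate)

        if leaks > 0:
            for _ in range(min(leaks, len(self.queue))):
                self.queue.popleft()
            # Advance the marker only by the time the whole leaks account for.
            # Resetting it to `now` on every call would discard fractional
            # progress, so a busy bucket would never drain.
            self.last_leak += leaks / self.leak_rate

        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True
        return False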
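The smoothing effect can be made visible by computing when each queued request would actually leave the bucket. The function below is a minimal sketch under stated assumptions: `schedule_departures` is a hypothetical helper, arrival times are given in seconds and sorted, and `capacity` counts requests still waiting to leave.

```python
def schedule_departures(arrival_times, capacity, leak_rate):
    """Return a departure time per arrival, or None if the request is dropped.

    arrival_times must be sorted in ascending order (seconds).
    """
    interval = 1.0 / leak_rate   # spacing between two processed requests
    pending = []                 # departure times of requests still queued
    next_free = 0.0              # earliest time the next request may leave
    result = []
    for t in arrival_times:
        # Requests whose departure time has been reached have leaked out
        pending = [d for d in pending if d > t]
        if len(pending) >= capacity:
            result.append(None)  # bucket full: reject
            continue
        departure = max(t, next_free)
        next_free = departure + interval
        pending.append(departure)
        result.append(departure)
    return result


# Five requests arrive together at t=0, one more at t=2.
# The bucket holds 3 waiting requests and leaks 2 per second.
schedule = schedule_departures([0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
                               capacity=3, leak_rate=2)
print(schedule)  # [0.0, 0.5, 1.0, 1.5, None, 2.0]
```

The burst at t=0 leaves the bucket evenly spaced at half-second intervals, and the fifth request is dropped because three requests are already waiting.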

Algorithm 3: Fixed Window

The fixed window algorithm counts requests in discrete time windows.

import time

class FixedWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = time.time()
        self.count = 0

    def allow_request(self):
        now = time.time()

        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0

        if self.count < self.limit:
            self.count += 1
            return True
        return False

# Usage: 100 requests per minute
limiter = FixedWindow(limit=100, window_seconds=60)

Problem: At window boundaries, a client can briefly double the allowed rate. If the limit is 100 per minute, a client can send 100 requests at 00:59 and 100 more at 01:00, which is 200 requests in roughly one second.
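The boundary problem can be reproduced with explicit timestamps instead of the wall clock. `ClockedFixedWindow` below is a hypothetical variant of the class above that takes `now` as an argument purely so the scenario is deterministic.

```python
class ClockedFixedWindow:
    """Fixed window counter that receives the current time explicitly (demo only)."""
    def __init__(self, limit, window_seconds, start=0.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = start
        self.count = 0

    def allow_request(self, now):
        # Start a fresh window once the current one has expired
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = ClockedFixedWindow(limit=100, window_seconds=60)

# 100 requests at t=59s (end of the first window), 100 more at t=60s (new window)
late = sum(limiter.allow_request(59.0) for _ in range(100))
early = sum(limiter.allow_request(60.0) for _ in range(100))

print(late + early)  # 200 requests accepted within one second
```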

Algorithm 4: Sliding Window

The sliding window algorithm solves the fixed window boundary problem by tracking request timestamps within a rolling window. The trade-off is memory: this version stores one timestamp per accepted request.

from collections import deque
import time

class SlidingWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow_request(self):
        now = time.time()

        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] < now - self.window_seconds:
            self.timestamps.popleft()

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
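To check that the rolling window really closes the boundary gap, the 00:59 / 01:00 scenario can be replayed with explicit timestamps. `ClockedSlidingWindow` is a hypothetical variant of the class above that takes `now` as an argument so the run is deterministic.

```python
from collections import deque


class ClockedSlidingWindow:
    """Sliding window log that receives the current time explicitly (demo only)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow_request(self, now):
        # Drop timestamps that have fallen out of the rolling window
        while self.timestamps and self.timestamps[0] < now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


limiter = ClockedSlidingWindow(limit=100, window_seconds=60)

late = sum(limiter.allow_request(59.0) for _ in range(100))   # fills the window
early = sum(limiter.allow_request(60.0) for _ in range(100))  # all rejected
recovered = limiter.allow_request(119.5)  # the t=59s entries have expired

print(late, early, recovered)  # 100 0 True
```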

Express.js Rate Limiting with express-rate-limit

const express = require("express");
const rateLimit = require("express-rate-limit");

const app = express();

const limiter = rateLimit({
  windowMs: 60 * 1000,  // 1 minute
  max: 100,             // 100 requests per minute
  message: {
    error: "Too many requests",
    retryAfter: "60 seconds"
  },
  legacyHeaders: true,  // Send X-RateLimit-* headers (v6+; older versions use `headers: true`)
  keyGenerator: (req) => {
    return req.ip;       // Rate limit by IP
  }
});

// Apply to all routes
app.use("/api", limiter);

// Stricter limits for auth endpoints
const authLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 5,                     // 5 attempts
  message: { error: "Too many login attempts" }
});

app.use("/api/auth/login", authLimiter);

Distributed Rate Limiting with Redis

For applications running on multiple servers, use Redis (or a similar shared store) for centralized rate limiting so that every server sees the same counters.

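const redis = require("redis");
const client = redis.createClient();
// node-redis v4+ needs an explicit connection before commands are issued
const ready = client.connect();

async function slidingWindowRedis(userId, limit, windowSeconds) {
  await ready;
  const now = Date.now();
  const key = `rate_limit:${userId}`;
  const windowStart = now - windowSeconds * 1000;

  // Remove old timestamps
  await client.zRemRangeByScore(key, 0, windowStart);
  // Count requests in window
  const count = await client.zCard(key);

  // Note: the count and the add are separate commands, so concurrent requests
  // can slip past the limit. Use a Lua script if strict enforcement is required.
  if (count < limit) {
    // Random suffix keeps members unique when requests share a millisecond
    await client.zAdd(key, { score: now, value: `${now}-${Math.random()}` });
    await client.expire(key, windowSeconds);
    return true;  // Allowed
  }

  return false;  // Rate limited
}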

Response Headers

These widely used (but not formally standardized) headers inform clients of their usage. In this example, X-RateLimit-Reset is a Unix timestamp in seconds:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1719104400

When rate limited (HTTP 429):

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0

{
  "error": "Rate limit exceeded",
  "retryAfter": 60,
  "limit": 100,
  "window": "1 minute"
}
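The two responses above can be produced by one small helper. This is a sketch under stated assumptions: `build_rate_limit_response` is a hypothetical function, `remaining` means the quota left before the current request is counted, and the reset time is a Unix timestamp in seconds.

```python
import math


def build_rate_limit_response(limit, remaining, reset_epoch, now_epoch):
    """Return (status_code, headers) for a request under a rate limit.

    `remaining` is the quota left before this request is counted.
    """
    allowed = remaining > 0
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining - 1 if allowed else 0),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if allowed:
        return 200, headers
    # Tell the client how many whole seconds to wait, never less than 1
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    return 429, headers


ok = build_rate_limit_response(100, 88, 1719104400, 1719104350)
blocked = build_rate_limit_response(100, 0, 1719104400, 1719104340)

print(ok[0], ok[1]["X-RateLimit-Remaining"])    # 200 87
print(blocked[0], blocked[1]["Retry-After"])    # 429 60
```

Unlike the 429 example above, this sketch also sends X-RateLimit-Reset on rejected requests, which is a design choice rather than a requirement.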

Common Errors

  1. Rate limiting by IP only: IP-based limiting lumps together all users behind the same NAT, so one heavy user can get everyone else blocked. Combine IP-based and user-based limiting, with authenticated user IDs as the primary key.

  2. Not including a Retry-After header: Without it, clients cannot tell when to retry. Always include Retry-After in 429 responses so clients can schedule their retries instead of guessing.

  3. Window boundary spikes: The fixed window algorithm allows double bursts at window boundaries. Use sliding window or token bucket for smoother enforcement.

  4. No distributed coordination: When each server runs its own rate limiter, a client hitting multiple servers can exceed the limit. Use Redis or a similar centralized store.

  5. Rate limiting all endpoints equally: Auth endpoints need stricter limits than read endpoints. Apply different limits based on endpoint sensitivity.

  6. Not logging rate limit events: Rate limiting without monitoring hides abuse patterns. Log every rate limit trigger with the client ID and endpoint.

  7. Too aggressive throttling: Limits that are set too low block legitimate users. Monitor real usage patterns before setting limits, then start generous and tighten over time.
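Item 2 has a client-side counterpart: a well-behaved client waits for the Retry-After value when the server sends one and falls back to exponential backoff with jitter otherwise. The helper below is a minimal sketch; `retry_delay`, its parameters, and its defaults are assumptions, and it only handles Retry-After expressed in seconds (the header may also carry an HTTP date).

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    retry_after: value of the Retry-After header in seconds, if present.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    if retry_after is not None:
        return float(retry_after)          # the server knows best
    backoff = min(cap, base * (2 ** attempt))
    # Jitter between 50% and 100% of the backoff avoids synchronized retries
    return backoff * (0.5 + rng() / 2)


server_hint = retry_delay(3, retry_after="60")
no_hint = retry_delay(2, rng=lambda: 1.0)     # upper bound of the jitter range
capped = retry_delay(10, rng=lambda: 0.0)     # lower bound, after capping at 30s

print(server_hint, no_hint, capped)  # 60.0 2.0 15.0
```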

Practice Questions

  1. How does the token bucket algorithm differ from the leaky bucket algorithm?
  2. What problem does the sliding window algorithm solve that fixed window does not?
  3. Why is a shared store such as Redis needed for distributed rate limiting?
  4. What headers should a rate-limited API response include?
  5. How should rate limits differ between auth and data endpoints?

Challenge

Implement a multi-tier rate limiting system for a SaaS API. Free tier: 10 requests per minute, 100 per hour. Pro tier: 100 requests per minute, 10000 per hour. Enterprise tier: 1000 requests per minute, unlimited hourly. Use Redis for distributed state, the token bucket algorithm for smooth rate enforcement, and return proper rate limit headers. Include a mechanism for clients to check their current usage without consuming a request.

FAQ

What HTTP status code indicates rate limiting? HTTP 429 Too Many Requests is the standard status code for rate-limited requests. Always include a Retry-After header so clients know when to retry.

Should I rate limit authenticated and unauthenticated requests differently? Yes. Unauthenticated requests should have lower limits (10-20 per minute) to prevent anonymous abuse. Authenticated users can have higher limits based on their subscription tier.
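One way to express this split is to pick both the limiter key and the limit from the request's authentication state. The sketch below is illustrative only: `TieredLimiter`, `LIMITS_PER_MINUTE`, and the tier names are hypothetical, and the numbers simply echo figures mentioned in this guide.

```python
from collections import defaultdict, deque

# Hypothetical per-minute limits for each tier
LIMITS_PER_MINUTE = {"anonymous": 20, "free": 100, "premium": 1000}


class TieredLimiter:
    """Sliding window log per client, keyed by user ID when available."""
    def __init__(self, limits, window_seconds=60):
        self.limits = limits
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, now, ip, user_id=None, tier="anonymous"):
        if user_id is None:
            tier = "anonymous"          # never trust a tier without a user
            key = f"ip:{ip}"
        else:
            key = f"user:{user_id}"
        log = self.logs[key]
        while log and log[0] < now - self.window_seconds:
            log.popleft()
        if len(log) < self.limits[tier]:
            log.append(now)
            return True
        return False


limiter = TieredLimiter(LIMITS_PER_MINUTE)

# 25 anonymous requests and 25 authenticated ones from the same IP address
anon_allowed = sum(limiter.allow(0.0, ip="203.0.113.7") for _ in range(25))
user_allowed = sum(
    limiter.allow(0.0, ip="203.0.113.7", user_id="alice", tier="free")
    for _ in range(25)
)

print(anon_allowed, user_allowed)  # 20 25
```

Because the authenticated user has a separate key, exhausting the anonymous quota for a shared IP address does not block signed-in users behind the same NAT.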

How do I handle rate limiting for mobile apps? Use device IDs or authenticated user IDs instead of IP addresses, because mobile IPs change frequently as devices switch networks. Track limits on the client while it is offline so that queued requests are not sent in a single burst when connectivity returns.

What is the difference between rate limiting and throttling? Rate limiting rejects requests that exceed the limit. Throttling slows requests down, typically by queuing them, but eventually processes them. Rate limiting is commonly implemented with algorithms such as fixed window or sliding window, while throttling is commonly implemented with a leaky bucket queue.

Can rate limiting prevent DDoS attacks? Rate limiting mitigates application-layer DDoS attacks by limiting request rates per client. It does not prevent network-layer DDoS attacks. Use a dedicated DDoS protection service for network-layer attacks.
