Learn Design a Web Crawler — System Design Guide

Design a Web Crawler — System Design Guide

DodaTech Updated Jun 15, 2026 6 min read

A web crawler (or spider) systematically browses the web to download and index pages. Search engines like Google crawl billions of pages to build their search index. Designing one tests your ability to build a distributed, polite, and scalable data collection system.

What You’ll Learn

You’ll master BFS traversal, URL frontier management, politeness policies, Bloom filter deduplication, page parsing, and distributed storage. You’ll build a complete crawler pipeline.

Why This Problem Matters

Web crawlers power search engines, data mining, price monitoring, and SEO tools. Google’s crawler processes trillions of URLs. Understanding crawler architecture teaches distributed queues, rate limiting, and massive-scale deduplication. At DodaTech, similar techniques power threat intelligence gathering in Durga Antivirus Pro.

Crawler Pipeline

    flowchart LR
  SEED[Seed URLs] --> Frontier[URL Frontier - Priority Queue]
  Frontier --> Fetcher[Fetcher Worker]
  Fetcher -->|Download| Parser[Parser]
  Parser -->|Extract links| Frontier
  Parser -->|Extract content| Storage[(Page Store)]
  Parser -->|Extract links| Dedup[Bloom Filter]
  Dedup -->|New URLs| Frontier
  Frontier --> Politeness[Politeness Queue]
  Politeness --> Fetcher

BFS Traversal

Web crawling uses BFS (Breadth-First Search) because:

It discovers pages in order of distance from seed URLs
It prioritizes important pages (closer to seeds)
It’s easier to parallelize than DFS

from collections import deque
import asyncio

class Crawler:
    def __init__(self):
        self.frontier = deque()
        self.visited = set()

    async def crawl(self, seed_urls: list):
        for url in seed_urls:
            self.frontier.append(url)

        while self.frontier:
            # Use BFS: pop from left
            url = self.frontier.popleft()
            if url in self.visited:
                continue
            self.visited.add(url)

            html = await self.fetch(url)
            links = self.parse_links(html, url)
            for link in links:
                if link not in self.visited:
                    self.frontier.append(link)

    async def fetch(self, url: str) -> str:
        # Respect robots.txt, rate limit, download
        pass

    def parse_links(self, html: str, base_url: str) -> list:
        # Extract href attributes, resolve relative URLs
        pass

URL Frontier

The frontier organizes URLs for polite, priority-aware crawling:

import asyncio
import heapq

class URLFrontier:
    def __init__(self):
        self.queues = {}  # hostname -> list of URLs
        self.wait_times = {}  # hostname -> next allowed fetch time

    async def add_url(self, url: str):
        hostname = urlparse(url).netloc
        if hostname not in self.queues:
            self.queues[hostname] = []
            self.wait_times[hostname] = 0
        self.queues[hostname].append(url)

    async def get_next_url(self) -> str:
        """Get the next URL respecting politeness delays."""
        now = time.time()
        # Find hostname whose wait time has passed
        for hostname, urls in self.queues.items():
            if urls and self.wait_times[hostname] <= now:
                url = urls.pop(0)
                self.wait_times[hostname] = now + 1.0  # 1 second delay
                return url
        return None

Politeness Policy

Don’t overwhelm servers. Per-host rate limiting:

Rule	Reason
Max 1 request per second per host	Avoid being blocked
Respect `robots.txt` crawl-delay	Legal and ethical requirement
Crawl during off-peak hours	Reduce server load
Handle 429/503 with exponential backoff	Server protection

async def fetch_polite(url: str, session, host_delay: dict):
    hostname = urlparse(url).netloc
    delay = host_delay.get(hostname, 1.0)
    await asyncio.sleep(delay)  # Rate limit per host
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status == 429:
                retry_after = int(resp.headers.get('Retry-After', 60))
                host_delay[hostname] = retry_after  # Back off
                return None
            return await resp.text()
    except Exception as e:
        return None

Bloom Filter for Deduplication

A Bloom filter is a space-efficient probabilistic data structure for testing set membership:

import mmh3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size: int = 10_000_000, num_hashes: int = 7):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bitarray(size)
        self.bits.setall(0)

    def add(self, item: str):
        for i in range(self.num_hashes):
            digest = mmh3.hash(item, i) % self.size
            self.bits[digest] = 1

    def might_contain(self, item: str) -> bool:
        for i in range(self.num_hashes):
            digest = mmh3.hash(item, i) % self.size
            if not self.bits[digest]:
                return False
        return True

# Usage: 1 billion URLs need ~1.8 GB Bloom filter with 1% false positive rate
bloom = BloomFilter(size=10_000_000, num_hashes=7)

if not bloom.might_contain(url):
    bloom.add(url)
    frontier.add_url(url)

Page Parsing and Storage

from bs4 import BeautifulSoup

def parse_page(html: str, base_url: str) -> tuple:
    """Return (text_content, links, metadata)."""
    soup = BeautifulSoup(html, 'html.parser')

    # Extract text (remove HTML tags)
    text = soup.get_text(separator=' ', strip=True)

    # Extract links
    links = []
    for a_tag in soup.find_all('a', href=True):
        href = urljoin(base_url, a_tag['href'])
        if href.startswith('http'):
            links.append(href)

    # Metadata
    title = soup.title.string if soup.title else ''
    meta_desc = ''
    meta_tag = soup.find('meta', attrs={'name': 'description'})
    if meta_tag:
        meta_desc = meta_tag.get('content', '')

    return text, links, {'title': title, 'description': meta_desc}

Common Mistakes

1. Not Respecting robots.txt

Ignoring robots.txt gets your IP banned and is unethical. Always fetch and parse robots.txt before crawling a domain.

2. Crawling Too Aggressively

Sending 100 requests/second to a small blog crashes it. Always rate limit per host. A good default is 1 request every 1-2 seconds.

3. Not Handling Redirects

HTTP 301/302 redirects are common. Follow them (up to 5 hops), but detect redirect loops with a visited set.

4. Storing Raw HTML Without Compression

Raw HTML is large. Compress with gzip before storing. Extract and index only the text content for search.

5. Using BFS Without Prioritization

Not all pages are equal. Prioritize pages from high-authority domains, recently updated pages, and pages linked from many sources.

6. No Checkpointing

If the crawler crashes mid-crawl without saving frontier state, you lose all progress. Serialize the frontier to disk every N URLs.

7. Crawling Duplicate Content

Identical pages on different URLs (URL parameters, session IDs) waste resources. Normalize URLs (remove ?utm_*, trailing slashes, #fragments).

Practice Questions

1. Why does a web crawler use BFS instead of DFS?

BFS discovers pages by distance from seeds, ensuring important pages are found first. DFS could get lost in a deep site structure.

2. What is the URL frontier and why is it important?

It’s the queue of URLs to crawl, organized for priority and politeness. Without it, a crawler has no systematic exploration strategy.

3. How does a Bloom filter help with deduplication?

It provides memory-efficient URL existence checking. A 10M-bit Bloom filter handles a billion URLs with ~1% false positive rate.

4. What is politeness policy and why is it necessary?

It limits request rate per host to avoid overloading servers. Without it, crawlers get blocked and disrupt websites.

5. Challenge: Implement priority crawling.

Score URLs by domain authority + freshness + link depth. Use a priority queue (heapq) instead of a simple deque. High-priority URLs are fetched first.

Mini Project: Focused Web Crawler

Build a focused crawler that:

Starts from 5 seed URLs
Uses BFS with politeness (1 req/sec per host)
Extracts all internal and external links
Filters links to a single domain (e.g., *.wikipedia.org)
Reports: total pages, average page size, depth distribution
Saves a sitemap JSON file at the end

What’s Next

Design a Rate Limiter

URL Shortener

Design Netflix

Congratulations on completing this Web Crawler design! Here’s where to go from here:

Practice daily — Design one crawler subsystem per day
Build a project — Implement a working focused crawler
Explore related topics — Distributed crawling, page rank, search indexing
Join the community — Share your crawler designs and get feedback

Remember: every expert was once a beginner. Keep crawling!

Previous Design Pastebin: Architecture for Text Sharing Next Design a Chat System — WhatsApp/Messenger Architecture

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse System Design Problems