Design a Web Crawler — System Design Guide
A web crawler (or spider) systematically browses the web to download and index pages. Search engines like Google crawl billions of pages to build their search index. Designing one tests your ability to build a distributed, polite, and scalable data collection system.
What You’ll Learn
You’ll master BFS traversal, URL frontier management, politeness policies, Bloom filter deduplication, page parsing, and distributed storage. You’ll build a complete crawler pipeline.
Why This Problem Matters
Web crawlers power search engines, data mining, price monitoring, and SEO tools. Google’s crawler processes trillions of URLs. Understanding crawler architecture teaches distributed queues, rate limiting, and massive-scale deduplication. At DodaTech, similar techniques power threat intelligence gathering in Durga Antivirus Pro.
Crawler Pipeline
flowchart LR
SEED[Seed URLs] --> Frontier[URL Frontier - Priority Queue]
Frontier --> Fetcher[Fetcher Worker]
Fetcher -->|Download| Parser[Parser]
Parser -->|Extract links| Frontier
Parser -->|Extract content| Storage[(Page Store)]
Parser -->|Extract links| Dedup[Bloom Filter]
Dedup -->|New URLs| Frontier
Frontier --> Politeness[Politeness Queue]
Politeness --> Fetcher
BFS Traversal
Web crawling uses BFS (Breadth-First Search) because:
- It discovers pages in order of distance from seed URLs
- It prioritizes important pages (closer to seeds)
- It’s easier to parallelize than DFS
from collections import deque
import asyncio
class Crawler:
def __init__(self):
self.frontier = deque()
self.visited = set()
async def crawl(self, seed_urls: list):
for url in seed_urls:
self.frontier.append(url)
while self.frontier:
# Use BFS: pop from left
url = self.frontier.popleft()
if url in self.visited:
continue
self.visited.add(url)
html = await self.fetch(url)
links = self.parse_links(html, url)
for link in links:
if link not in self.visited:
self.frontier.append(link)
async def fetch(self, url: str) -> str:
# Respect robots.txt, rate limit, download
pass
def parse_links(self, html: str, base_url: str) -> list:
# Extract href attributes, resolve relative URLs
passURL Frontier
The frontier organizes URLs for polite, priority-aware crawling:
import asyncio
import heapq
class URLFrontier:
def __init__(self):
self.queues = {} # hostname -> list of URLs
self.wait_times = {} # hostname -> next allowed fetch time
async def add_url(self, url: str):
hostname = urlparse(url).netloc
if hostname not in self.queues:
self.queues[hostname] = []
self.wait_times[hostname] = 0
self.queues[hostname].append(url)
async def get_next_url(self) -> str:
"""Get the next URL respecting politeness delays."""
now = time.time()
# Find hostname whose wait time has passed
for hostname, urls in self.queues.items():
if urls and self.wait_times[hostname] <= now:
url = urls.pop(0)
self.wait_times[hostname] = now + 1.0 # 1 second delay
return url
return NonePoliteness Policy
Don’t overwhelm servers. Per-host rate limiting:
| Rule | Reason |
|---|---|
| Max 1 request per second per host | Avoid being blocked |
Respect robots.txt crawl-delay | Legal and ethical requirement |
| Crawl during off-peak hours | Reduce server load |
| Handle 429/503 with exponential backoff | Server protection |
async def fetch_polite(url: str, session, host_delay: dict):
hostname = urlparse(url).netloc
delay = host_delay.get(hostname, 1.0)
await asyncio.sleep(delay) # Rate limit per host
try:
async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
if resp.status == 429:
retry_after = int(resp.headers.get('Retry-After', 60))
host_delay[hostname] = retry_after # Back off
return None
return await resp.text()
except Exception as e:
return NoneBloom Filter for Deduplication
A Bloom filter is a space-efficient probabilistic data structure for testing set membership:
import mmh3
from bitarray import bitarray
class BloomFilter:
def __init__(self, size: int = 10_000_000, num_hashes: int = 7):
self.size = size
self.num_hashes = num_hashes
self.bits = bitarray(size)
self.bits.setall(0)
def add(self, item: str):
for i in range(self.num_hashes):
digest = mmh3.hash(item, i) % self.size
self.bits[digest] = 1
def might_contain(self, item: str) -> bool:
for i in range(self.num_hashes):
digest = mmh3.hash(item, i) % self.size
if not self.bits[digest]:
return False
return True
# Usage: 1 billion URLs need ~1.8 GB Bloom filter with 1% false positive rate
bloom = BloomFilter(size=10_000_000, num_hashes=7)
if not bloom.might_contain(url):
bloom.add(url)
frontier.add_url(url)Page Parsing and Storage
from bs4 import BeautifulSoup
def parse_page(html: str, base_url: str) -> tuple:
"""Return (text_content, links, metadata)."""
soup = BeautifulSoup(html, 'html.parser')
# Extract text (remove HTML tags)
text = soup.get_text(separator=' ', strip=True)
# Extract links
links = []
for a_tag in soup.find_all('a', href=True):
href = urljoin(base_url, a_tag['href'])
if href.startswith('http'):
links.append(href)
# Metadata
title = soup.title.string if soup.title else ''
meta_desc = ''
meta_tag = soup.find('meta', attrs={'name': 'description'})
if meta_tag:
meta_desc = meta_tag.get('content', '')
return text, links, {'title': title, 'description': meta_desc}Common Mistakes
1. Not Respecting robots.txt
Ignoring robots.txt gets your IP banned and is unethical. Always fetch and parse robots.txt before crawling a domain.
2. Crawling Too Aggressively
Sending 100 requests/second to a small blog crashes it. Always rate limit per host. A good default is 1 request every 1-2 seconds.
3. Not Handling Redirects
HTTP 301/302 redirects are common. Follow them (up to 5 hops), but detect redirect loops with a visited set.
4. Storing Raw HTML Without Compression
Raw HTML is large. Compress with gzip before storing. Extract and index only the text content for search.
5. Using BFS Without Prioritization
Not all pages are equal. Prioritize pages from high-authority domains, recently updated pages, and pages linked from many sources.
6. No Checkpointing
If the crawler crashes mid-crawl without saving frontier state, you lose all progress. Serialize the frontier to disk every N URLs.
7. Crawling Duplicate Content
Identical pages on different URLs (URL parameters, session IDs) waste resources. Normalize URLs (remove ?utm_*, trailing slashes, #fragments).
Practice Questions
1. Why does a web crawler use BFS instead of DFS?
BFS discovers pages by distance from seeds, ensuring important pages are found first. DFS could get lost in a deep site structure.
2. What is the URL frontier and why is it important?
It’s the queue of URLs to crawl, organized for priority and politeness. Without it, a crawler has no systematic exploration strategy.
3. How does a Bloom filter help with deduplication?
It provides memory-efficient URL existence checking. A 10M-bit Bloom filter handles a billion URLs with ~1% false positive rate.
4. What is politeness policy and why is it necessary?
It limits request rate per host to avoid overloading servers. Without it, crawlers get blocked and disrupt websites.
5. Challenge: Implement priority crawling.
Score URLs by domain authority + freshness + link depth. Use a priority queue (heapq) instead of a simple deque. High-priority URLs are fetched first.
Mini Project: Focused Web Crawler
Build a focused crawler that:
- Starts from 5 seed URLs
- Uses BFS with politeness (1 req/sec per host)
- Extracts all internal and external links
- Filters links to a single domain (e.g.,
*.wikipedia.org) - Reports: total pages, average page size, depth distribution
- Saves a sitemap JSON file at the end
What’s Next
Congratulations on completing this Web Crawler design! Here’s where to go from here:
- Practice daily — Design one crawler subsystem per day
- Build a project — Implement a working focused crawler
- Explore related topics — Distributed crawling, page rank, search indexing
- Join the community — Share your crawler designs and get feedback
Remember: every expert was once a beginner. Keep crawling!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro