Build a Web Scraper with Python (Step by Step)
Build a web scraper with Python that extracts product data from an e-commerce site, handles pagination, respects robots.txt rules, and saves the results to CSV and JSON formats.
What You’ll Build
You’ll build scraper-cli, a Python tool that scrapes book data from Books to Scrape (a sandbox site built for learning web scraping). It extracts title, price, rating, and availability from every catalogue page, pulls the description and category from individual book pages, handles pagination automatically, rate-limits its requests, and writes clean CSV and JSON files. At DodaTech, similar scraping patterns power Durga Antivirus Pro’s threat intelligence feed collection.
Why Web Scraping Matters
Not every website has an API. When you need data from a site that doesn’t offer one (product prices, news articles, job listings, real estate listings), web scraping is the answer. It’s used in competitive analysis, research, data science, and content aggregation. Combined with ethical practices such as rate limiting and respecting robots.txt, it’s a powerful tool, although whether a particular scrape is lawful depends on the site’s terms and your jurisdiction.
Prerequisites
- Python 3.8+ installed
- Basic HTML knowledge to understand page structure
- Familiarity with JSON and CSV files
Step 1: Setup
mkdir scraper-cli
cd scraper-cli
python -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 lxml

Project structure:
scraper-cli/
├── scraper.py # Core scraping logic
├── utils.py # Helpers (rate limiting, file output)
└── output/ # Scraped data goes here

Step 2: The Base Scraper
# utils.py
import csv
import json
import time
from pathlib import Path
from typing import Dict, List
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Simple rate limiter: waits between requests."""

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.last_request = 0

    def wait(self):
        # Sleep only for the part of the delay that has not already elapsed.
        elapsed = time.time() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.time()


def save_to_csv(data: List[Dict], filename: str):
    """Save scraped data to CSV file."""
    if not data:
        print("No data to save")
        return
    path = Path("output") / filename
    path.parent.mkdir(exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Column names come from the first row; keys missing from
        # later rows are written as empty cells.
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {path}")


def save_to_json(data: List[Dict], filename: str):
    """Save scraped data to JSON file."""
    path = Path("output") / filename
    path.parent.mkdir(exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} items to {path}")


def is_allowed(url: str, user_agent: str = "Mozilla/5.0") -> bool:
    """Check robots.txt; returns True if scraping is allowed."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()  # Fetches robots.txt on every call
        return rp.can_fetch(user_agent, url)
    except Exception:
        print(f"Could not read robots.txt: {robots_url}")
        return True  # Default to allowed if robots.txt is unreachable

Step 3: The Scraper
# scraper.py
import requests
from bs4 import BeautifulSoup

from utils import RateLimiter, is_allowed

BASE_URL = "https://books.toscrape.com"


class BookScraper:
    def __init__(self, delay: float = 1.0):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        self.limiter = RateLimiter(delay)
        self.books = []

    def scrape_all(self):
        """Scrape all pages of the catalogue."""
        page = 1
        while True:
            url = f"{BASE_URL}/catalogue/page-{page}.html"
            self.limiter.wait()
            print(f"Scraping page {page}...")
            books_on_page = self.scrape_page(url)
            if not books_on_page:
                print(f"No books found on page {page} — end of catalogue")
                break
            self.books.extend(books_on_page)
            page += 1
        print(f"Scraped {len(self.books)} books total")
        return self.books

    def scrape_page(self, url: str) -> list:
        """Scrape a single catalogue page."""
        if not is_allowed(url):
            print(f"Blocked by robots.txt: {url}")
            return []
        # A timeout keeps a slow server from hanging the scraper.
        response = self.session.get(url, timeout=10)
        if response.status_code == 404:
            return []  # Past the last page
        response.raise_for_status()
        # Passing bytes lets the parser pick up the page's declared encoding.
        soup = BeautifulSoup(response.content, "lxml")
        books = []
        for article in soup.select("article.product_pod"):
            book = {
                "title": self._get_title(article),
                "price": self._get_price(article),
                "rating": self._get_rating(article),
                "availability": self._get_availability(article),
                "url": self._get_url(article),
            }
            books.append(book)
        return books

    def _get_title(self, article) -> str:
        # The link text is truncated on the page; the title attribute is complete.
        title_tag = article.select_one("h3 a")
        return title_tag.get("title", title_tag.text.strip()) if title_tag else ""

    def _get_price(self, article) -> float:
        price_tag = article.select_one("p.price_color")
        if price_tag:
            # Strip the currency sign, plus the stray "Â" left by a mis-decoded "£".
            price_str = price_tag.text.strip().replace("£", "").replace("Â", "")
            return float(price_str)
        return 0.0

    def _get_rating(self, article) -> int:
        # The rating is encoded as a class name, e.g. class="star-rating Three".
        rating_tag = article.select_one("p.star-rating")
        if rating_tag:
            classes = rating_tag.get("class", [])
            rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
            for cls in classes:
                if cls in rating_map:
                    return rating_map[cls]
        return 0

    def _get_availability(self, article) -> str:
        avail_tag = article.select_one("p.instock.availability")
        return avail_tag.text.strip() if avail_tag else "Unknown"

    def _get_url(self, article) -> str:
        url_tag = article.select_one("h3 a")
        if url_tag and url_tag.get("href"):
            # Links on catalogue pages are relative to /catalogue/.
            return BASE_URL + "/catalogue/" + url_tag["href"]
        return ""

    def scrape_book_detail(self, url: str) -> dict:
        """Scrape additional details from a book's individual page."""
        self.limiter.wait()
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "lxml")
        description_tag = soup.select_one("#product_description ~ p")
        description = description_tag.text.strip() if description_tag else ""
        category_tag = soup.select_one(".breadcrumb li:nth-child(3) a")
        category = category_tag.text.strip() if category_tag else ""
        return {"description": description, "category": category}

    def close(self):
        self.session.close()

Step 4: Run the Scraper
# run.py
from scraper import BookScraper
from utils import save_to_csv, save_to_json

if __name__ == "__main__":
    scraper = BookScraper(delay=0.5)  # Be polite: 500 ms between requests
    try:
        books = scraper.scrape_all()
        # Enrich with detail pages (first 10 only, for speed)
        for book in books[:10]:
            if book["url"]:
                details = scraper.scrape_book_detail(book["url"])
                book.update(details)
                print(f" Got details for: {book['title'][:50]}...")
        save_to_csv(books, "books.csv")
        save_to_json(books, "books.json")
        print(f"\nDone! Scraped {len(books)} books.")
        if books:  # Avoid an IndexError when nothing was scraped
            print(f"Sample: {books[0]['title']} — £{books[0]['price']}")
    finally:
        scraper.close()

Run it:
python run.py

Expected output (first few lines):
Scraping page 1...
Scraping page 2...
Scraping page 3...
...
Scraped 1000 books total
Got details for: A Light in the Attic...
Got details for: Tipping the Velvet...
Saved 1000 rows to output/books.csv
Saved 1000 items to output/books.json
Done! Scraped 1000 books.
Sample: A Light in the Attic — £51.77

Step 5: Verify the Output
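# verify.py
# Requires pandas, which Step 1 does not install: pip install pandas
import pandas as pd

df = pd.read_csv("output/books.csv")
print(df.head())
print(f"\nTotal books: {len(df)}")
print(f"Average price: £{df['price'].mean():.2f}")
print(f"Top rating: {df['rating'].max()} stars")
# category is only filled in for the books enriched from detail pages
print(f"Categories: {df['category'].nunique()}")

Expected output (illustrative):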
                  title  price  rating availability  ...              description category
0  A Light in the Attic  51.77       3     In stock  ...  It's hard to imagine...   Poetry
1    Tipping the Velvet  53.74       1     In stock  ...                            Poetry
2            Soumission  50.10       1     In stock  ...                            Poetry
...
Total books: 1000
Average price: £50.00
Top rating: 5 stars

Architecture
flowchart TD
A[Start] --> B[Check robots.txt]
B --> C{Allowed?}
C -->|No| D[Skip URL]
C -->|Yes| E[Send HTTP request]
E --> F[Parse HTML with BeautifulSoup]
F --> G[Extract: title, price, rating, availability, url]
G --> H[Save to list]
H --> I{Next page?}
I -->|Yes| J[Increment page counter]
J --> B
I -->|No| K[Enrich with detail pages]
K --> L[Save CSV + JSON]
L --> M[Done]
Common Errors
1. HTTP 403 Forbidden
The server is blocking your scraper. Solutions: update the User-Agent header (requests default is python-requests/2.x, which some sites block), add a Referer header, or increase the delay between requests. Some sites require cookies — use requests.Session() to maintain them.
2. BeautifulSoup returns empty results
The HTML structure might be different than expected. First, print response.text[:500] to verify you’re getting actual HTML (not a JSON API or error page). Then use the browser’s DevTools to inspect the real element classes — sites often change their markup. The classes in our scraper (product_pod, price_color) match Books to Scrape specifically.
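A quick way to confirm that the classes you select on are present in the response is to count the tags that carry them. The sketch below uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages; ClassCounter and count_class are hypothetical helper names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html: str, class_name: str) -> int:
    """Return how many tags in html carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up markup that mimics two catalogue entries.
sample = """
<article class="product_pod"><p class="price_color">£51.77</p></article>
<article class="product_pod"><p class="price_color">£53.74</p></article>
"""
print(count_class(sample, "product_pod"))   # 2
print(count_class(sample, "product-card"))  # 0
```

A count of zero for a class you expected usually means the markup changed or the response is not the page you think it is.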
3. Rate limiting / IP ban
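You’re scraping too fast. Increase the delay parameter (start with 2 seconds) and add random jitter, for example time.sleep(delay + random.uniform(0, 0.5)), so requests do not arrive at a fixed rhythm. For large projects, rotating proxies spread the load across IP addresses. Separately, pass the timeout parameter to each request so a slow server cannot hang the scraper.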
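The jitter idea can be folded into the rate limiter itself. The following is a minimal sketch, not part of the tutorial's utils.py: JitteredRateLimiter and its jitter parameter are hypothetical names, and the demo uses tiny delays so it finishes quickly.

```python
import random
import time


class JitteredRateLimiter:
    """Waits at least `delay` seconds between requests, plus random jitter."""

    def __init__(self, delay: float = 2.0, jitter: float = 0.5):
        self.delay = delay
        self.jitter = jitter
        self.last_request = None  # no request made yet

    def next_pause(self) -> float:
        # Base delay plus a random extra in [0, jitter].
        return self.delay + random.uniform(0, self.jitter)

    def wait(self) -> float:
        """Sleep if needed and return how long we slept."""
        pause = 0.0
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            pause = max(0.0, self.next_pause() - elapsed)
            time.sleep(pause)
        self.last_request = time.monotonic()
        return pause


limiter = JitteredRateLimiter(delay=0.05, jitter=0.02)  # tiny values for a quick demo
for page in range(1, 4):
    slept = limiter.wait()
    print(f"page {page}: slept {slept:.3f}s")
```

The first call returns immediately because there is no earlier request to space out from; every later call pauses for a slightly different length of time.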
Practice Questions
1. What does response.raise_for_status() do?
It raises requests.exceptions.HTTPError for 4xx or 5xx status codes. This stops execution immediately if the server returns an error, rather than silently proceeding with a broken response.
2. How does the scraper know when to stop paginating?
When scrape_page() returns an empty list. This happens when the page returns a 404 (end of catalogue) or when no article.product_pod elements exist. This approach works for sequential pagination but fails for sites with infinite scroll.
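The stop condition can be seen in isolation with a stand-in for scrape_page(). In this sketch collect_pages and fake_fetch are hypothetical names, and the max_pages cap is an added safety net that the tutorial's scrape_all() does not have.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[dict]], max_pages: int = 1000) -> List[dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    items: List[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # empty list: 404 or no matching elements
            break
        items.extend(batch)
    return items


# Stand-in for scrape_page(): three pages of two items each, then nothing.
def fake_fetch(page: int) -> List[dict]:
    if page > 3:
        return []
    return [{"page": page, "n": n} for n in (1, 2)]


books = collect_pages(fake_fetch)
print(len(books))  # 6
```

The cap matters on sites that return a non-empty page for any page number, where an empty-list check alone would never fire.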
3. Why do we check robots.txt?
It’s both ethical and legally prudent. robots.txt tells crawlers which paths are off-limits. While not legally binding, ignoring it can lead to IP bans and, in some jurisdictions, legal liability under computer fraud laws.
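To see how the rules are evaluated, you can feed robots.txt text straight into the standard library's RobotFileParser. The rules, the example.com URLs, and the "scraper-cli" user agent string below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site serves these at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in-memory text instead of fetching

allowed = rp.can_fetch("scraper-cli", "https://example.com/catalogue/page-1.html")
blocked = rp.can_fetch("scraper-cli", "https://example.com/private/report.html")
print(allowed)  # True
print(blocked)  # False
```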
4. Challenge: Add retry logic
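Wrap the HTTP request in a retry loop: try up to 3 times with exponential backoff (1s, 2s, 4s). Only retry on 5xx errors (server problems) and connection errors, not on 4xx (client errors). Alternatively, mount a requests.adapters.HTTPAdapter on the session with max_retries set to a urllib3 Retry object that lists the 5xx codes in status_forcelist and sets a backoff_factor; a plain integer max_retries only retries failed connections, not error responses.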
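One possible shape for the hand-rolled loop is sketched below. It is not tied to requests: fetch is a hypothetical callable that returns a (status_code, body) tuple, and the sketch assumes "up to 3 times" means three retries after the first attempt, which yields the 1s, 2s, 4s pauses. With requests you would catch requests.exceptions.ConnectionError rather than the built-in ConnectionError used here.

```python
import time
from typing import Callable, List, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(); retry on 5xx or ConnectionError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            status, body = fetch()
            if status < 500:
                return status, body  # success or 4xx: retrying will not help
            problem = f"HTTP {status}"
        except ConnectionError as exc:
            problem = f"connection error: {exc}"
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    raise RuntimeError(f"giving up after {retries + 1} attempts ({problem})")


# Stand-in for a real request: fails twice with 503, then succeeds.
responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])
pauses: List[float] = []  # records the requested pauses instead of sleeping
result = fetch_with_retry(lambda: next(responses), sleep=pauses.append)
print(result)  # (200, '<html>ok</html>')
print(pauses)  # [1.0, 2.0]
```

Passing sleep as a parameter keeps the demo instant and makes the backoff schedule easy to check.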
5. Challenge: Scrape with threads
Use concurrent.futures.ThreadPoolExecutor to scrape multiple pages simultaneously. Be careful with rate limiting — use a shared threading.Lock() around the rate limiter. Compare performance: single-threaded vs 4-thread vs 8-thread.
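A starting point for this challenge might look like the sketch below. ThreadSafeRateLimiter and fake_scrape_page are hypothetical names; the fake function stands in for BookScraper.scrape_page() so the example runs without network access, and the delay is kept tiny for the demo.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ThreadSafeRateLimiter:
    """Rate limiter whose wait() is serialized by a lock."""

    def __init__(self, delay: float):
        self.delay = delay
        self._lock = threading.Lock()
        self._last = float("-inf")  # no request made yet
        self.stamps: List[float] = []  # when each request was released

    def wait(self) -> None:
        # Holding the lock while sleeping forces threads through one at a time.
        with self._lock:
            elapsed = time.monotonic() - self._last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
            self._last = time.monotonic()
            self.stamps.append(self._last)


limiter = ThreadSafeRateLimiter(delay=0.05)


def fake_scrape_page(page: int) -> List[str]:
    """Stand-in for BookScraper.scrape_page(); no network involved."""
    limiter.wait()
    return [f"book-{page}-{n}" for n in (1, 2)]


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, whichever thread finishes first.
    pages = list(pool.map(fake_scrape_page, range(1, 6)))

books = [book for page in pages for book in page]
print(len(books))  # 10
```

Because every thread passes through the same limiter, requests still start at most once per delay; any speed-up comes from overlapping the time spent waiting on responses, which is worth keeping in mind when comparing 1, 4, and 8 threads.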
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro