Build a Web Scraper with Python (Step by Step)

DodaTech Updated Jun 19, 2026 8 min read

Build a web scraper with Python that extracts product data from an e-commerce site, handles pagination, respects robots.txt rules, and saves the results to CSV and JSON formats.

What You’ll Build

You’ll build scraper-cli, a Python tool that scrapes book data from Books to Scrape (a sandbox site built for learning web scraping). It extracts title, price, rating, availability, and URL from every catalogue page, follows pagination automatically, rate-limits its requests, and writes clean CSV and JSON files. Category and description come from each book’s detail page, which the tutorial fetches for the first 10 books only. At DodaTech, similar scraping patterns power Durga Antivirus Pro’s threat intelligence feed collection.

Why Web Scraping Matters

Not every website has an API. When you need data from a site that doesn’t offer one (product prices, news articles, job listings, real estate listings), web scraping is often the only practical option. It’s used in competitive analysis, research, data science, and content aggregation. Combined with ethical practices such as rate limiting and respecting robots.txt, it’s a powerful tool, though whether a given scrape is legal depends on the site’s terms of service and your jurisdiction (see the FAQ).

Prerequisites

Python 3.8+ installed
Basic HTML knowledge to understand page structure
Familiarity with JSON files and CSV

Step 1: Setup

mkdir scraper-cli
cd scraper-cli
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install requests beautifulsoup4 lxml pandas   # pandas is only used by verify.py in Step 5

Project structure:

scraper-cli/
├── scraper.py       # Core scraping logic
├── utils.py         # Helpers (rate limiting, file output, robots.txt check)
├── run.py           # Entry point (Step 4)
├── verify.py        # Output check (Step 5)
└── output/          # Scraped data goes here

Step 2: The Helper Module

# utils.py
import time
import json
import csv
from pathlib import Path
from urllib.parse import urlparse
from typing import List, Dict

class RateLimiter:
    """Simple rate limiter — waits between requests."""
    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.last_request = 0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.time()

def save_to_csv(data: List[Dict], filename: str):
    """Save scraped data to CSV file."""
    if not data:
        print("No data to save")
        return
    path = Path("output") / filename
    path.parent.mkdir(exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {path}")

def save_to_json(data: List[Dict], filename: str):
    """Save scraped data to JSON file."""
    path = Path("output") / filename
    path.parent.mkdir(exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} items to {path}")

_robots_cache = {}  # Parsed robots.txt per site root, so each site is fetched only once

def is_allowed(url: str, user_agent: str = "Mozilla/5.0") -> bool:
    """Check robots.txt; returns True if scraping is allowed."""
    from urllib.robotparser import RobotFileParser
    parsed = urlparse(url)
    root = f"{parsed.scheme}://{parsed.netloc}"
    if root not in _robots_cache:
        rp = RobotFileParser()
        rp.set_url(f"{root}/robots.txt")
        try:
            rp.read()
        except Exception:
            print(f"Could not read robots.txt: {root}/robots.txt")
            rp = None  # Default to allowed if robots.txt is unreachable
        _robots_cache[root] = rp
    rp = _robots_cache[root]
    return True if rp is None else rp.can_fetch(user_agent, url)
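
To see what a check like is_allowed() decides without touching the network, the same RobotFileParser can be fed rules directly through its parse() method. This is a minimal sketch: the rules, the example.com URLs, and the ExampleBot user agent are made up for illustration and are not Books to Scrape's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this demonstration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

rp = build_parser(SAMPLE_ROBOTS)
allowed_public = rp.can_fetch("ExampleBot", "https://example.com/catalogue/page-1.html")
allowed_private = rp.can_fetch("ExampleBot", "https://example.com/private/report.html")
delay = rp.crawl_delay("ExampleBot")

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 2
```

If a site publishes a Crawl-delay, passing that value to RateLimiter is one way to honour it.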

Step 3: The Scraper

# scraper.py
import requests
from bs4 import BeautifulSoup
from utils import RateLimiter, is_allowed

BASE_URL = "https://books.toscrape.com"

class BookScraper:
    def __init__(self, delay: float = 1.0):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        self.limiter = RateLimiter(delay)
        self.books = []

    def scrape_all(self):
        """Scrape all pages of the catalogue."""
        page = 1
        while True:
            url = f"{BASE_URL}/catalogue/page-{page}.html"
            self.limiter.wait()

            print(f"Scraping page {page}...")
            books_on_page = self.scrape_page(url)

            if not books_on_page:
                print(f"No books found on page {page} — end of catalogue")
                break

            self.books.extend(books_on_page)
            page += 1

        print(f"Scraped {len(self.books)} books total")
        return self.books

    def scrape_page(self, url: str) -> list:
        """Scrape a single catalogue page."""
        if not is_allowed(url):
            print(f"Blocked by robots.txt: {url}")
            return []

        response = self.session.get(url, timeout=10)  # Fail fast instead of hanging on a slow server
        if response.status_code == 404:
            return []
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "lxml")
        books = []

        for article in soup.select("article.product_pod"):
            book = {
                "title": self._get_title(article),
                "price": self._get_price(article),
                "rating": self._get_rating(article),
                "availability": self._get_availability(article),
                "url": self._get_url(article),
            }
            books.append(book)

        return books

    def _get_title(self, article) -> str:
        title_tag = article.select_one("h3 a")
        return title_tag.get("title", title_tag.text.strip()) if title_tag else ""

    def _get_price(self, article) -> float:
        price_tag = article.select_one("p.price_color")
        if price_tag:
            price_str = price_tag.text.strip().replace("£", "").replace("Â", "")
            return float(price_str)
        return 0.0

    def _get_rating(self, article) -> int:
        rating_tag = article.select_one("p.star-rating")
        if rating_tag:
            classes = rating_tag.get("class", [])
            rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
            for cls in classes:
                if cls in rating_map:
                    return rating_map[cls]
        return 0

    def _get_availability(self, article) -> str:
        avail_tag = article.select_one("p.instock.availability")
        return avail_tag.text.strip() if avail_tag else "Unknown"

    def _get_url(self, article) -> str:
        url_tag = article.select_one("h3 a")
        if url_tag and url_tag.get("href"):
            return BASE_URL + "/catalogue/" + url_tag["href"]
        return ""

    def scrape_book_detail(self, url: str) -> dict:
        """Scrape additional details from a book's individual page."""
        self.limiter.wait()
        response = self.session.get(url, timeout=10)  # Same timeout as the catalogue requests
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")

        description_tag = soup.select_one("#product_description ~ p")
        description = description_tag.text.strip() if description_tag else ""

        category_tag = soup.select_one(".breadcrumb li:nth-child(3) a")
        category = category_tag.text.strip() if category_tag else ""

        return {"description": description, "category": category}

    def close(self):
        self.session.close()
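
The .replace("Â", "") call in _get_price looks like a workaround for an encoding mismatch: when the UTF-8 bytes of the pound sign are decoded as Latin-1, a stray "Â" appears in front of it. The sketch below reproduces that effect with plain strings and shows a regex-based parse that tolerates either form. The helper name parse_price is hypothetical and not part of the scraper above.

```python
import re

def parse_price(raw: str) -> float:
    """Pull the first decimal number out of a price string (hypothetical helper)."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else 0.0

clean = "£51.77"
# Decode the UTF-8 bytes with the wrong codec to reproduce the stray character.
garbled = clean.encode("utf-8").decode("latin-1")

print(repr(garbled))         # 'Â£51.77'
print(parse_price(clean))    # 51.77
print(parse_price(garbled))  # 51.77
```

If the mismatch is indeed the cause, setting response.encoding = "utf-8" before reading response.text is another option worth trying.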

Step 4: Run the Scraper

# run.py
from scraper import BookScraper
from utils import save_to_csv, save_to_json

if __name__ == "__main__":
    scraper = BookScraper(delay=0.5)  # Be polite — 500ms between requests
    try:
        books = scraper.scrape_all()

        # Enrich with detail pages (first 10 only, for speed)
        for book in books[:10]:
            if book["url"]:
                details = scraper.scrape_book_detail(book["url"])
                book.update(details)
                print(f"  Got details for: {book['title'][:50]}...")

        save_to_csv(books, "books.csv")
        save_to_json(books, "books.json")

        print(f"\nDone! Scraped {len(books)} books.")
        print(f"Sample: {books[0]['title']} — £{books[0]['price']}")

    finally:
        scraper.close()

Run it:

python run.py

Expected output (first few lines):

Scraping page 1...
Scraping page 2...
Scraping page 3...
...
Scraped 1000 books total
  Got details for: A Light in the Attic...
  Got details for: Tipping the Velvet...
Saved 1000 rows to output/books.csv
Saved 1000 items to output/books.json

Done! Scraped 1000 books.
Sample: A Light in the Attic — £51.77
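
Because run.py enriches only the first 10 books, the list passed to save_to_csv mixes rows with and without description and category. csv.DictWriter fills missing keys with an empty string but raises ValueError when a row has a key that is not in fieldnames, so the output works here only because the first row is an enriched one. A more defensive variant, sketched with made-up rows and a hypothetical collect_fieldnames helper, builds the header from every row:

```python
import csv
import io

def collect_fieldnames(rows):
    """Return every key seen across rows, in first-seen order (hypothetical helper)."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)

# Made-up rows: the second one carries an extra key, as an enriched book would.
rows = [
    {"title": "Book A", "price": 10.0},
    {"title": "Book B", "price": 12.5, "category": "Fiction"},
]

buffer = io.StringIO()  # Stands in for the output file
writer = csv.DictWriter(buffer, fieldnames=collect_fieldnames(rows), restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With data[0].keys() as the header, the same two rows would fail on "category".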

Step 5: Verify the Output

# verify.py
import pandas as pd
df = pd.read_csv("output/books.csv")
print(df.head())
print(f"\nTotal books: {len(df)}")
print(f"Average price: £{df['price'].mean():.2f}")
print(f"Top rating: {df['rating'].max()} stars")
print(f"Categories: {df['category'].nunique()}")

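Expected output (illustrative and abridged; exact figures depend on the live site, and only the 10 enriched rows have a description and category):

                  title  price  rating availability  ...               description            category
0  A Light in the Attic  51.77       3     In stock  ...  It's hard to imagine...              Poetry
1    Tipping the Velvet  53.74       1     In stock  ...                      ...  Historical Fiction
2            Soumission  50.10       1     In stock  ...                      ...             Fiction
...
Total books: 1000
Average price: £35.07
Top rating: 5 stars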

Architecture


flowchart TD
    A[Start] --> B[Check robots.txt]
    B --> C{Allowed?}
    C -->|No| D[Stop paginating]
    C -->|Yes| E[Send HTTP request]
    E --> F[Parse HTML with BeautifulSoup]
    F --> G[Extract: title, price, rating, availability, url]
    G --> H{Books found on page?}
    H -->|Yes| I[Add to list]
    I --> J[Increment page counter]
    J --> B
    H -->|No| D
    D --> K[Enrich first 10 books with detail pages]
    K --> L[Save CSV + JSON]
    L --> M[Done]
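
The pagination loop in this flow can be exercised without a network by swapping the real page fetch for a stub. In this sketch, fake_scrape_page and its three-page catalogue are invented for illustration, and the max_pages cap is an added assumption; only the loop shape mirrors scrape_all().

```python
from typing import Callable, Dict, List

def paginate(scrape_page: Callable[[int], List[Dict]], max_pages: int = 1000) -> List[Dict]:
    """Collect items page by page until a page comes back empty."""
    items: List[Dict] = []
    page = 1
    while page <= max_pages:  # Safety cap so a misbehaving site cannot loop forever
        batch = scrape_page(page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

def fake_scrape_page(page: int) -> List[Dict]:
    """Stand-in for a real fetch: three pages of two items, then nothing."""
    if page > 3:
        return []  # Plays the role of the 404 past the last page
    return [{"title": f"Book {page}-{n}"} for n in (1, 2)]

books = paginate(fake_scrape_page)
print(len(books))          # 6
print(books[-1]["title"])  # Book 3-2
```

Passing the fetch function in as an argument also makes the loop easy to unit test.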

Common Errors

1. HTTP 403 Forbidden The server is blocking your scraper. Solutions: update the User-Agent header (requests default is python-requests/2.x, which some sites block), add a Referer header, or increase the delay between requests. Some sites require cookies — use requests.Session() to maintain them.

2. BeautifulSoup returns empty results The HTML structure might be different than expected. First, print response.text[:500] to verify you’re getting actual HTML (not a JSON API or error page). Then use the browser’s DevTools to inspect the real element classes — sites often change their markup. The classes in our scraper (product_pod, price_color) match Books to Scrape specifically.

3. Rate limiting / IP ban You’re scraping too fast. Increase the delay parameter (start with 2 seconds) and add random jitter: time.sleep(delay + random.uniform(0, 0.5)). For large projects, consider rotating proxies. Separately, pass a timeout to every request so that a slow server cannot hang the scraper indefinitely.
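
The jitter idea can be folded into a rate limiter like the one in utils.py. This is a sketch rather than a drop-in replacement: the class name JitteredRateLimiter and its jitter parameter are assumptions, and the demo uses tiny delays so that it finishes quickly.

```python
import random
import time

class JitteredRateLimiter:
    """Waits at least `delay` seconds plus a random extra of up to `jitter` seconds."""

    def __init__(self, delay: float = 2.0, jitter: float = 0.5):
        self.delay = delay
        self.jitter = jitter
        self.last_request = None  # Monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if needed and return how long this call slept."""
        target = self.delay + random.uniform(0, self.jitter)
        slept = 0.0
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            if elapsed < target:
                slept = target - elapsed
                time.sleep(slept)
        self.last_request = time.monotonic()
        return slept

# Demo with small values; a real scraper would use seconds, not hundredths.
limiter = JitteredRateLimiter(delay=0.05, jitter=0.02)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
total = time.monotonic() - start
print(f"3 calls took {total:.3f}s")
```

The first call returns immediately; each later call waits a slightly different amount, which makes the request pattern look less mechanical.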

Practice Questions

1. What does response.raise_for_status() do? It raises requests.exceptions.HTTPError for 4xx or 5xx status codes. This stops execution immediately if the server returns an error, rather than silently proceeding with a broken response.

2. How does the scraper know when to stop paginating? When scrape_page() returns an empty list. This happens when the page returns a 404 (end of catalogue), when no article.product_pod elements exist, or when robots.txt disallows the URL. This approach works for sequential pagination but fails for sites with infinite scroll.

3. Why do we check robots.txt? It’s both ethical and legally prudent. robots.txt tells crawlers which paths are off-limits. While not legally binding, ignoring it can lead to IP bans and, in some jurisdictions, legal liability under computer fraud laws.

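4. Challenge: Add retry logic Wrap the HTTP request in a retry loop: try up to 3 times with exponential backoff (wait 1s, then 2s; a longer loop would continue with 4s). Only retry on 5xx errors (server problems) and connection errors, not on 4xx (client errors). Alternatively, mount a requests.adapters.HTTPAdapter on the session. Note that an integer max_retries only covers failed connections, so retrying on status codes requires passing a urllib3 Retry object with status_forcelist set.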

5. Challenge: Scrape with threads Use concurrent.futures.ThreadPoolExecutor to scrape multiple pages simultaneously. Be careful with rate limiting — use a shared threading.Lock() around the rate limiter. Compare performance: single-threaded vs 4-thread vs 8-thread.
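
One possible shape for challenges 4 and 5 together, using only the standard library and a stubbed fetch so it runs offline. Everything named here (TransientError, LockedRateLimiter, retry_with_backoff, fake_fetch) is hypothetical, and the delays are shortened for the demo; a real version would call session.get and treat 5xx responses and connection errors as the transient case.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class TransientError(Exception):
    """Stand-in for a 5xx response or a connection error (hypothetical)."""

class LockedRateLimiter:
    """Rate limiter that is safe to share between threads."""

    def __init__(self, delay: float):
        self.delay = delay
        self._lock = threading.Lock()
        self._last = None

    def wait(self):
        # Sleeping while holding the lock serialises the waits on purpose,
        # so request starts stay roughly `delay` seconds apart across threads.
        with self._lock:
            now = time.monotonic()
            if self._last is not None and now - self._last < self.delay:
                time.sleep(self.delay - (now - self._last))
            self._last = time.monotonic()

def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay, then 2x, then 4x, and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))

limiter = LockedRateLimiter(delay=0.02)
failed_once = set()
failed_lock = threading.Lock()

def fake_fetch(page: int) -> str:
    """Stub fetch: every even page fails once with a simulated 503, then succeeds."""
    limiter.wait()
    with failed_lock:
        first_try = page not in failed_once
        failed_once.add(page)
    if page % 2 == 0 and first_try:
        raise TransientError(f"simulated 503 on page {page}")
    return f"page-{page}"

def fetch_with_retry(page: int) -> str:
    return retry_with_backoff(lambda: fake_fetch(page), base_delay=0.01)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_with_retry, range(1, 7)))

print(results)
```

Because every thread goes through the same limiter, adding workers mainly helps when parsing or slow responses dominate; the request rate itself stays capped by delay.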

FAQ

Is web scraping legal?

It depends on jurisdiction and the site’s terms of service. Scraping public data for personal use is generally legal. Commercial scraping of copyrighted content may violate terms of service. Always check robots.txt and the site’s ToS. Never bypass authentication or scrape personal data without consent.

How do I handle JavaScript-rendered content?

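Use Selenium or Playwright instead of requests. These tools control a real browser (headless or visible) and can execute JavaScript. They’re slower but necessary for single-page apps and dynamic content. With Playwright, install the package (pip install playwright), download a browser with playwright install chromium, then launch it inside a with sync_playwright() as p: block by calling p.chromium.launch().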

How do I avoid being blocked?

Rotate User-Agent strings across requests. Use proxies (residential proxies for large-scale scraping). Add random delays (1-5 seconds). Limit concurrent connections. Cache responses to avoid re-fetching. Respect Cache-Control headers.
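
Of these, caching is the easiest to try offline. The sketch below wraps any fetch function in a small in-memory cache keyed by URL. CachingFetcher and fake_fetch are hypothetical names, and a real version would also need an expiry policy that respects Cache-Control.

```python
from typing import Callable, Dict

class CachingFetcher:
    """Wraps a fetch function and remembers responses by URL (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._cache: Dict[str, str] = {}
        self.network_calls = 0  # Counts how often the wrapped fetch actually ran

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.network_calls += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]

def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request."""
    return f"<html>content of {url}</html>"

fetcher = CachingFetcher(fake_fetch)
first = fetcher.get("https://example.com/a")
second = fetcher.get("https://example.com/a")  # Served from the cache
fetcher.get("https://example.com/b")
print(fetcher.network_calls)  # 2
```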

Next Steps

Store scraped data in MongoDB instead of CSV
Learn data science workflows to analyze scraped datasets
Explore Selenium for JavaScript-heavy sites
Build a scheduling system with CI/CD to run scrapers automatically
