Learn Data Warehousing Explained — Star Schema, Snowflake, and Cloud Warehouses

Data Warehousing Explained — Star Schema, Snowflake, and Cloud Warehouses

DodaTech Updated Jun 15, 2026 9 min read

A data warehouse is a centralized repository that stores structured, historical data from multiple sources, optimized for analytical querying and reporting rather than transactional processing.

What You’ll Learn

By the end of this tutorial, you’ll understand star and snowflake schemas, fact vs dimension tables, the difference between OLAP and OLTP, how cloud warehouses like Snowflake and BigQuery work, and how to design a sales warehouse schema.

Why Data Warehousing Matters

Business decisions depend on data from multiple systems — sales, marketing, inventory, customer support. A data warehouse brings this data together in one place, structured for fast analytical queries. Without it, you’d be running slow queries against production databases built for transactions, not analysis. DodaTech uses a data warehouse to analyze Doda Browser usage patterns across millions of users.

Data Warehousing Learning Path


flowchart LR
  A[ETL Pipelines] --> B[Data Warehousing]
  B --> C{You Are Here}
  C --> D[Star Schema]
  C --> E[Cloud Warehouses]
  D --> F[Fact Tables]
  D --> G[Dimension Tables]
  E --> H[Snowflake]
  E --> I[BigQuery]
  E --> J[Redshift]

Prerequisites: SQL fundamentals. Understanding of ETL pipelines helps contextually.

What Is a Data Warehouse?

Think of a data warehouse like a library’s reference section. Transactional databases (OLTP) are like a busy checkout counter — fast for single lookups, but terrible for research. A data warehouse (OLAP) is the reading room — organized, indexed, and designed for complex queries across many books at once.

	OLTP (Transactional)	OLAP (Analytical)
Purpose	Run the business (orders, payments)	Analyze the business (trends, reports)
Queries	Simple, frequent (INSERT, UPDATE)	Complex, large scans (SUM, JOIN, GROUP BY)
Data	Current state (row-level)	Historical snapshots (columnar)
Design	Normalized (3NF) — avoid duplication	Denormalized — optimize read speed
Users	End customers, operations	Analysts, data scientists
Example	PostgreSQL for order management	Snowflake for sales analytics

Fact vs Dimension Tables

Fact Tables

Fact tables store measurements — numeric values you aggregate. Every row is an event.

Sales fact: order_id, product_id, customer_id, date_id, quantity, amount
Web fact: page_view_id, user_id, page_id, timestamp, duration_seconds
Inventory fact: product_id, warehouse_id, date_id, units_on_hand

Dimension Tables

Dimension tables store descriptive attributes — the “who, what, where, when” context.

Customer dimension: customer_id, name, city, segment, first_purchase_date
Product dimension: product_id, name, category, price, supplier
Date dimension: date_id, date, year, month, quarter, day_of_week, is_holiday

The Relationship

Fact tables have foreign keys pointing to dimension tables. The fact contains the numbers; dimensions provide the context.

-- Typical analytical query joining fact and dimensions
SELECT
  d.year,
  d.quarter,
  p.category,
  SUM(s.quantity * s.unit_price) AS total_revenue
FROM sales_fact s
JOIN date_dim d ON s.date_id = d.date_id
JOIN product_dim p ON s.product_id = p.product_id
WHERE d.year = 2026
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, total_revenue DESC;

Star Schema vs Snowflake Schema


flowchart LR
  subgraph "Star Schema"
    direction TB
    F1[Sales Fact] --- D1[Date Dim]
    F1 --- D2[Product Dim]
    F1 --- D3[Customer Dim]
    F1 --- D4[Store Dim]
  end
  subgraph "Snowflake Schema"
    direction TB
    F2[Sales Fact] --- D5[Date Dim]
    F2 --- D6[Product Dim]
    F2 --- D7[Customer Dim]
    F2 --- D8[Store Dim]
    D6 --- S1[Category Dim]
    D7 --- S2[City Dim]
    D8 --- S3[Region Dim]
  end

	Star Schema	Snowflake Schema
Structure	One level of dimensions	Normalized dimensions (multiple levels)
Complexity	Simple, easy to understand	More tables, harder to query
Query performance	Faster (fewer joins)	Slower (more joins)
Storage	More redundant (denormalized)	Less redundant (normalized)
Use case	Most data warehouses	When storage is expensive or dimension data is highly hierarchical

Designing a Sales Warehouse Schema

Let’s design a star schema for an e-commerce company.

-- sales_warehouse.sql
-- Star schema for e-commerce sales analytics

-- Dimension: Date
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    full_date DATE NOT NULL,
    year INTEGER NOT NULL,
    quarter INTEGER NOT NULL,
    month INTEGER NOT NULL,
    month_name TEXT NOT NULL,
    day_of_week INTEGER NOT NULL,
    is_weekend BOOLEAN DEFAULT FALSE,
    is_holiday BOOLEAN DEFAULT FALSE
);

-- Dimension: Customer
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    email TEXT,
    city TEXT,
    state TEXT,
    country TEXT,
    segment TEXT,  -- e.g., Consumer, Corporate, Home Office
    first_purchase_date DATE,
    customer_tenure_days INTEGER
);

-- Dimension: Product
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category TEXT,
    subcategory TEXT,
    unit_price DECIMAL(10,2),
    cost DECIMAL(10,2),
    supplier TEXT,
    active BOOLEAN DEFAULT TRUE
);

-- Dimension: Store
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    store_name TEXT NOT NULL,
    location TEXT,
    region TEXT,
    manager_name TEXT,
    open_date DATE
);

-- Fact: Sales
CREATE TABLE fact_sales (
    sales_id INTEGER PRIMARY KEY,
    order_id TEXT NOT NULL,
    date_id INTEGER REFERENCES dim_date(date_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2),
    discount DECIMAL(5,2) DEFAULT 0,
    total_amount DECIMAL(12,2),
    order_status TEXT
);

-- Create indexes for query performance
CREATE INDEX idx_sales_date ON fact_sales(date_id);
CREATE INDEX idx_sales_customer ON fact_sales(customer_id);
CREATE INDEX idx_sales_product ON fact_sales(product_id);

# sales_warehouse_demo.py
# Simulate loading data into the warehouse schema
from datetime import date, timedelta
import random

# Simulated dimension data
customers = [
    {"id": 1, "name": "Alice", "segment": "Consumer"},
    {"id": 2, "name": "Bob", "segment": "Corporate"},
    {"id": 3, "name": "Charlie", "segment": "Home Office"},
    {"id": 4, "name": "Diana", "segment": "Consumer"},
]

products = [
    {"id": 1, "name": "Laptop", "category": "Electronics", "price": 999.99},
    {"id": 2, "name": "Mouse", "category": "Electronics", "price": 29.99},
    {"id": 3, "name": "Desk Chair", "category": "Furniture", "price": 249.99},
    {"id": 4, "name": "Notebook", "category": "Office Supplies", "price": 4.99},
]

# Generate fact rows
def generate_sales_fact(rows=100):
    start_date = date(2026, 1, 1)
    sales = []
    for i in range(rows):
        day_offset = random.randint(0, 180)
        customer = random.choice(customers)
        product = random.choice(products)
        quantity = random.randint(1, 5)
        sales.append({
            "sales_id": i + 1,
            "date_id": (start_date + timedelta(days=day_offset)).strftime("%Y%m%d"),
            "customer_id": customer["id"],
            "product_id": product["id"],
            "quantity": quantity,
            "unit_price": product["price"],
            "total_amount": round(quantity * product["price"], 2),
            "customer_segment": customer["segment"],
            "product_category": product["category"],
        })
    return sales

sales = generate_sales_fact(20)

# Run a typical analytical query in Python
from collections import defaultdict

def revenue_by_segment(sales):
    segment_rev = defaultdict(float)
    for s in sales:
        segment_rev[s["customer_segment"]] += s["total_amount"]
    return dict(segment_rev)

def revenue_by_category(sales):
    cat_rev = defaultdict(float)
    count = defaultdict(int)
    for s in sales:
        cat_rev[s["product_category"]] += s["total_amount"]
        count[s["product_category"]] += s["quantity"]
    return {k: {"revenue": round(v, 2), "units_sold": count[k]} for k, v in cat_rev.items()}

print("=== Revenue by Customer Segment ===")
for seg, rev in sorted(revenue_by_segment(sales).items()):
    print(f"  {seg:<15} ${rev:>8.2f}")

print("\n=== Revenue by Product Category ===")
for cat, stats in sorted(revenue_by_category(sales).items()):
    print(f"  {cat:<20} ${stats['revenue']:>8.2f} ({stats['units_sold']} units)")

Expected output:

=== Revenue by Customer Segment ===
  Consumer        $ 3533.58
  Corporate       $ 4099.58
  Home Office     $ 2539.86

=== Revenue by Product Category ===
  Electronics            $ 7529.60 (28 units)
  Furniture              $ 2749.89 (11 units)
  Office Supplies        $  24.95 (5 units)

Cloud Data Warehouses

Snowflake

Snowflake separates storage and compute. You pay for each independently, and compute clusters (warehouses) can be started and stopped on demand.

Architecture: Shared-disk storage + virtual warehouses (compute)
Unique: Zero-copy cloning, time travel (up to 90 days), data sharing across accounts
Pricing: Pay per second for compute, per terabyte for storage

Google BigQuery

BigQuery is serverless — no clusters to manage. You write SQL and it runs on Google’s infrastructure.

Architecture: Columnar storage + Dremel query engine
Unique: Automatic scaling, free tier (1TB queries/month), BigQuery ML
Pricing: Pay per query (data scanned), storage per GB

Amazon Redshift

Redshift is AWS’s petabyte-scale warehouse, based on PostgreSQL.

Architecture: Massively parallel processing (MPP) on cluster nodes
Unique: Spectrum for querying S3 data directly, Redshift ML
Pricing: Pay per node-hour, reserved instances available

Common Data Warehousing Mistakes

1. Treating a Warehouse Like a Transactional Database

Don’t run frequent single-row updates on a warehouse. They’re designed for bulk loads and analytical scans. Use OLTP databases for operational queries.

2. Ignoring Date Dimensions

A proper date dimension with year, quarter, month, week, day-of-week enables time-based analysis without complex date functions. Pre-compute it.

3. Over-Normalizing Dimensions

Snowflake schemas reduce storage but increase join complexity. For most warehouses, star schemas are faster and simpler. Normalize only when dimension hierarchies are complex.

4. Not Partitioning Large Tables

Tables with billions of rows need partitioning by date or region. Without it, every query scans the entire table, wasting time and money (especially in BigQuery).

5. Loading Unnecessary Data

Every column you load costs storage and query time. Load only columns you actually analyze. Archive raw data in a data lake if needed later.

6. Forgetting About Data Types

Using TEXT for dates or prices leads to broken queries and bad results. Enforce proper types (DATE, DECIMAL, INTEGER) during the load process.

Practice Questions

1. What is the difference between a fact table and a dimension table?

Fact tables store measurable, numeric data about business events (sales, clicks). Dimension tables store descriptive attributes (customer name, product category) that provide context for the facts.

2. What is a star schema?

A schema design with one central fact table surrounded by dimension tables. It’s denormalized, simple to understand, and optimized for analytical queries with fewer joins.

3. What’s the difference between OLAP and OLTP?

OLTP (Online Transaction Processing) handles many small, fast transactions for operational systems. OLAP (Online Analytical Processing) handles complex queries on large datasets for analysis.

4. Why would you choose Snowflake over BigQuery?

Snowflake offers separation of storage and compute, zero-copy cloning, and broader cloud support (multi-cloud). BigQuery is serverless, auto-scales, and has stronger free tier pricing.

5. Challenge: Design a fact and dimension model for a blog analytics system tracking page views, time on page, and visitor countries.

Fact table: page_view_fact (view_id, date_id, page_id, visitor_id, time_on_page_seconds, country_id). Dimensions: dim_date, dim_page (url, title, author), dim_visitor (cookie_id, browser, device), dim_country.

Mini Project: Date Dimension Generator

# date_dimension.py
# Generate a full date dimension table for a warehouse
from datetime import date, timedelta

def generate_date_dim(start=date(2020, 1, 1), end=date(2027, 12, 31)):
    dates = []
    current = start
    while current <= end:
        dates.append({
            "date_id": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "month_name": current.strftime("%B"),
            "day_of_month": current.day,
            "day_of_week": current.weekday(),  # 0=Monday
            "day_name": current.strftime("%A"),
            "is_weekend": current.weekday() >= 5,
            "week_of_year": current.isocalendar()[1],
            "days_in_month": 31 if current.month in [1,3,5,7,8,10,12] else 30 if current.month in [4,6,9,11] else 29 if (current.year % 4 == 0 and (current.year % 100 != 0 or current.year % 400 == 0)) else 28,
        })
        current += timedelta(days=1)
    return dates

dates = generate_date_dim()

print("=== DATE DIMENSION SAMPLE ===")
print(f"{'date_id':<10} {'full_date':<14} {'year':<6} {'month':<8} {'day_name':<12} {'weekend':<8} {'week':<6}")
print("-" * 62)
for d in dates[:10]:
    print(f"{d['date_id']:<10} {d['full_date']:<14} {d['year']:<6} {d['month_name']:<8} {d['day_name']:<12} {str(d['is_weekend']):<8} {d['week_of_year']:<6}")
print(f"... ({len(dates)} total dates generated)")

Expected output:

=== DATE DIMENSION SAMPLE ===
date_id    full_date      year   month    day_name     weekend  week
----------------------------------------------------------------
20200101   2020-01-01     2020   January  Wednesday   False    1
20200102   2020-01-02     2020   January  Thursday    False    1
20200103   2020-01-03     2020   January  Friday      False    1
20200104   2020-01-04     2020   January  Saturday    True     1
20200105   2020-01-05     2020   January  Sunday      True     1
... (2922 total dates generated)

Related Concepts

ETL Pipelines

Data Lakes

Data Modeling

What’s Next

You now understand data warehousing fundamentals! Next, explore Data Lakes for raw storage, then learn about Apache Airflow and dbt for orchestrating and transforming warehouse data.

Practice daily — Design a star schema for a domain you know (fitness tracking, music library, etc.)
Build a project — Generate a date dimension and load it into a SQLite database
Explore related topics — Check out Snowflake vs BigQuery comparisons

Remember: every expert was once a beginner. Keep coding!

Previous ETL Pipelines Explained — Extract, Transform, Load with Python Examples Next Data Lakes Explained — Lakehouse Architecture and Schema-on-Read

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering