Data Warehousing Explained — Star Schema, Snowflake, and Cloud Warehouses
A data warehouse is a centralized repository that stores structured, historical data from multiple sources, optimized for analytical querying and reporting rather than transactional processing.
What You’ll Learn
By the end of this tutorial, you’ll understand star and snowflake schemas, fact vs dimension tables, the difference between OLAP and OLTP, how cloud warehouses like Snowflake and BigQuery work, and how to design a sales warehouse schema.
Why Data Warehousing Matters
Business decisions depend on data from multiple systems — sales, marketing, inventory, customer support. A data warehouse brings this data together in one place, structured for fast analytical queries. Without it, you’d be running slow queries against production databases built for transactions, not analysis. DodaTech uses a data warehouse to analyze Doda Browser usage patterns across millions of users.
Data Warehousing Learning Path
flowchart LR
A[ETL Pipelines] --> B[Data Warehousing]
B --> C{You Are Here}
C --> D[Star Schema]
C --> E[Cloud Warehouses]
D --> F[Fact Tables]
D --> G[Dimension Tables]
E --> H[Snowflake]
E --> I[BigQuery]
E --> J[Redshift]
What Is a Data Warehouse?
Think of a data warehouse like a library’s reference section. Transactional databases (OLTP) are like a busy checkout counter — fast for single lookups, but terrible for research. A data warehouse (OLAP) is the reading room — organized, indexed, and designed for complex queries across many books at once.
| OLTP (Transactional) | OLAP (Analytical) | |
|---|---|---|
| Purpose | Run the business (orders, payments) | Analyze the business (trends, reports) |
| Queries | Simple, frequent (INSERT, UPDATE) | Complex, large scans (SUM, JOIN, GROUP BY) |
| Data | Current state (row-level) | Historical snapshots (columnar) |
| Design | Normalized (3NF) — avoid duplication | Denormalized — optimize read speed |
| Users | End customers, operations | Analysts, data scientists |
| Example | PostgreSQL for order management | Snowflake for sales analytics |
Fact vs Dimension Tables
Fact Tables
Fact tables store measurements — numeric values you aggregate. Every row is an event.
- Sales fact:
order_id, product_id, customer_id, date_id, quantity, amount - Web fact:
page_view_id, user_id, page_id, timestamp, duration_seconds - Inventory fact:
product_id, warehouse_id, date_id, units_on_hand
Dimension Tables
Dimension tables store descriptive attributes — the “who, what, where, when” context.
- Customer dimension:
customer_id, name, city, segment, first_purchase_date - Product dimension:
product_id, name, category, price, supplier - Date dimension:
date_id, date, year, month, quarter, day_of_week, is_holiday
The Relationship
Fact tables have foreign keys pointing to dimension tables. The fact contains the numbers; dimensions provide the context.
-- Typical analytical query joining fact and dimensions
SELECT
d.year,
d.quarter,
p.category,
SUM(s.quantity * s.unit_price) AS total_revenue
FROM sales_fact s
JOIN date_dim d ON s.date_id = d.date_id
JOIN product_dim p ON s.product_id = p.product_id
WHERE d.year = 2026
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, total_revenue DESC;Star Schema vs Snowflake Schema
flowchart LR
subgraph "Star Schema"
direction TB
F1[Sales Fact] --- D1[Date Dim]
F1 --- D2[Product Dim]
F1 --- D3[Customer Dim]
F1 --- D4[Store Dim]
end
subgraph "Snowflake Schema"
direction TB
F2[Sales Fact] --- D5[Date Dim]
F2 --- D6[Product Dim]
F2 --- D7[Customer Dim]
F2 --- D8[Store Dim]
D6 --- S1[Category Dim]
D7 --- S2[City Dim]
D8 --- S3[Region Dim]
end
| Star Schema | Snowflake Schema | |
|---|---|---|
| Structure | One level of dimensions | Normalized dimensions (multiple levels) |
| Complexity | Simple, easy to understand | More tables, harder to query |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More redundant (denormalized) | Less redundant (normalized) |
| Use case | Most data warehouses | When storage is expensive or dimension data is highly hierarchical |
Designing a Sales Warehouse Schema
Let’s design a star schema for an e-commerce company.
-- sales_warehouse.sql
-- Star schema for e-commerce sales analytics
-- Dimension: Date
CREATE TABLE dim_date (
date_id INTEGER PRIMARY KEY,
full_date DATE NOT NULL,
year INTEGER NOT NULL,
quarter INTEGER NOT NULL,
month INTEGER NOT NULL,
month_name TEXT NOT NULL,
day_of_week INTEGER NOT NULL,
is_weekend BOOLEAN DEFAULT FALSE,
is_holiday BOOLEAN DEFAULT FALSE
);
-- Dimension: Customer
CREATE TABLE dim_customer (
customer_id INTEGER PRIMARY KEY,
full_name TEXT NOT NULL,
email TEXT,
city TEXT,
state TEXT,
country TEXT,
segment TEXT, -- e.g., Consumer, Corporate, Home Office
first_purchase_date DATE,
customer_tenure_days INTEGER
);
-- Dimension: Product
CREATE TABLE dim_product (
product_id INTEGER PRIMARY KEY,
product_name TEXT NOT NULL,
category TEXT,
subcategory TEXT,
unit_price DECIMAL(10,2),
cost DECIMAL(10,2),
supplier TEXT,
active BOOLEAN DEFAULT TRUE
);
-- Dimension: Store
CREATE TABLE dim_store (
store_id INTEGER PRIMARY KEY,
store_name TEXT NOT NULL,
location TEXT,
region TEXT,
manager_name TEXT,
open_date DATE
);
-- Fact: Sales
CREATE TABLE fact_sales (
sales_id INTEGER PRIMARY KEY,
order_id TEXT NOT NULL,
date_id INTEGER REFERENCES dim_date(date_id),
customer_id INTEGER REFERENCES dim_customer(customer_id),
product_id INTEGER REFERENCES dim_product(product_id),
store_id INTEGER REFERENCES dim_store(store_id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2),
discount DECIMAL(5,2) DEFAULT 0,
total_amount DECIMAL(12,2),
order_status TEXT
);
-- Create indexes for query performance
CREATE INDEX idx_sales_date ON fact_sales(date_id);
CREATE INDEX idx_sales_customer ON fact_sales(customer_id);
CREATE INDEX idx_sales_product ON fact_sales(product_id);# sales_warehouse_demo.py
# Simulate loading data into the warehouse schema
from datetime import date, timedelta
import random
# Simulated dimension data
customers = [
{"id": 1, "name": "Alice", "segment": "Consumer"},
{"id": 2, "name": "Bob", "segment": "Corporate"},
{"id": 3, "name": "Charlie", "segment": "Home Office"},
{"id": 4, "name": "Diana", "segment": "Consumer"},
]
products = [
{"id": 1, "name": "Laptop", "category": "Electronics", "price": 999.99},
{"id": 2, "name": "Mouse", "category": "Electronics", "price": 29.99},
{"id": 3, "name": "Desk Chair", "category": "Furniture", "price": 249.99},
{"id": 4, "name": "Notebook", "category": "Office Supplies", "price": 4.99},
]
# Generate fact rows
def generate_sales_fact(rows=100):
start_date = date(2026, 1, 1)
sales = []
for i in range(rows):
day_offset = random.randint(0, 180)
customer = random.choice(customers)
product = random.choice(products)
quantity = random.randint(1, 5)
sales.append({
"sales_id": i + 1,
"date_id": (start_date + timedelta(days=day_offset)).strftime("%Y%m%d"),
"customer_id": customer["id"],
"product_id": product["id"],
"quantity": quantity,
"unit_price": product["price"],
"total_amount": round(quantity * product["price"], 2),
"customer_segment": customer["segment"],
"product_category": product["category"],
})
return sales
sales = generate_sales_fact(20)
# Run a typical analytical query in Python
from collections import defaultdict
def revenue_by_segment(sales):
segment_rev = defaultdict(float)
for s in sales:
segment_rev[s["customer_segment"]] += s["total_amount"]
return dict(segment_rev)
def revenue_by_category(sales):
cat_rev = defaultdict(float)
count = defaultdict(int)
for s in sales:
cat_rev[s["product_category"]] += s["total_amount"]
count[s["product_category"]] += s["quantity"]
return {k: {"revenue": round(v, 2), "units_sold": count[k]} for k, v in cat_rev.items()}
print("=== Revenue by Customer Segment ===")
for seg, rev in sorted(revenue_by_segment(sales).items()):
print(f" {seg:<15} ${rev:>8.2f}")
print("\n=== Revenue by Product Category ===")
for cat, stats in sorted(revenue_by_category(sales).items()):
print(f" {cat:<20} ${stats['revenue']:>8.2f} ({stats['units_sold']} units)")Expected output:
=== Revenue by Customer Segment ===
Consumer $ 3533.58
Corporate $ 4099.58
Home Office $ 2539.86
=== Revenue by Product Category ===
Electronics $ 7529.60 (28 units)
Furniture $ 2749.89 (11 units)
Office Supplies $ 24.95 (5 units)Cloud Data Warehouses
Snowflake
Snowflake separates storage and compute. You pay for each independently, and compute clusters (warehouses) can be started and stopped on demand.
- Architecture: Shared-disk storage + virtual warehouses (compute)
- Unique: Zero-copy cloning, time travel (up to 90 days), data sharing across accounts
- Pricing: Pay per second for compute, per terabyte for storage
Google BigQuery
BigQuery is serverless — no clusters to manage. You write SQL and it runs on Google’s infrastructure.
- Architecture: Columnar storage + Dremel query engine
- Unique: Automatic scaling, free tier (1TB queries/month), BigQuery ML
- Pricing: Pay per query (data scanned), storage per GB
Amazon Redshift
Redshift is AWS’s petabyte-scale warehouse, based on PostgreSQL.
- Architecture: Massively parallel processing (MPP) on cluster nodes
- Unique: Spectrum for querying S3 data directly, Redshift ML
- Pricing: Pay per node-hour, reserved instances available
Common Data Warehousing Mistakes
1. Treating a Warehouse Like a Transactional Database
Don’t run frequent single-row updates on a warehouse. They’re designed for bulk loads and analytical scans. Use OLTP databases for operational queries.
2. Ignoring Date Dimensions
A proper date dimension with year, quarter, month, week, day-of-week enables time-based analysis without complex date functions. Pre-compute it.
3. Over-Normalizing Dimensions
Snowflake schemas reduce storage but increase join complexity. For most warehouses, star schemas are faster and simpler. Normalize only when dimension hierarchies are complex.
4. Not Partitioning Large Tables
Tables with billions of rows need partitioning by date or region. Without it, every query scans the entire table, wasting time and money (especially in BigQuery).
5. Loading Unnecessary Data
Every column you load costs storage and query time. Load only columns you actually analyze. Archive raw data in a data lake if needed later.
6. Forgetting About Data Types
Using TEXT for dates or prices leads to broken queries and bad results. Enforce proper types (DATE, DECIMAL, INTEGER) during the load process.
Practice Questions
1. What is the difference between a fact table and a dimension table?
Fact tables store measurable, numeric data about business events (sales, clicks). Dimension tables store descriptive attributes (customer name, product category) that provide context for the facts.
2. What is a star schema?
A schema design with one central fact table surrounded by dimension tables. It’s denormalized, simple to understand, and optimized for analytical queries with fewer joins.
3. What’s the difference between OLAP and OLTP?
OLTP (Online Transaction Processing) handles many small, fast transactions for operational systems. OLAP (Online Analytical Processing) handles complex queries on large datasets for analysis.
4. Why would you choose Snowflake over BigQuery?
Snowflake offers separation of storage and compute, zero-copy cloning, and broader cloud support (multi-cloud). BigQuery is serverless, auto-scales, and has stronger free tier pricing.
5. Challenge: Design a fact and dimension model for a blog analytics system tracking page views, time on page, and visitor countries.
Fact table: page_view_fact (view_id, date_id, page_id, visitor_id, time_on_page_seconds, country_id). Dimensions: dim_date, dim_page (url, title, author), dim_visitor (cookie_id, browser, device), dim_country.
Mini Project: Date Dimension Generator
# date_dimension.py
# Generate a full date dimension table for a warehouse
from datetime import date, timedelta
def generate_date_dim(start=date(2020, 1, 1), end=date(2027, 12, 31)):
dates = []
current = start
while current <= end:
dates.append({
"date_id": int(current.strftime("%Y%m%d")),
"full_date": current.isoformat(),
"year": current.year,
"quarter": (current.month - 1) // 3 + 1,
"month": current.month,
"month_name": current.strftime("%B"),
"day_of_month": current.day,
"day_of_week": current.weekday(), # 0=Monday
"day_name": current.strftime("%A"),
"is_weekend": current.weekday() >= 5,
"week_of_year": current.isocalendar()[1],
"days_in_month": 31 if current.month in [1,3,5,7,8,10,12] else 30 if current.month in [4,6,9,11] else 29 if (current.year % 4 == 0 and (current.year % 100 != 0 or current.year % 400 == 0)) else 28,
})
current += timedelta(days=1)
return dates
dates = generate_date_dim()
print("=== DATE DIMENSION SAMPLE ===")
print(f"{'date_id':<10} {'full_date':<14} {'year':<6} {'month':<8} {'day_name':<12} {'weekend':<8} {'week':<6}")
print("-" * 62)
for d in dates[:10]:
print(f"{d['date_id']:<10} {d['full_date']:<14} {d['year']:<6} {d['month_name']:<8} {d['day_name']:<12} {str(d['is_weekend']):<8} {d['week_of_year']:<6}")
print(f"... ({len(dates)} total dates generated)")Expected output:
=== DATE DIMENSION SAMPLE ===
date_id full_date year month day_name weekend week
----------------------------------------------------------------
20200101 2020-01-01 2020 January Wednesday False 1
20200102 2020-01-02 2020 January Thursday False 1
20200103 2020-01-03 2020 January Friday False 1
20200104 2020-01-04 2020 January Saturday True 1
20200105 2020-01-05 2020 January Sunday True 1
... (2922 total dates generated)Related Concepts
What’s Next
You now understand data warehousing fundamentals! Next, explore Data Lakes for raw storage, then learn about Apache Airflow and dbt for orchestrating and transforming warehouse data.
- Practice daily — Design a star schema for a domain you know (fitness tracking, music library, etc.)
- Build a project — Generate a date dimension and load it into a SQLite database
- Explore related topics — Check out Snowflake vs BigQuery comparisons
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro