Modern Data Warehousing — Snowflake, BigQuery, Redshift Guide
Modern data warehousing decouples storage from compute, auto-scales infrastructure, and provides SQL analytics on petabyte-scale data without managing servers.
What You’ll Learn
By the end of this tutorial, you’ll understand how Snowflake, BigQuery, and Redshift work under the hood, how to design tables for each platform, and how to choose the right warehouse for your workload.
Why Modern Warehousing Matters
Traditional on-premise data warehouses required capacity planning, hardware procurement, and DBA teams. Modern cloud warehouses let you store petabytes and pay per query. Snowflake auto-suspends when idle, BigQuery processes terabytes in seconds, and Redshift Spectrum queries data lakes directly. Doda Browser uses Snowflake for its analytics platform, reducing query time by 80% and cutting storage costs by 60% compared to its legacy warehouse.
Data Warehousing Learning Path
flowchart LR
A[Big Data Overview] --> B[Data Warehousing]
B --> C{You Are Here}
C --> D[Snowflake]
C --> E[BigQuery]
C --> F[Redshift]
D --> G[Storage/Compute]
E --> H[Slots & Partitioning]
F --> I[Distribution & Sort]
What Is a Modern Data Warehouse?
Think of the old warehouse as a single giant refrigerator: cooling and food are in one box. If you want more capacity, you buy a bigger fridge (vertical scaling). A modern warehouse is like a restaurant kitchen with a walk-in cooler and separate cooking stations: storage (the cooler) scales independently from compute (the stoves). You can double the stoves without buying more cooler space.
The three major cloud warehouses implement this differently:
| Platform | Storage | Compute | Pricing Model |
|---|---|---|---|
| Snowflake | Compressed columnar (S3) | Virtual warehouses | Compute credits + storage |
| BigQuery | Colossus (Google FS) | Slot pool | Per-slot or on-demand |
| Redshift | RA3 managed storage | Node-based clusters | Node-hours + storage |
Snowflake Deep Dive
Snowflake’s architecture separates storage (compressed, columnar, in S3) from compute (virtual warehouses). You can have multiple virtual warehouses querying the same data without contention.
Virtual Warehouses
-- Create a virtual warehouse
CREATE WAREHOUSE analytics_wh
WITH WAREHOUSE_SIZE = 'MEDIUM' -- XSMALL, SMALL, MEDIUM, LARGE, XLARGE...
AUTO_SUSPEND = 300 -- Suspend after 5 min idle
AUTO_RESUME = TRUE -- Resume automatically on query
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3
SCALING_POLICY = 'ECONOMY'; -- 'ECONOMY' or 'STANDARD'
-- Create tables
CREATE OR REPLACE TABLE orders (
order_id INTEGER AUTOINCREMENT,
customer_id INTEGER,
amount DECIMAL(10,2),
order_date DATE,
status VARCHAR(20)
)
CLUSTER BY (order_date);
INSERT INTO orders (customer_id, amount, order_date, status) VALUES
(1001, 250.00, '2026-06-01', 'completed'),
(1002, 45.50, '2026-06-01', 'pending'),
(1003, 1200.00, '2026-06-02', 'completed');
-- Query using the warehouse
USE WAREHOUSE analytics_wh;
SELECT status, COUNT(*), SUM(amount)
FROM orders
WHERE order_date >= '2026-06-01'
GROUP BY status;Expected output:
+-----------+----------+-------------+
| STATUS | COUNT(*) | SUM(AMOUNT) |
+-----------+----------+-------------+
| completed | 2 | 1450.00 |
| pending | 1 | 45.50 |
+-----------+----------+-------------+Clustering and Auto-Suspend
-- Snowflake clustering for large tables
ALTER TABLE orders CLUSTER BY (order_date);
-- Recluster manually
ALTER TABLE orders RECLUSTER;
-- Check clustering state
SELECT SYSTEM$CLUSTERING_INFORMATION('orders');
-- Auto-suspend: warehouse suspends after inactivity
-- SHOW WAREHOUSES will show "Suspended" state
SHOW WAREHOUSES LIKE 'analytics_wh';
-- Resume automatically on first query
SELECT COUNT(*) FROM orders; -- Warehouse auto-resumesTime Travel and Zero-Copy Cloning
-- Time travel: query data as it existed 1 hour ago
SELECT * FROM orders AT (TIMESTAMP => DATEADD(hour, -1, CURRENT_TIMESTAMP()));
-- Undrop a dropped table
DROP TABLE orders;
UNDROP TABLE orders;
-- Zero-copy clone (instant, no additional storage)
CREATE OR REPLACE TABLE orders_dev CLONE orders;BigQuery Deep Dive
BigQuery uses Google’s Colossus file system for storage and a slot-based compute model. Slots are units of computational capacity (one slot = one thread).
Partitioned and Clustered Tables
-- Partitioned table (by date)
CREATE OR REPLACE TABLE `project.dataset.orders`
PARTITION BY order_date
CLUSTER BY customer_id
AS
SELECT * FROM `project.dataset.raw_orders`;
-- Query against partitioned table (prunes partitions)
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.orders`
WHERE order_date BETWEEN '2026-06-01' AND '2026-06-30'
GROUP BY customer_id;
-- Information schema for partition metadata
SELECT table_name, partition_id, total_rows, total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`;Slots and On-Demand vs Flat-Rate
-- On-demand: unlimited slots, pay per byte processed
-- This query costs ~$0.005 per GB processed
SELECT COUNT(DISTINCT customer_id) FROM orders;
-- With clustering, BQ processes fewer bytes
SELECT COUNT(*) FROM orders WHERE order_date = '2026-06-01';
-- Clustered: scans only the partition + cluster block
-- Unclustered: scans entire table
-- Monitor slot usage
SELECT * FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE DATE(creation_time) = CURRENT_DATE();BigQuery Optimization Example
-- BAD: Full table scan
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY day;
-- Processes all partitions from 2025
-- GOOD: Pruned with partitioning
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
WHERE order_date BETWEEN '2026-06-01' AND '2026-06-30'
GROUP BY day;
-- Processes only June 2026 partition
-- MATERIALIZED VIEW for common aggregations
CREATE MATERIALIZED VIEW `project.dataset.daily_revenue` AS
SELECT order_date, SUM(amount) as revenue, COUNT(*) as orders_count
FROM orders
GROUP BY order_date;Redshift Deep Dive
Amazon Redshift uses a shared-nothing MPP architecture. Data is distributed across nodes, and queries run in parallel.
Distribution Styles
-- DISTSTYLE EVEN: rows distributed round-robin
-- Best for: staging tables, no joins
CREATE TABLE orders_even (
order_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10,2)
) DISTSTYLE EVEN;
-- DISTSTYLE KEY: rows distributed by hash of key column
-- Best for: fact tables joined on the distribution key
CREATE TABLE orders_key (
order_id INTEGER,
customer_id INTEGER DISTKEY, -- Distribute by customer
amount DECIMAL(10,2)
) DISTSTYLE KEY;
-- DISTSTYLE ALL: full copy on every node
-- Best for: small dimension tables (< 1M rows)
CREATE TABLE customers (
customer_id INTEGER,
name VARCHAR(100),
city VARCHAR(50)
) DISTSTYLE ALL;Sort Keys
-- COMPOUND sort key: most selective column first
-- Best for: queries that filter on prefix of sort keys
CREATE TABLE orders_compound (
order_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10,2),
order_date DATE,
status VARCHAR(20)
)
DISTSTYLE KEY DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, status);
-- INTERLEAVED sort key: equal weight to all columns
-- Best for: queries filtering on multiple columns equally
CREATE TABLE orders_interleaved (
order_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10,2),
order_date DATE,
status VARCHAR(20)
)
DISTSTYLE KEY DISTKEY (customer_id)
INTERLEAVED SORTKEY (order_date, status, customer_id);Redshift Spectrum
Query data in S3 without loading it into Redshift:
-- Create external schema for S3 data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftSpectrumRole';
-- Query S3 Parquet data directly
SELECT s3.customer_id, SUM(r.amount) as total
FROM spectrum_schema.s3_orders s3
JOIN redshift_orders r ON s3.customer_id = r.customer_id
WHERE s3.order_date >= '2026-01-01'
GROUP BY s3.customer_id;Platform Comparison
| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Storage/Compute | Fully separated | Fully separated | Partially (RA3) |
| Scaling | Multi-cluster warehouses | Slot pool scaling | Node resize |
| Auto-suspend | Yes (configurable) | N/A (serverless) | No (always on) |
| Concurrency | Unlimited (warehouses) | Unlimited (slots) | Limited (node queues) |
| Data lake query | External tables | External tables | Spectrum |
| Pricing | Credit-based | Per-byte or slots | Node-hour reserved |
| SQL compatibility | ANSI SQL + extensions | ANSI SQL + extensions | PostgreSQL-based |
| Best for | Multi-cloud, variable workloads | Ad-hoc analytics, GCP | Fixed workloads, AWS |
Modern Data Warehouse Architecture
flowchart TB
subgraph "Source Data"
S3[(S3 Data Lake)]
API[REST APIs]
ST[Streaming Kafka]
end
subgraph "Modern Warehouse Layer"
subgraph "Snowflake"
SW[Virtual Warehouse]
SS[(Cloud Storage)]
SC[Cloud Services]
end
subgraph "BigQuery"
BS[Slot Pool]
BC[(Colossus)]
BI[BigLake Engine]
end
subgraph "Redshift"
RN[Compute Nodes]
RS[(RA3 Managed Storage)]
RSP[Spectrum Layer]
end
end
subgraph "BI & Analytics"
BI_T[Tableau / Power BI]
NOTE[Notebooks / Python]
APPS[Custom Apps]
end
S3 --> SS
S3 --> BC
S3 --> RS
API --> SW
ST --> BS
ST --> RSP
SS --> SW
BC --> BS
RS --> RN
RSP --> RN
SW --> BI_T
BS --> NOTE
RN --> APPS
Common Modern Warehousing Mistakes
1. Not Setting Auto-Suspend in Snowflake
A warehouse left running 24/7 costs 24 hours of credits even with no queries. Set AUTO_SUSPEND = 300 and AUTO_RESUME = TRUE. For development warehouses, use smaller sizes (XSMALL) and suspend aggressively (60 seconds).
2. Ignoring Partition Pruning in BigQuery
Queries without partition filters scan the entire table. A 1TB table costs $5 per query on-demand. With partition pruning on a daily partitioned table, a single-day query scans ~3GB and costs $0.015 — a 300x savings.
3. Wrong Distribution Style in Redshift
Using DISTSTYLE EVEN on a fact table that joins with DISTSTYLE ALL dimensions causes broadcast joins that copy data to every node. Use DISTSTYLE KEY on the joining column to collocate joins.
4. Not Using Clustering on Large Snowflake Tables
Snowflake micro-partitions automatically cluster, but tables over 1TB benefit from explicit CLUSTER BY on high-cardinality filter columns. Without clustering, queries scan all micro-partitions instead of pruning.
5. Over-Provisioning Redshift Nodes
Redshift charges by node-hour regardless of usage. An 8-node cluster costs 8x a single node. Start with 1-2 nodes and scale up as needed. Use STL_QUERY to monitor if nodes are underutilized.
6. Mixing Storage and Compute in Snowflake
One virtual warehouse for all workloads (ETL, reporting, ad-hoc) causes contention. ETL jobs consume credits and slow down dashboard queries. Create separate warehouses: loading_wh for ETL, analytics_wh for dashboards, dev_wh for development.
7. Not Compressing Data Before Loading
Snowflake and Redshift compress columnar data automatically, but CSV ingestion skips compression benefits. Convert CSVs to Parquet or Snowflake’s native table format before loading. Parquet reduces storage 2-5x over CSV.
Practice Questions
1. How does Snowflake separate storage and compute?
Snowflake stores all data in compressed columnar format in cloud object storage (S3, Azure Blob). Compute is handled by virtual warehouses — independent clusters that can be created, resized, and suspended without affecting stored data.
2. What are BigQuery slots and how does flat-rate pricing work?
Slots are units of compute capacity (one slot = one thread). On-demand pricing has no slot limit but charges per byte processed. Flat-rate pricing reserves a fixed number of slots (100, 500, 1000+) for a predictable monthly cost.
3. When would you use Redshift Spectrum instead of loading data?
Spectrum queries data in S3 without loading into Redshift tables. Use it for archival data queried infrequently, data lake exploration, or when loading would exceed Redshift storage capacity.
4. What is Snowflake clustering and when should you use it?
Clustering reorganizes data within micro-partitions based on specified columns. Use it on tables over 1TB with frequent queries filtering on those columns. Small tables (< 1GB) don’t benefit from clustering.
5. Challenge: Design a data warehouse architecture for a SaaS company with 5TB of daily event data, real-time dashboards, and ad-hoc analytics. Compare Snowflake, BigQuery, and Redshift approaches.
Use Snowflake: one LARGE multi-cluster warehouse for real-time ingestion, one MEDIUM warehouse for dashboards (auto-suspend 5 min), and one XSMALL warehouse for ad-hoc queries. Cluster event tables by event_date. For BigQuery: use partitioned + clustered tables, flat-rate 500 slots for predictable pricing. For Redshift: use RA3 nodes with DISTKEY on customer_id and COMPOUND SORTKEY on event_date.
Mini Project: Snowflake Data Pipeline
# snowflake_pipeline.py
# Data pipeline for Snowflake using Python connector
import snowflake.connector
from datetime import date, timedelta
import json
# Connection configuration
conn = snowflake.connector.connect(
user='DODATECH_USER',
password='***',
account='dodatech.snowflakecomputing.com',
warehouse='loading_wh',
database='analytics',
schema='public',
)
try:
cursor = conn.cursor()
# Create table with clustering
cursor.execute("""
CREATE OR REPLACE TABLE daily_metrics (
metric_date DATE,
metric_name VARCHAR(100),
metric_value FLOAT,
source_system VARCHAR(50),
ingestion_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
CLUSTER BY (metric_date, metric_name)
""")
print("Table created with clustering")
# Insert sample data
today = date.today()
for i in range(7):
day = today - timedelta(days=i)
metrics = [
(day, 'page_views', 15000 + i * 100, 'web'),
(day, 'api_calls', 5000 - i * 50, 'api'),
(day, 'signups', 150 + i * 5, 'web'),
]
cursor.executemany(
"INSERT INTO daily_metrics (metric_date, metric_name, metric_value, source_system) "
"VALUES (%s, %s, %s, %s)",
metrics
)
print(f"Inserted {7 * 3} metric rows")
# Query with clustering
cursor.execute("""
SELECT metric_date, SUM(metric_value) as total
FROM daily_metrics
WHERE metric_date >= %s
AND metric_name IN ('page_views', 'api_calls')
GROUP BY metric_date
ORDER BY metric_date
""", (today - timedelta(days=7),))
print("\nDaily totals (last 7 days):")
for row in cursor.fetchall():
print(f" {row[0]}: {row[1]}")
finally:
cursor.close()
conn.close()Expected output:
Table created with clustering
Inserted 21 metric rows
Daily totals (last 7 days):
2026-06-14: 19550.0
2026-06-15: 19600.0
2026-06-16: 19650.0
2026-06-17: 19700.0
2026-06-18: 19750.0
2026-06-19: 19800.0
2026-06-20: 19850.0Related Concepts
FAQ
What’s Next
You now understand modern data warehousing on Snowflake, BigQuery, and Redshift. Next, explore data lakes and lakehouse architecture for combining warehouse capabilities with data lake flexibility.
- Practice daily — Sign up for a Snowflake trial and create a virtual warehouse with auto-suspend
- Build a project — Migrate a small MySQL dataset to BigQuery with partitioning and clustering
- Compare platforms — Run the same query on Snowflake (on-demand), BigQuery (on-demand), and calculate cost per query
Remember: every expert was once a beginner. Keep querying!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro