Learn Modern Data Warehousing — Snowflake, BigQuery, Redshift Guide

Modern Data Warehousing — Snowflake, BigQuery, Redshift Guide

DodaTech Updated Jun 20, 2026 11 min read

Modern data warehousing decouples storage from compute, auto-scales infrastructure, and provides SQL analytics on petabyte-scale data without managing servers.

What You’ll Learn

By the end of this tutorial, you’ll understand how Snowflake, BigQuery, and Redshift work under the hood, how to design tables for each platform, and how to choose the right warehouse for your workload.

Why Modern Warehousing Matters

Traditional on-premise data warehouses required capacity planning, hardware procurement, and DBA teams. Modern cloud warehouses let you store petabytes and pay per query. Snowflake auto-suspends when idle, BigQuery processes terabytes in seconds, and Redshift Spectrum queries data lakes directly. Doda Browser uses Snowflake for its analytics platform, reducing query time by 80% and cutting storage costs by 60% compared to its legacy warehouse.

Data Warehousing Learning Path


flowchart LR
  A[Big Data Overview] --> B[Data Warehousing]
  B --> C{You Are Here}
  C --> D[Snowflake]
  C --> E[BigQuery]
  C --> F[Redshift]
  D --> G[Storage/Compute]
  E --> H[Slots & Partitioning]
  F --> I[Distribution & Sort]

Prerequisites: SQL proficiency, Cloud Computing basics, data warehousing concepts.

What Is a Modern Data Warehouse?

Think of the old warehouse as a single giant refrigerator: cooling and food are in one box. If you want more capacity, you buy a bigger fridge (vertical scaling). A modern warehouse is like a restaurant kitchen with a walk-in cooler and separate cooking stations: storage (the cooler) scales independently from compute (the stoves). You can double the stoves without buying more cooler space.

The three major cloud warehouses implement this differently:

Platform	Storage	Compute	Pricing Model
Snowflake	Compressed columnar (S3)	Virtual warehouses	Compute credits + storage
BigQuery	Colossus (Google FS)	Slot pool	Per-slot or on-demand
Redshift	RA3 managed storage	Node-based clusters	Node-hours + storage

Snowflake Deep Dive

Snowflake’s architecture separates storage (compressed, columnar, in S3) from compute (virtual warehouses). You can have multiple virtual warehouses querying the same data without contention.

Virtual Warehouses

-- Create a virtual warehouse
CREATE WAREHOUSE analytics_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM'    -- XSMALL, SMALL, MEDIUM, LARGE, XLARGE...
    AUTO_SUSPEND = 300                -- Suspend after 5 min idle
    AUTO_RESUME = TRUE                -- Resume automatically on query
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3
    SCALING_POLICY = 'ECONOMY';       -- 'ECONOMY' or 'STANDARD'

-- Create tables
CREATE OR REPLACE TABLE orders (
    order_id INTEGER AUTOINCREMENT,
    customer_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    status VARCHAR(20)
)
CLUSTER BY (order_date);

INSERT INTO orders (customer_id, amount, order_date, status) VALUES
    (1001, 250.00, '2026-06-01', 'completed'),
    (1002, 45.50, '2026-06-01', 'pending'),
    (1003, 1200.00, '2026-06-02', 'completed');

-- Query using the warehouse
USE WAREHOUSE analytics_wh;
SELECT status, COUNT(*), SUM(amount)
FROM orders
WHERE order_date >= '2026-06-01'
GROUP BY status;

Expected output:

+-----------+----------+-------------+
| STATUS    | COUNT(*) | SUM(AMOUNT) |
+-----------+----------+-------------+
| completed |        2 |     1450.00 |
| pending   |        1 |       45.50 |
+-----------+----------+-------------+

Clustering and Auto-Suspend

-- Snowflake clustering for large tables
ALTER TABLE orders CLUSTER BY (order_date);
-- Recluster manually
ALTER TABLE orders RECLUSTER;
-- Check clustering state
SELECT SYSTEM$CLUSTERING_INFORMATION('orders');

-- Auto-suspend: warehouse suspends after inactivity
-- SHOW WAREHOUSES will show "Suspended" state
SHOW WAREHOUSES LIKE 'analytics_wh';

-- Resume automatically on first query
SELECT COUNT(*) FROM orders;  -- Warehouse auto-resumes

Time Travel and Zero-Copy Cloning

-- Time travel: query data as it existed 1 hour ago
SELECT * FROM orders AT (TIMESTAMP => DATEADD(hour, -1, CURRENT_TIMESTAMP()));

-- Undrop a dropped table
DROP TABLE orders;
UNDROP TABLE orders;

-- Zero-copy clone (instant, no additional storage)
CREATE OR REPLACE TABLE orders_dev CLONE orders;

BigQuery Deep Dive

BigQuery uses Google’s Colossus file system for storage and a slot-based compute model. Slots are units of computational capacity (one slot = one thread).

Partitioned and Clustered Tables

-- Partitioned table (by date)
CREATE OR REPLACE TABLE `project.dataset.orders`
PARTITION BY order_date
CLUSTER BY customer_id
AS
SELECT * FROM `project.dataset.raw_orders`;

-- Query against partitioned table (prunes partitions)
SELECT customer_id, SUM(amount) as total_spent
FROM `project.dataset.orders`
WHERE order_date BETWEEN '2026-06-01' AND '2026-06-30'
GROUP BY customer_id;

-- Information schema for partition metadata
SELECT table_name, partition_id, total_rows, total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`;

Slots and On-Demand vs Flat-Rate

-- On-demand: unlimited slots, pay per byte processed
-- This query costs ~$0.005 per GB processed
SELECT COUNT(DISTINCT customer_id) FROM orders;

-- With clustering, BQ processes fewer bytes
SELECT COUNT(*) FROM orders WHERE order_date = '2026-06-01';
-- Clustered: scans only the partition + cluster block
-- Unclustered: scans entire table

-- Monitor slot usage
SELECT * FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE DATE(creation_time) = CURRENT_DATE();

BigQuery Optimization Example

-- BAD: Full table scan
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY day;
-- Processes all partitions from 2025

-- GOOD: Pruned with partitioning
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
WHERE order_date BETWEEN '2026-06-01' AND '2026-06-30'
GROUP BY day;
-- Processes only June 2026 partition

-- MATERIALIZED VIEW for common aggregations
CREATE MATERIALIZED VIEW `project.dataset.daily_revenue` AS
SELECT order_date, SUM(amount) as revenue, COUNT(*) as orders_count
FROM orders
GROUP BY order_date;

Redshift Deep Dive

Amazon Redshift uses a shared-nothing MPP architecture. Data is distributed across nodes, and queries run in parallel.

Distribution Styles

-- DISTSTYLE EVEN: rows distributed round-robin
-- Best for: staging tables, no joins
CREATE TABLE orders_even (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL(10,2)
) DISTSTYLE EVEN;

-- DISTSTYLE KEY: rows distributed by hash of key column
-- Best for: fact tables joined on the distribution key
CREATE TABLE orders_key (
    order_id INTEGER,
    customer_id INTEGER DISTKEY,  -- Distribute by customer
    amount DECIMAL(10,2)
) DISTSTYLE KEY;

-- DISTSTYLE ALL: full copy on every node
-- Best for: small dimension tables (< 1M rows)
CREATE TABLE customers (
    customer_id INTEGER,
    name VARCHAR(100),
    city VARCHAR(50)
) DISTSTYLE ALL;

Sort Keys

-- COMPOUND sort key: most selective column first
-- Best for: queries that filter on prefix of sort keys
CREATE TABLE orders_compound (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    status VARCHAR(20)
)
DISTSTYLE KEY DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, status);

-- INTERLEAVED sort key: equal weight to all columns
-- Best for: queries filtering on multiple columns equally
CREATE TABLE orders_interleaved (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    status VARCHAR(20)
)
DISTSTYLE KEY DISTKEY (customer_id)
INTERLEAVED SORTKEY (order_date, status, customer_id);

Redshift Spectrum

Query data in S3 without loading it into Redshift:

-- Create external schema for S3 data
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftSpectrumRole';

-- Query S3 Parquet data directly
SELECT s3.customer_id, SUM(r.amount) as total
FROM spectrum_schema.s3_orders s3
JOIN redshift_orders r ON s3.customer_id = r.customer_id
WHERE s3.order_date >= '2026-01-01'
GROUP BY s3.customer_id;

Platform Comparison

Feature	Snowflake	BigQuery	Redshift
Storage/Compute	Fully separated	Fully separated	Partially (RA3)
Scaling	Multi-cluster warehouses	Slot pool scaling	Node resize
Auto-suspend	Yes (configurable)	N/A (serverless)	No (always on)
Concurrency	Unlimited (warehouses)	Unlimited (slots)	Limited (node queues)
Data lake query	External tables	External tables	Spectrum
Pricing	Credit-based	Per-byte or slots	Node-hour reserved
SQL compatibility	ANSI SQL + extensions	ANSI SQL + extensions	PostgreSQL-based
Best for	Multi-cloud, variable workloads	Ad-hoc analytics, GCP	Fixed workloads, AWS

Modern Data Warehouse Architecture


flowchart TB
  subgraph "Source Data"
    S3[(S3 Data Lake)]
    API[REST APIs]
    ST[Streaming Kafka]
  end
  subgraph "Modern Warehouse Layer"
    subgraph "Snowflake"
      SW[Virtual Warehouse]
      SS[(Cloud Storage)]
      SC[Cloud Services]
    end
    subgraph "BigQuery"
      BS[Slot Pool]
      BC[(Colossus)]
      BI[BigLake Engine]
    end
    subgraph "Redshift"
      RN[Compute Nodes]
      RS[(RA3 Managed Storage)]
      RSP[Spectrum Layer]
    end
  end
  subgraph "BI & Analytics"
    BI_T[Tableau / Power BI]
    NOTE[Notebooks / Python]
    APPS[Custom Apps]
  end
  S3 --> SS
  S3 --> BC
  S3 --> RS
  API --> SW
  ST --> BS
  ST --> RSP
  SS --> SW
  BC --> BS
  RS --> RN
  RSP --> RN
  SW --> BI_T
  BS --> NOTE
  RN --> APPS

Common Modern Warehousing Mistakes

1. Not Setting Auto-Suspend in Snowflake

A warehouse left running 24/7 costs 24 hours of credits even with no queries. Set AUTO_SUSPEND = 300 and AUTO_RESUME = TRUE. For development warehouses, use smaller sizes (XSMALL) and suspend aggressively (60 seconds).

2. Ignoring Partition Pruning in BigQuery

Queries without partition filters scan the entire table. A 1TB table costs $5 per query on-demand. With partition pruning on a daily partitioned table, a single-day query scans ~3GB and costs $0.015 — a 300x savings.

3. Wrong Distribution Style in Redshift

Using DISTSTYLE EVEN on a fact table that joins with DISTSTYLE ALL dimensions causes broadcast joins that copy data to every node. Use DISTSTYLE KEY on the joining column to collocate joins.

4. Not Using Clustering on Large Snowflake Tables

Snowflake micro-partitions automatically cluster, but tables over 1TB benefit from explicit CLUSTER BY on high-cardinality filter columns. Without clustering, queries scan all micro-partitions instead of pruning.

5. Over-Provisioning Redshift Nodes

Redshift charges by node-hour regardless of usage. An 8-node cluster costs 8x a single node. Start with 1-2 nodes and scale up as needed. Use STL_QUERY to monitor if nodes are underutilized.

6. Mixing Storage and Compute in Snowflake

One virtual warehouse for all workloads (ETL, reporting, ad-hoc) causes contention. ETL jobs consume credits and slow down dashboard queries. Create separate warehouses: loading_wh for ETL, analytics_wh for dashboards, dev_wh for development.

7. Not Compressing Data Before Loading

Snowflake and Redshift compress columnar data automatically, but CSV ingestion skips compression benefits. Convert CSVs to Parquet or Snowflake’s native table format before loading. Parquet reduces storage 2-5x over CSV.

Practice Questions

1. How does Snowflake separate storage and compute?

Snowflake stores all data in compressed columnar format in cloud object storage (S3, Azure Blob). Compute is handled by virtual warehouses — independent clusters that can be created, resized, and suspended without affecting stored data.

2. What are BigQuery slots and how does flat-rate pricing work?

Slots are units of compute capacity (one slot = one thread). On-demand pricing has no slot limit but charges per byte processed. Flat-rate pricing reserves a fixed number of slots (100, 500, 1000+) for a predictable monthly cost.

3. When would you use Redshift Spectrum instead of loading data?

Spectrum queries data in S3 without loading into Redshift tables. Use it for archival data queried infrequently, data lake exploration, or when loading would exceed Redshift storage capacity.

4. What is Snowflake clustering and when should you use it?

Clustering reorganizes data within micro-partitions based on specified columns. Use it on tables over 1TB with frequent queries filtering on those columns. Small tables (< 1GB) don’t benefit from clustering.

5. Challenge: Design a data warehouse architecture for a SaaS company with 5TB of daily event data, real-time dashboards, and ad-hoc analytics. Compare Snowflake, BigQuery, and Redshift approaches.

Use Snowflake: one LARGE multi-cluster warehouse for real-time ingestion, one MEDIUM warehouse for dashboards (auto-suspend 5 min), and one XSMALL warehouse for ad-hoc queries. Cluster event tables by event_date. For BigQuery: use partitioned + clustered tables, flat-rate 500 slots for predictable pricing. For Redshift: use RA3 nodes with DISTKEY on customer_id and COMPOUND SORTKEY on event_date.

Mini Project: Snowflake Data Pipeline

# snowflake_pipeline.py
# Data pipeline for Snowflake using Python connector
import snowflake.connector
from datetime import date, timedelta
import json

# Connection configuration
conn = snowflake.connector.connect(
    user='DODATECH_USER',
    password='***',
    account='dodatech.snowflakecomputing.com',
    warehouse='loading_wh',
    database='analytics',
    schema='public',
)

try:
    cursor = conn.cursor()
    
    # Create table with clustering
    cursor.execute("""
        CREATE OR REPLACE TABLE daily_metrics (
            metric_date DATE,
            metric_name VARCHAR(100),
            metric_value FLOAT,
            source_system VARCHAR(50),
            ingestion_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
        )
        CLUSTER BY (metric_date, metric_name)
    """)
    print("Table created with clustering")
    
    # Insert sample data
    today = date.today()
    for i in range(7):
        day = today - timedelta(days=i)
        metrics = [
            (day, 'page_views', 15000 + i * 100, 'web'),
            (day, 'api_calls', 5000 - i * 50, 'api'),
            (day, 'signups', 150 + i * 5, 'web'),
        ]
        cursor.executemany(
            "INSERT INTO daily_metrics (metric_date, metric_name, metric_value, source_system) "
            "VALUES (%s, %s, %s, %s)",
            metrics
        )
    
    print(f"Inserted {7 * 3} metric rows")
    
    # Query with clustering
    cursor.execute("""
        SELECT metric_date, SUM(metric_value) as total
        FROM daily_metrics
        WHERE metric_date >= %s
            AND metric_name IN ('page_views', 'api_calls')
        GROUP BY metric_date
        ORDER BY metric_date
    """, (today - timedelta(days=7),))
    
    print("\nDaily totals (last 7 days):")
    for row in cursor.fetchall():
        print(f"  {row[0]}: {row[1]}")
    
finally:
    cursor.close()
    conn.close()

Expected output:

Table created with clustering
Inserted 21 metric rows

Daily totals (last 7 days):
  2026-06-14: 19550.0
  2026-06-15: 19600.0
  2026-06-16: 19650.0
  2026-06-17: 19700.0
  2026-06-18: 19750.0
  2026-06-19: 19800.0
  2026-06-20: 19850.0

Related Concepts

Big Data Overview

Data Lakehouse Architecture

Apache Spark Guide

Data Warehousing Guide

FAQ

Can I use Snowflake without managing virtual warehouses?

Yes, with Snowflake’s serverless compute options like Serverless Tasks and automatic clustering. You can set AUTO_SUSPEND = 60 and AUTO_RESUME = TRUE to make warehouses feel serverless. BigQuery is fully serverless — no warehouse management needed.

How does BigQuery slot reservation work with partitioning?

Slot reservation is independent of partitioning. Partitioning reduces data scanned (and thus cost on-demand), but slot reservation gives fixed compute capacity. With flat-rate slots, partition pruning still reduces bytes processed, freeing slots for other queries.

Is Redshift Spectrum slower than querying Redshift tables?

Spectrum queries are slower for repeated access because data stays in S3 without indexes or sort keys. For one-time or infrequent queries, Spectrum is cost-effective. For frequent dashboards, load data into Redshift tables with appropriate sort keys.

What is Snowflake’s multi-cluster warehouse and when should I use it?

Multi-cluster warehouses automatically add compute clusters when queuing occurs. Use them for variable concurrency workloads (e.g., a dashboard that gets 100 users at 9 AM and 10 users at 2 PM). Set MAX_CLUSTER_COUNT = 3-10 depending on peak concurrency.

What’s Next

You now understand modern data warehousing on Snowflake, BigQuery, and Redshift. Next, explore data lakes and lakehouse architecture for combining warehouse capabilities with data lake flexibility.

Practice daily — Sign up for a Snowflake trial and create a virtual warehouse with auto-suspend
Build a project — Migrate a small MySQL dataset to BigQuery with partitioning and clustering
Compare platforms — Run the same query on Snowflake (on-demand), BigQuery (on-demand), and calculate cost per query

Remember: every expert was once a beginner. Keep querying!

Previous Data Lake Architecture — Medallion, Delta Lake, Iceberg & Hudi Next Stream Processing Deep Dive — Event Time, Watermarks & Exactly-Once

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Big Data & Analytics