Learn Data Modeling Guide — Kimball, Inmon, Star Schema, and Slowly Changing Dimensions

Data Modeling Guide — Kimball, Inmon, Star Schema, and Slowly Changing Dimensions

DodaTech Updated Jun 15, 2026 10 min read

Data modeling is the process of designing how data is structured, stored, and related — defining fact and dimension tables, hierarchies, and relationships to enable efficient analytical querying.

What You’ll Learn

By the end of this tutorial, you’ll understand dimensional modeling (Kimball), Data Vault (Inmon), normalization vs denormalization, slowly changing dimensions (SCD Type 1, 2, 3), and how to design a customer dimension table.

Why Data Modeling Matters

A poorly modeled warehouse produces confusing dashboards, slow queries, and inconsistent metrics. A good data model makes analytics intuitive — sales by region, customer lifetime value, product performance — all just a few JOINs away. DodaTech’s analytics team uses dimensional modeling to power Doda Browser dashboards used by product managers daily.

Data Modeling Learning Path


flowchart LR
  A[Data Warehousing] --> B[Data Modeling]
  B --> C{You Are Here}
  C --> D[Kimball]
  C --> E[Inmon]
  C --> F[SCD Types]
  D --> G[Star Schema]
  E --> H[Data Vault]
  F --> I[Type 1]
  F --> J[Type 2]
  F --> K[Type 3]

Prerequisites: SQL proficiency, understanding of data warehousing and star schemas.

What Is Data Modeling?

Think of data modeling like organizing a warehouse. You decide which items go on which shelves, how they’re labeled, and which items are stored together. A well-organized warehouse makes it fast to find anything. A disorganized one means searching through every box.

In analytics: facts are boxes of numbers (sales amounts, quantities). Dimensions are the labels (product name, customer segment, date). The model defines how they relate.

Kimball vs Inmon

Kimball (Dimensional Modeling)

Ralph Kimball’s approach: design star schemas organized by business process. Each process (sales, inventory, support tickets) gets its own fact table surrounded by shared dimension tables.

Best for: Agile, iterative data warehouse development. Start with one business process, then add more.

Inmon (Corporate Information Factory)

Bill Inmon’s approach: start with a normalized enterprise data model (3NF), then build dimensional data marts on top.

Best for: Large enterprises requiring a single source of truth across all departments.

	Kimball	Inmon
Schema	Denormalized (star)	Normalized (3NF)
Development	Bottom-up (process by process)	Top-down (enterprise model first)
Speed to value	Faster (first mart in weeks)	Slower (months to model everything)
Flexibility	High — easy to add new processes	Lower — changes affect entire model
Popularity	Most common in modern data teams	Common in large enterprises

Star Schema Design

-- Star schema for sales analytics
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_id TEXT NOT NULL,
    full_name TEXT,
    email TEXT,
    city TEXT,
    state TEXT,
    segment TEXT,
    effective_date DATE,
    end_date DATE,
    is_current BOOLEAN DEFAULT TRUE
);

CREATE TABLE dim_product (
    product_sk INTEGER PRIMARY KEY,
    product_id TEXT,
    product_name TEXT,
    category TEXT,
    subcategory TEXT,
    unit_price DECIMAL(10,2)
);

CREATE TABLE dim_date (
    date_sk INTEGER PRIMARY KEY,
    full_date DATE,
    year INTEGER,
    quarter INTEGER,
    month INTEGER,
    day_of_week INTEGER,
    is_holiday BOOLEAN
);

CREATE TABLE fact_sales (
    sales_sk INTEGER PRIMARY KEY,
    customer_sk INTEGER REFERENCES dim_customer(customer_sk),
    product_sk INTEGER REFERENCES dim_product(product_sk),
    date_sk INTEGER REFERENCES dim_date(date_sk),
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    discount DECIMAL(5,2),
    total_amount DECIMAL(12,2)
);

Slowly Changing Dimensions (SCD)

A customer’s city changes. Without SCD handling, you lose historical context. SCD types define how to track changes in dimension attributes.

SCD Type 1 — Overwrite

Overwrite the old value with the new. No history preserved.

-- SCD Type 1: Update in place
UPDATE dim_customer
SET city = 'New York', state = 'NY'
WHERE customer_id = 'CUST001';

Before	After
Alice, Boston, MA	Alice, New York, NY
(No record of Boston)

Use when: Correcting errors (typo in city name), attributes that don’t need history.

SCD Type 2 — Add New Row

Add a new row with effective dates. Preserves full history.

-- SCD Type 2: Expire old, insert new
UPDATE dim_customer
SET end_date = '2026-06-15',
    is_current = FALSE
WHERE customer_id = 'CUST001'
  AND is_current = TRUE;

INSERT INTO dim_customer
(customer_sk, customer_id, full_name, city, state,
 effective_date, end_date, is_current)
VALUES
(1002, 'CUST001', 'Alice', 'New York', 'NY',
 '2026-06-15', NULL, TRUE);

customer_sk	customer_id	city	state	effective_date	end_date	is_current
1001	CUST001	Boston	MA	2020-01-01	2026-06-14	FALSE
1002	CUST001	New York	NY	2026-06-15	NULL	TRUE

Use when: You need historical reporting (revenue by customer city at time of sale).

SCD Type 3 — Add Previous Value Column

Store both current and previous values in the same row.

-- SCD Type 3: Add previous value columns
ALTER TABLE dim_customer
ADD COLUMN previous_city TEXT,
ADD COLUMN previous_state TEXT;

UPDATE dim_customer
SET previous_city = city,
    previous_state = state,
    city = 'New York',
    state = 'NY'
WHERE customer_id = 'CUST001'
  AND is_current = TRUE;

customer_id	city	state	previous_city	previous_state
CUST001	New York	NY	Boston	MA

Use when: Need to compare “current” vs “previous” (e.g., “customers who moved from Boston”).

Data Vault (Inmon/Data Vault 2.0)

Data Vault combines the best of Kimball and Inmon with three entity types:

Entity	Purpose	Example
Hub	Business keys (unique identifiers)	`hub_customer (customer_id)`
Satellite	Descriptive attributes (time-bound)	`sat_customer_detail (name, address, effective_date)`
Link	Relationships between hubs	`link_customer_order (customer_id, order_id)`

-- Data Vault example
CREATE TABLE hub_customer (
    customer_hk CHAR(32) PRIMARY KEY,
    customer_id TEXT NOT NULL,
    load_date TIMESTAMP,
    record_source TEXT
);

CREATE TABLE sat_customer_detail (
    customer_hk CHAR(32) REFERENCES hub_customer(customer_hk),
    hash_diff CHAR(32),
    full_name TEXT,
    email TEXT,
    city TEXT,
    state TEXT,
    effective_date TIMESTAMP,
    load_date TIMESTAMP,
    PRIMARY KEY (customer_hk, effective_date)
);

CREATE TABLE link_customer_order (
    link_hk CHAR(32) PRIMARY KEY,
    customer_hk CHAR(32) REFERENCES hub_customer(customer_hk),
    order_hk CHAR(32) REFERENCES hub_order(order_hk),
    load_date TIMESTAMP
);

Best for: Auditing, massive data volumes, agile warehousing.

Designing a Customer Dimension with SCD Type 2

# scd_type2_demo.py
# Implement SCD Type 2 for a customer dimension
from datetime import date, datetime
import json

class CustomerDim:
    def __init__(self):
        self.records = []
        self.next_sk = 1

    def get_current(self, customer_id):
        for r in self.records:
            if r["customer_id"] == customer_id and r["is_current"]:
                return r
        return None

    def upsert(self, customer_id, name, city, state, change_date):
        current = self.get_current(customer_id)

        if current and current["city"] == city and current["state"] == state:
            return

        # Expire current record
        if current:
            current["end_date"] = change_date
            current["is_current"] = False

        # Insert new record
        new_record = {
            "customer_sk": self.next_sk,
            "customer_id": customer_id,
            "full_name": name,
            "city": city,
            "state": state,
            "effective_date": change_date,
            "end_date": None,
            "is_current": True,
        }
        self.records.append(new_record)
        self.next_sk += 1
        print(f"[SCD2] {customer_id}: {'updated' if current else 'created'} — {city}, {state}")

    def as_of(self, query_date):
        """Return records as they were on a specific date."""
        result = []
        for r in self.records:
            eff = r["effective_date"]
            end = r["end_date"] or date(9999, 12, 31)
            if eff <= query_date <= end:
                result.append(r)
        return result

dim = CustomerDim()

# Initial load
dim.upsert("C001", "Alice", "Boston", "MA", date(2020, 1, 1))
dim.upsert("C002", "Bob", "Chicago", "IL", date(2020, 1, 1))
dim.upsert("C003", "Charlie", "Seattle", "WA", date(2020, 6, 1))

# Alice moves to New York
dim.upsert("C001", "Alice", "New York", "NY", date(2023, 3, 15))

# Bob moves to San Francisco
dim.upsert("C002", "Bob", "San Francisco", "CA", date(2024, 7, 1))

# Query history
print("\n=== Alice's History ===")
for r in dim.records:
    if r["customer_id"] == "C001":
        print(f"  {r['effective_date']} to {r['end_date'] or 'now'}: {r['city']}, {r['state']}")

print(f"\n=== Customers as of 2022-01-01 ===")
for r in dim.as_of(date(2022, 1, 1)):
    print(f"  {r['full_name']:<10} {r['city']:<15} {r['state']}")

print(f"\n=== Current Customers ===")
for r in dim.records:
    if r["is_current"]:
        print(f"  {r['full_name']:<10} {r['city']:<15} {r['state']}")

Expected output:

[SCD2] C001: created — Boston, MA
[SCD2] C002: created — Chicago, IL
[SCD2] C003: created — Seattle, WA
[SCD2] C001: updated — New York, NY
[SCD2] C002: updated — San Francisco, CA

=== Alice's History ===
  2020-01-01 to 2023-03-14: Boston, MA
  2023-03-15 to now: New York, NY

=== Customers as of 2022-01-01 ===
  Alice      Boston           MA
  Bob        Chicago          IL
  Charlie    Seattle          WA

=== Current Customers ===
  Alice      New York         NY
  Bob        San Francisco    CA
  Charlie    Seattle          WA

Normalization vs Denormalization

	Normalized (3NF)	Denormalized
Storage	Less redundant	More redundant
Write speed	Faster (single place to update)	Slower (multiple rows)
Read speed	Slower (more JOINs)	Faster (fewer JOINs)
Data integrity	Higher (single source of truth)	Lower (update anomalies)
Use case	OLTP (transactional)	OLAP (analytical)

Common Data Modeling Mistakes

1. Over-Normalizing Analytical Models

OLTP needs normalization (3NF) to avoid update anomalies. OLAP prefers star schemas — fewer JOINs, faster queries, easier for analysts.

2. Ignoring SCD Strategy Until Too Late

Your customer dimension has 10M rows and your analyst asks “show me revenue by customer city at time of purchase.” Without SCD Type 2, you can’t answer this accurately.

3. Natural Keys as Primary Keys

Using customer_id (a natural key) instead of a surrogate key (customer_sk) breaks when the source system reuses IDs or changes formats.

4. No Conformed Dimensions

A “date” dimension should be shared across ALL fact tables — sales date, shipment date, return date. Different date dimensions with different attributes create confusion.

5. Technology Over Methodology

Jumping into Data Vault because it’s trendy, without understanding whether your use case needs auditability and agile modeling. Kimball suffices for most teams.

Practice Questions

1. What is the difference between Kimball and Inmon data modeling?

Kimball uses bottom-up dimensional modeling (star schemas per business process, quick to build). Inmon uses top-down normalized enterprise model, then dimensional marts. Kimball dominates modern data teams.

2. What are slowly changing dimensions and why do they matter?

SCDs track changes to dimension attributes over time. Type 1 overwrites (no history), Type 2 adds rows (full history), Type 3 stores previous value. They matter for accurate historical reporting.

3. What is a surrogate key and why use it?

A surrogate key is an artificial unique identifier (auto-increment INTEGER) separate from the business key. It protects against source system changes, supports SCD Type 2, and improves join performance.

4. What are conformed dimensions?

Dimensions shared across multiple fact tables with the same structure and content. Example: the same dim_date used in sales, inventory, and support ticket fact tables.

5. Challenge: Model a subscription analytics schema with plans, customers, payments, and plan changes. Design SCD handling for plan changes.

Fact: fct_payments (payment_sk, customer_sk, plan_sk, date_sk, amount). Dim: dim_customer (SCD Type 2 for plan tier), dim_plan (static plan details), dim_date. Track plan changes via SCD Type 2 on customer dimension with plan_tier attribute.

Mini Project: SCD Type 2 Loader

# scd_loader.py
# Generic SCD Type 2 loader for any dimension
from datetime import date
import hashlib

class SCD2Loader:
    def __init__(self, dim_name, business_key):
        self.dim_name = dim_name
        self.business_key = business_key
        self.current_state = {}

    def hash_record(self, record):
        """Hash all attributes except business key for change detection."""
        relevant = {k: v for k, v in sorted(record.items()) if k != self.business_key}
        return hashlib.md5(str(relevant).encode()).hexdigest()[:16]

    def process(self, records, change_date):
        changes = 0
        for record in records:
            bk = record[self.business_key]
            new_hash = self.hash_record(record)

            if bk in self.current_state:
                if self.current_state[bk]["hash"] == new_hash:
                    continue

            old_record = self.current_state.get(bk)
            if old_record:
                print(f"  [{self.dim_name}] {bk}: changed {old_record['record']} → {record}")

            self.current_state[bk] = {
                "hash": new_hash,
                "record": record,
                "effective_date": change_date,
            }
            changes += 1

        return changes

loader = SCD2Loader("dim_customer", "customer_id")

print("=== Day 1: Initial Load ===")
day1 = [
    {"customer_id": "C001", "name": "Alice", "tier": "basic"},
    {"customer_id": "C002", "name": "Bob", "tier": "premium"},
]
changes = loader.process(day1, date(2026, 1, 1))
print(f"Changes: {changes}")

print("\n=== Day 90: Alice upgrades ===")
day2 = [
    {"customer_id": "C001", "name": "Alice", "tier": "premium"},
    {"customer_id": "C002", "name": "Bob", "tier": "premium"},
]
changes = loader.process(day2, date(2026, 4, 1))
print(f"Changes: {changes}")

print("\n=== Day 180: Bob downgrades ===")
day3 = [
    {"customer_id": "C001", "name": "Alice", "tier": "premium"},
    {"customer_id": "C002", "name": "Bob", "tier": "basic"},
]
changes = loader.process(day3, date(2026, 7, 1))
print(f"Changes: {changes}")

print(f"\nFinal state: {len(loader.current_state)} customers")

Expected output:

=== Day 1: Initial Load ===
  [dim_customer] C001: created
  [dim_customer] C002: created
Changes: 2

=== Day 90: Alice upgrades ===
  [dim_customer] C001: changed {'name': 'Alice', 'tier': 'basic'} → {'name': 'Alice', 'tier': 'premium'}
Changes: 1

=== Day 180: Bob downgrades ===
  [dim_customer] C002: changed {'name': 'Bob', 'tier': 'premium'} → {'name': 'Bob', 'tier': 'basic'}
Changes: 1

Final state: 2 customers

Related Concepts

Data Warehousing

Building Data Pipelines

dbt (Data Build Tool)

What’s Next

You now understand data modeling fundamentals! Apply these concepts with dbt to build and test models, and explore Apache Spark for processing large-scale dimension tables.

Practice daily — Design a star schema for a domain you use (Spotify plays, Uber rides)
Build a project — Implement SCD Type 2 for a customer dimension in your local database
Explore related topics — Check out Data Vault 2.0 for enterprise data modeling

Remember: every expert was once a beginner. Keep coding!

Previous Building Data Pipelines — End-to-End Design, Monitoring, and Orchestration Next Data Lakehouse Architecture — Delta Lake, Iceberg, and Hudi Explained

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering