Learn Advanced Data Modeling — Kimball vs Inmon, SCD Types, Fact Tables

Advanced Data Modeling — Kimball vs Inmon, SCD Types, Fact Tables

DodaTech Updated Jun 20, 2026 13 min read

Advanced data modeling covers the methodologies and patterns used to design scalable, maintainable data warehouses — from dimensional modeling choices to handling slowly changing dimensions and complex fact tables.

What You’ll Learn

By the end of this tutorial, you’ll understand Kimball vs Inmon methodologies, implement SCD Types 1/2/3, design fact tables (periodic snapshot, accumulating snapshot, transactional), and know when to use bridge tables, conformed dimensions, and junk dimensions.

Why Advanced Data Modeling Matters

Poor data modeling leads to slow queries, confusing schemas, and brittle pipelines. A retail company using a single flat table for all sales data ends up with 200-column tables, duplicate data, and queries that take 30 minutes. Proper dimensional modeling reduces query time by 90% and makes the schema understandable to business users. Doda Browser uses Kimball-style star schemas for its analytics platform, processing 10M+ events daily with sub-second query latency.

Data Modeling Learning Path


flowchart LR
  A[Data Modeling Basics] --> B[Advanced Data Modeling]
  B --> C{You Are Here}
  C --> D[Kimball vs Inmon]
  C --> E[SCD Types]
  C --> F[Fact Tables]
  D --> G[Dimensional Modeling]
  E --> H[Type 1/2/3]
  F --> I[Snapshot/Agg/Transaction]

Prerequisites: data modeling basics, SQL proficiency, understanding of data warehousing concepts.

Kimball vs Inmon Methodologies

Kimball (Bottom-Up)

Ralph Kimball’s approach builds the warehouse one business process at a time using star schemas (fact + dimension tables). Start with the most important process (e.g., sales), model it as a star schema, and add more processes incrementally.

Inmon (Top-Down)

Bill Inmon’s approach builds a normalized Enterprise Data Warehouse (EDW) first (3NF), then creates data marts for specific departments. Everything goes through the EDW first.

Aspect	Kimball	Inmon
Approach	Bottom-up (process by process)	Top-down (enterprise first)
Schema	Dimensional (star)	Normalized (3NF)
Development time	Fast initial delivery	Slow initial delivery
Redundancy	Acceptable (denormalized)	Minimal (normalized)
Business user friendly	Yes (intuitive)	No (complex joins)
Best for	Agile teams, mid-size warehouses	Large enterprises, regulatory environments

-- Kimball-style star schema
CREATE TABLE fact_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    store_key INT REFERENCES dim_store(store_key),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(50),
    product_name VARCHAR(200),
    category VARCHAR(100),
    brand VARCHAR(100),
    current_price DECIMAL(10,2)
);

-- Query: sales by category
SELECT p.category, SUM(f.quantity * f.unit_price) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2026
GROUP BY p.category;

-- Inmon-style 3NF model
CREATE TABLE order_header (
    order_id BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES customer(customer_id),
    order_date DATE NOT NULL,
    store_id INT REFERENCES store(store_id),
    total_amount DECIMAL(12,2),
    status VARCHAR(20)
);

CREATE TABLE order_line_item (
    line_item_id BIGINT PRIMARY KEY,
    order_id BIGINT REFERENCES order_header(order_id),
    product_id BIGINT REFERENCES product(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- Query: sales by category (3NF — more joins)
SELECT p.category, SUM(li.quantity * li.unit_price) AS revenue
FROM order_line_item li
JOIN order_header oh ON li.order_id = oh.order_id
JOIN product p ON li.product_id = p.product_id
WHERE EXTRACT(YEAR FROM oh.order_date) = 2026
GROUP BY p.category;

Slowly Changing Dimensions (SCD)

Dimensions change over time. A customer moves to a new city, a product changes price, an employee gets promoted. SCD types define how to handle these changes.

SCD Type 1: Overwrite

Overwrite the old value with the new. No history preserved.

-- SCD Type 1: update in place
-- Customer "Alice" changes email
UPDATE dim_customer
SET email = 'alice.new@email.com',
    last_updated = CURRENT_DATE
WHERE customer_id = 101;

-- Old email is lost after this update
SELECT * FROM dim_customer WHERE customer_id = 101;
-- Result: shows only the new email

customer_id	name	email	city
101	Alice	alice.new@email.com	NY

Use when: Changes are corrections or non-critical attributes. No audit trail needed.

SCD Type 2: Add New Row

Preserve history by adding a new row with effective dates.

-- SCD Type 2: new row for changed attribute
-- Before change
INSERT INTO dim_customer_scd2 
    (customer_id, name, email, city, effective_date, end_date, is_current)
VALUES 
    (101, 'Alice', 'alice@email.com', 'New York', '2025-01-01', '2026-06-15', FALSE);

-- After address change
INSERT INTO dim_customer_scd2 
    (customer_id, name, email, city, effective_date, end_date, is_current)
VALUES 
    (101, 'Alice', 'alice@email.com', 'Los Angeles', '2026-06-16', '9999-12-31', TRUE);

-- NOW the old address is expired, new address is current

customer_id	name	email	city	effective_date	end_date	is_current
101	Alice	alice@email.com	New York	2025-01-01	2026-06-15	FALSE
101	Alice	alice@email.com	Los Angeles	2026-06-16	9999-12-31	TRUE

Use when: You need historical reporting (e.g., “Which city did customers live in when they made a purchase?”).

SCD Type 3: Add Previous Value Column

Keep both old and new values in the same row using additional columns.

-- SCD Type 3: add previous value column
ALTER TABLE dim_customer_scd3 ADD COLUMN previous_city VARCHAR(100);
ALTER TABLE dim_customer_scd3 ADD COLUMN city_changed_date DATE;

-- When customer changes city
UPDATE dim_customer_scd3
SET previous_city = city,
    city = 'Los Angeles',
    city_changed_date = '2026-06-16'
WHERE customer_id = 101;

customer_id	name	city	previous_city	city_changed_date
101	Alice	Los Angeles	New York	2026-06-16

Use when: You need to track at most one previous value per attribute. Common for product price changes.

SCD Comparison

Type	History	Storage	Query complexity	Use case
1	None	Minimal	Simple	Bug fixes, non-important attributes
2	Full	High	Moderate	Audit trails, regulatory compliance
3	One previous	Low	Simple	“What was the previous value?”

Fact Tables

Fact tables store measures (numerics that can be aggregated) and foreign keys to dimensions.

Transaction Fact Table

One row per event. The most common type.

CREATE TABLE fact_transaction_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount DECIMAL(10,2) DEFAULT 0
);

-- Row per individual sale
INSERT INTO fact_transaction_sales VALUES
(1, 20260601, 101, 1001, 2, 25.00, 0),
(2, 20260601, 102, 1002, 1, 50.00, 5.00),
(3, 20260602, 101, 1001, 3, 25.00, 2.50);

Periodic Snapshot Fact Table

One row per period. Captures state at regular intervals.

CREATE TABLE fact_monthly_inventory (
    month_key INT REFERENCES dim_month(month_key),
    product_key INT REFERENCES dim_product(product_key),
    warehouse_key INT REFERENCES dim_warehouse(warehouse_key),
    quantity_on_hand INT NOT NULL,
    quantity_sold INT NOT NULL,
    PRIMARY KEY (month_key, product_key, warehouse_key)
);

-- Monthly snapshot: what was in stock at end of each month
INSERT INTO fact_monthly_inventory VALUES
(202606, 101, 1, 500, 250),
(202606, 102, 1, 100, 80),
(202607, 101, 1, 400, 200);

Accumulating Snapshot Fact Table

One row per process (not time period). Updated as the process progresses.

CREATE TABLE fact_order_fulfillment (
    order_id BIGINT PRIMARY KEY,
    order_date_key INT REFERENCES dim_date(date_key),
    payment_date_key INT REFERENCES dim_date(date_key),
    ship_date_key INT REFERENCES dim_date(date_key),
    delivery_date_key INT REFERENCES dim_date(date_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    order_amount DECIMAL(10,2),
    days_to_payment INT,
    days_to_delivery INT
);

-- Single row per order, updated as it progresses
INSERT INTO fact_order_fulfillMENT 
    (order_id, order_date_key, customer_key, order_amount)
VALUES (5001, 20260601, 1001, 250.00);

-- When payment received
UPDATE fact_order_fulfillMENT
SET payment_date_key = 20260602,
    days_to_payment = 1
WHERE order_id = 5001;

-- When shipped
UPDATE fact_order_fulfillMENT
SET ship_date_key = 20260603
WHERE order_id = 5001;

Conformed Dimensions

A conformed dimension is a dimension that means the same thing across multiple fact tables. The dim_date table works for sales, inventory, and HR facts — it’s conformed.

-- Conformed date dimension used by all fact tables
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE NOT NULL,
    year INT NOT NULL,
    quarter INT NOT NULL,
    month INT NOT NULL,
    month_name VARCHAR(20),
    week INT NOT NULL,
    day_of_week INT NOT NULL,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);

-- One row per date — shared across ALL fact tables
INSERT INTO dim_date VALUES
(20260601, '2026-06-01', 2026, 2, 6, 'June', 23, 1, FALSE, FALSE),
(20260602, '2026-06-02', 2026, 2, 6, 'June', 23, 2, FALSE, FALSE);

Without conformed dimensions: Sales fact uses sale_date (VARCHAR), inventory fact uses inv_month (INT) — joining across processes is impossible or error-prone.

With conformed dimensions: Every fact joins to dim_date.date_key. You can compare sales vs inventory trends in a single query.

Bridge Tables and Junk Dimensions

Bridge Tables

Solve many-to-many relationships between facts and dimensions.

-- Bridge table: an order can have multiple promotions
CREATE TABLE bridge_order_promotion (
    order_id BIGINT REFERENCES fact_order(order_id),
    promotion_key INT REFERENCES dim_promotion(promotion_key),
    discount_amount DECIMAL(10,2),
    PRIMARY KEY (order_id, promotion_key)
);

-- An order used both "Summer Sale" and "Loyalty Bonus"
INSERT INTO bridge_order_promotion VALUES
(5001, 1, 25.00),
(5001, 2, 10.00);

Junk Dimensions

A junk dimension groups low-cardinality flags and indicators into a single dimension table instead of adding 10 tiny dimensions.

-- Instead of 10 separate dimensions for flags...
CREATE TABLE dim_junk_order (
    junk_key INT PRIMARY KEY,
    is_new_customer BOOLEAN,
    is_verified BOOLEAN,
    is_expedited_shipping BOOLEAN,
    payment_method VARCHAR(20),
    order_source VARCHAR(20)
);

-- Pre-generate all combinations
INSERT INTO dim_junk_order VALUES
(1, TRUE, TRUE, FALSE, 'credit_card', 'web'),
(2, TRUE, TRUE, TRUE, 'paypal', 'mobile'),
(3, FALSE, TRUE, FALSE, 'credit_card', 'web'),
(4, FALSE, FALSE, FALSE, 'debit', 'in_store');

-- Fact table just references a single junk_key
ALTER TABLE fact_order ADD COLUMN junk_key INT REFERENCES dim_junk_order(junk_key);

Advanced Data Modeling Architecture


flowchart TB
  subgraph "Source Systems"
    OLT1[(OLTP
Orders DB)]
    OLT2[(OLTP
Inventory)]
    CRM[(CRM System)]
  end
  subgraph "Staging"
    STG1[stg_orders]
    STG2[stg_products]
    STG3[stg_customers]
  end
  subgraph "Dimensional Model"
    subgraph "Dimensions"
      DP[dim_product
SCD Type 2]
      DC[dim_customer
SCD Type 2]
      DD[dim_date
Conformed]
      DJ[dim_junk]
    end
    subgraph "Fact Tables"
      FT[fact_transaction
Sales]
      FS[fact_snapshot
Inventory]
      FA[fact_accumulating
Fulfillment]
    end
    BT[bridge_order_promo]
  end
  OLT1 --> STG1
  OLT1 --> STG2
  CRM --> STG3
  STG1 --> FT
  STG1 --> FA
  STG2 --> DP
  STG3 --> DC
  FT --> DD
  FS --> DD
  FA --> DD
  FT --> DP
  FT --> DC
  FS --> DP
  FA --> DC
  FT --> BT
  FT --> DJ

Common Data Modeling Mistakes

1. Deeply Nested Snowflake Schemas

Star schemas are fast because they’re denormalized. Adding multiple levels of dimension references (e.g., product -> category -> department -> division) creates snowflake schemas that require 5+ joins. Flatten into a single dimension table or use bridge tables.

2. Ignoring SCD Strategy Until Migration

Choosing SCD Type 1, then realizing you need historical reporting means rewriting the entire dimension. Decide SCD strategy per column during design. For audit-critical attributes (salary, address), use Type 2 from the start.

3. Using SCD Type 2 for Everything

Type 2 on all 30 columns of a customer dimension creates exponential row growth. A customer who changes their phone number 5 times generates 5 rows, even though only the phone changed. Split volatile attributes into a separate dimension or use Type 1 for low-importance fields.

4. Fact Tables Without Granularity Defined

“Sum of sales” is meaningless if you don’t know what one row represents. One row per line item? Per invoice? Per store per day? Document grain in the table comment: COMMENT ON TABLE fact_sales IS 'Grain: one row per line item per order'.

5. Over-Normalizing Dimensions

Storing city_id in dim_customer pointing to dim_city, which points to dim_state, which points to dim_country. This is 3NF, not dimensional modeling. Flatten: dim_customer has city, state, country directly.

6. Mixing Fact and Dimension Attributes

Adding order_amount to dim_order violates dimensional modeling principles. Measures go in fact tables, descriptive attributes go in dimensions. Mixing them creates confusion and inconsistent aggregates.

7. Not Testing with Real Query Patterns

A perfect logical model that doesn’t match how users query is useless. Before finalizing, write 5-10 common business queries against your schema. If a query needs 12 joins, reconsider the model.

Practice Questions

1. What is the fundamental difference between Kimball and Inmon methodologies?

Kimball uses bottom-up dimensional modeling (star schemas per business process). Inmon uses top-down normalized EDW (3NF) with departmental data marts on top. Kimball delivers faster initial value, Inmon requires more upfront investment.

2. When would you use SCD Type 2 instead of Type 1?

Use Type 2 when you need historical reporting — e.g., “How much did each customer spend while living in each city?” Type 1 overwrites history, so you lose the ability to attribute facts to historical dimension values.

3. What is the grain of a fact table and why does it matter?

Grain describes what one row represents. fact_sales grain might be “one row per line item per order.” Grain determines what queries are possible. Without defined grain, aggregations produce wrong results.

4. What is a conformed dimension and how does it enable cross-process analysis?

A conformed dimension (like dim_date) has identical meaning and content across all fact tables. It lets you join sales, inventory, and HR facts on the same date dimension to answer cross-functional questions.

5. Challenge: Design a dimensional model for a hospital system tracking patient visits, diagnoses, procedures, and billing. Include SCD for patient demographics, conformed dimensions for date and doctor, and a fact table for encounters.

Use dim_patient (SCD Type 2 for address, Type 1 for blood type), dim_date (conformed), dim_doctor (SCD Type 2 for department), dim_diagnosis (junk dimension for ICD codes), fact_encounter (grain: one row per patient visit), fact_billing (grain: one row per billed procedure), bridge_encounter_diagnosis (many-to-many: an encounter can have multiple diagnoses).

Mini Project: SCD Type 2 Implementation

# scd_type2.py
# Implement SCD Type 2 in Python
from datetime import date, datetime
import csv

class SCD2Dimension:
    """Manage a Type 2 slowly changing dimension."""
    
    def __init__(self, key_columns: list):
        self.key_columns = key_columns
        self.records = {}  # natural_key -> list of versions
    
    def insert(self, row: dict):
        """Insert a new record (first version)."""
        natural_key = tuple(row[c] for c in self.key_columns)
        version = {
            **row,
            "effective_date": date(2025, 1, 1),
            "end_date": date(9999, 12, 31),
            "is_current": True
        }
        self.records[natural_key] = [version]
    
    def update(self, natural_key: tuple, new_values: dict, change_date: date = None):
        """Apply an SCD Type 2 update."""
        if natural_key not in self.records:
            raise ValueError(f"Key {natural_key} not found")
        
        change_date = change_date or date.today()
        current = self.records[natural_key][-1]
        
        # Check if anything actually changed
        changed = any(current.get(k) != v for k, v in new_values.items())
        if not changed:
            return current
        
        # Expire the current version
        current["end_date"] = change_date
        current["is_current"] = False
        
        # Create new version
        new_version = {**current, **new_values}
        new_version["effective_date"] = change_date
        new_version["end_date"] = date(9999, 12, 31)
        new_version["is_current"] = True
        
        self.records[natural_key].append(new_version)
        return new_version
    
    def get_as_of(self, natural_key: tuple, as_of_date: date):
        """Get the record as of a specific date."""
        for version in reversed(self.records.get(natural_key, [])):
            if version["effective_date"] <= as_of_date < version["end_date"]:
                return version
        return None

# Usage
dim = SCD2Dimension(key_columns=["customer_id"])

dim.insert({"customer_id": 101, "name": "Alice", "city": "New York", "email": "alice@email.com"})
dim.insert({"customer_id": 102, "name": "Bob", "city": "Chicago", "email": "bob@email.com"})

print("After initial insert:")
print(dim.records[(101,)][-1])

# Alice moves to LA
dim.update((101,), {"city": "Los Angeles"}, change_date=date(2026, 6, 15))

print("\nAfter Alice moves (current):")
print(dim.records[(101,)][-1])

print("\nAlice's history:")
for v in dim.records[(101,)]:
    print(f"  {v['city']}: {v['effective_date']} to {v['end_date']} (current={v['is_current']})")

print(f"\nAlice's city on 2025-06-01: {dim.get_as_of((101,), date(2025, 6, 1))['city']}")
print(f"Alice's city on 2026-07-01: {dim.get_as_of((101,), date(2026, 7, 1))['city']}")

Expected output:

After initial insert:
{'customer_id': 101, 'name': 'Alice', 'city': 'New York', 'email': 'alice@email.com', 'effective_date': datetime.date(2025, 1, 1), 'end_date': datetime.date(9999, 12, 31), 'is_current': True}

After Alice moves (current):
{'customer_id': 101, 'name': 'Alice', 'city': 'Los Angeles', 'email': 'alice@email.com', 'effective_date': datetime.date(2026, 6, 15), 'end_date': datetime.date(9999, 12, 31), 'is_current': True}

Alice's history:
  New York: 2025-01-01 to 2026-06-15 (current=False)
  Los Angeles: 2026-06-15 to 9999-12-31 (current=True)

Alice's city on 2025-06-01: New York
Alice's city on 2026-07-01: Los Angeles

Related Concepts

Data Modeling Basics

Data Warehousing Guide

dbt (Data Build Tool)

Data Lakehouse Architecture

FAQ

What is the difference between a star schema and a snowflake schema?

A star schema has one fact table connected directly to denormalized dimension tables (one level deep). A snowflake schema normalizes dimensions into sub-dimensions (e.g., product -> category -> department). Stars are simpler and faster; snowflakes reduce redundancy but need more joins.

When should I use a junk dimension?

Use a junk dimension when you have 3-10 low-cardinality flags (is_new_customer, is_verified, payment_method) that are independent. Each flag has few values but combining them creates many unique rows. A junk dimension collects all these in one table, avoiding 5+ small dimensions.

How do bridge tables handle many-to-many relationships in dimensional models?

Bridge tables sit between a fact and a dimension when one fact row relates to multiple dimension rows (e.g., one order uses multiple promotions). They contain the fact key, dimension key, and any relationship-specific measures (e.g., discount amount per promotion).

What is the grain of an accumulating snapshot fact table?

The grain is one row per instance of a process (e.g., one row per order). Unlike periodic snapshots (one row per time period), accumulating facts are updated as the process progresses. The row is inserted when the process starts and updated at each milestone.

What’s Next

You now understand advanced data modeling patterns including Kimball vs Inmon, SCD types, fact table types, conformed dimensions, bridge tables, and junk dimensions. Next, explore dbt for implementing these patterns in SQL and data pipeline orchestration for productionizing your models.

Practice daily — Take a flat dataset and model it as a star schema with one fact and 3 dimensions
Build a project — Implement an SCD Type 2 dimension in dbt using snapshot strategies
Explore patterns — Map your current warehouse schema and identify where bridge tables or junk dimensions would improve it

Remember: every expert was once a beginner. Keep modeling!

Previous Data Pipeline Orchestration — Airflow, Prefect, and Dagster Guide Next Data Quality & Testing — Great Expectations, dbt Tests & Automated Validation

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering