Advanced Data Modeling — Kimball vs Inmon, SCD Types, Fact Tables
Advanced data modeling covers the methodologies and patterns used to design scalable, maintainable data warehouses — from dimensional modeling choices to handling slowly changing dimensions and complex fact tables.
What You’ll Learn
By the end of this tutorial, you’ll understand Kimball vs Inmon methodologies, implement SCD Types 1/2/3, design fact tables (periodic snapshot, accumulating snapshot, transactional), and know when to use bridge tables, conformed dimensions, and junk dimensions.
Why Advanced Data Modeling Matters
Poor data modeling leads to slow queries, confusing schemas, and brittle pipelines. A retail company using a single flat table for all sales data ends up with 200-column tables, duplicate data, and queries that take 30 minutes. Proper dimensional modeling reduces query time by 90% and makes the schema understandable to business users. Doda Browser uses Kimball-style star schemas for its analytics platform, processing 10M+ events daily with sub-second query latency.
Data Modeling Learning Path
flowchart LR
A[Data Modeling Basics] --> B[Advanced Data Modeling]
B --> C{You Are Here}
C --> D[Kimball vs Inmon]
C --> E[SCD Types]
C --> F[Fact Tables]
D --> G[Dimensional Modeling]
E --> H[Type 1/2/3]
F --> I[Snapshot/Agg/Transaction]
Kimball vs Inmon Methodologies
Kimball (Bottom-Up)
Ralph Kimball’s approach builds the warehouse one business process at a time using star schemas (fact + dimension tables). Start with the most important process (e.g., sales), model it as a star schema, and add more processes incrementally.
Inmon (Top-Down)
Bill Inmon’s approach builds a normalized Enterprise Data Warehouse (EDW) first (3NF), then creates data marts for specific departments. Everything goes through the EDW first.
| Aspect | Kimball | Inmon |
|---|---|---|
| Approach | Bottom-up (process by process) | Top-down (enterprise first) |
| Schema | Dimensional (star) | Normalized (3NF) |
| Development time | Fast initial delivery | Slow initial delivery |
| Redundancy | Acceptable (denormalized) | Minimal (normalized) |
| Business user friendly | Yes (intuitive) | No (complex joins) |
| Best for | Agile teams, mid-size warehouses | Large enterprises, regulatory environments |
-- Kimball-style star schema
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
store_key INT REFERENCES dim_store(store_key),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
discount_amount DECIMAL(10,2) DEFAULT 0
);
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
brand VARCHAR(100),
current_price DECIMAL(10,2)
);
-- Query: sales by category
SELECT p.category, SUM(f.quantity * f.unit_price) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2026
GROUP BY p.category;-- Inmon-style 3NF model
CREATE TABLE order_header (
order_id BIGINT PRIMARY KEY,
customer_id BIGINT REFERENCES customer(customer_id),
order_date DATE NOT NULL,
store_id INT REFERENCES store(store_id),
total_amount DECIMAL(12,2),
status VARCHAR(20)
);
CREATE TABLE order_line_item (
line_item_id BIGINT PRIMARY KEY,
order_id BIGINT REFERENCES order_header(order_id),
product_id BIGINT REFERENCES product(product_id),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL
);
-- Query: sales by category (3NF — more joins)
SELECT p.category, SUM(li.quantity * li.unit_price) AS revenue
FROM order_line_item li
JOIN order_header oh ON li.order_id = oh.order_id
JOIN product p ON li.product_id = p.product_id
WHERE EXTRACT(YEAR FROM oh.order_date) = 2026
GROUP BY p.category;Slowly Changing Dimensions (SCD)
Dimensions change over time. A customer moves to a new city, a product changes price, an employee gets promoted. SCD types define how to handle these changes.
SCD Type 1: Overwrite
Overwrite the old value with the new. No history preserved.
-- SCD Type 1: update in place
-- Customer "Alice" changes email
UPDATE dim_customer
SET email = 'alice.new@email.com',
last_updated = CURRENT_DATE
WHERE customer_id = 101;
-- Old email is lost after this update
SELECT * FROM dim_customer WHERE customer_id = 101;
-- Result: shows only the new email| customer_id | name | city | |
|---|---|---|---|
| 101 | Alice | alice.new@email.com | NY |
Use when: Changes are corrections or non-critical attributes. No audit trail needed.
SCD Type 2: Add New Row
Preserve history by adding a new row with effective dates.
-- SCD Type 2: new row for changed attribute
-- Before change
INSERT INTO dim_customer_scd2
(customer_id, name, email, city, effective_date, end_date, is_current)
VALUES
(101, 'Alice', 'alice@email.com', 'New York', '2025-01-01', '2026-06-15', FALSE);
-- After address change
INSERT INTO dim_customer_scd2
(customer_id, name, email, city, effective_date, end_date, is_current)
VALUES
(101, 'Alice', 'alice@email.com', 'Los Angeles', '2026-06-16', '9999-12-31', TRUE);
-- NOW the old address is expired, new address is current| customer_id | name | city | effective_date | end_date | is_current | |
|---|---|---|---|---|---|---|
| 101 | Alice | alice@email.com | New York | 2025-01-01 | 2026-06-15 | FALSE |
| 101 | Alice | alice@email.com | Los Angeles | 2026-06-16 | 9999-12-31 | TRUE |
Use when: You need historical reporting (e.g., “Which city did customers live in when they made a purchase?”).
SCD Type 3: Add Previous Value Column
Keep both old and new values in the same row using additional columns.
-- SCD Type 3: add previous value column
ALTER TABLE dim_customer_scd3 ADD COLUMN previous_city VARCHAR(100);
ALTER TABLE dim_customer_scd3 ADD COLUMN city_changed_date DATE;
-- When customer changes city
UPDATE dim_customer_scd3
SET previous_city = city,
city = 'Los Angeles',
city_changed_date = '2026-06-16'
WHERE customer_id = 101;| customer_id | name | city | previous_city | city_changed_date |
|---|---|---|---|---|
| 101 | Alice | Los Angeles | New York | 2026-06-16 |
Use when: You need to track at most one previous value per attribute. Common for product price changes.
SCD Comparison
| Type | History | Storage | Query complexity | Use case |
|---|---|---|---|---|
| 1 | None | Minimal | Simple | Bug fixes, non-important attributes |
| 2 | Full | High | Moderate | Audit trails, regulatory compliance |
| 3 | One previous | Low | Simple | “What was the previous value?” |
Fact Tables
Fact tables store measures (numerics that can be aggregated) and foreign keys to dimensions.
Transaction Fact Table
One row per event. The most common type.
CREATE TABLE fact_transaction_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
discount DECIMAL(10,2) DEFAULT 0
);
-- Row per individual sale
INSERT INTO fact_transaction_sales VALUES
(1, 20260601, 101, 1001, 2, 25.00, 0),
(2, 20260601, 102, 1002, 1, 50.00, 5.00),
(3, 20260602, 101, 1001, 3, 25.00, 2.50);Periodic Snapshot Fact Table
One row per period. Captures state at regular intervals.
CREATE TABLE fact_monthly_inventory (
month_key INT REFERENCES dim_month(month_key),
product_key INT REFERENCES dim_product(product_key),
warehouse_key INT REFERENCES dim_warehouse(warehouse_key),
quantity_on_hand INT NOT NULL,
quantity_sold INT NOT NULL,
PRIMARY KEY (month_key, product_key, warehouse_key)
);
-- Monthly snapshot: what was in stock at end of each month
INSERT INTO fact_monthly_inventory VALUES
(202606, 101, 1, 500, 250),
(202606, 102, 1, 100, 80),
(202607, 101, 1, 400, 200);Accumulating Snapshot Fact Table
One row per process (not time period). Updated as the process progresses.
CREATE TABLE fact_order_fulfillment (
order_id BIGINT PRIMARY KEY,
order_date_key INT REFERENCES dim_date(date_key),
payment_date_key INT REFERENCES dim_date(date_key),
ship_date_key INT REFERENCES dim_date(date_key),
delivery_date_key INT REFERENCES dim_date(date_key),
customer_key INT REFERENCES dim_customer(customer_key),
order_amount DECIMAL(10,2),
days_to_payment INT,
days_to_delivery INT
);
-- Single row per order, updated as it progresses
INSERT INTO fact_order_fulfillMENT
(order_id, order_date_key, customer_key, order_amount)
VALUES (5001, 20260601, 1001, 250.00);
-- When payment received
UPDATE fact_order_fulfillMENT
SET payment_date_key = 20260602,
days_to_payment = 1
WHERE order_id = 5001;
-- When shipped
UPDATE fact_order_fulfillMENT
SET ship_date_key = 20260603
WHERE order_id = 5001;Conformed Dimensions
A conformed dimension is a dimension that means the same thing across multiple fact tables. The dim_date table works for sales, inventory, and HR facts — it’s conformed.
-- Conformed date dimension used by all fact tables
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL,
month INT NOT NULL,
month_name VARCHAR(20),
week INT NOT NULL,
day_of_week INT NOT NULL,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
-- One row per date — shared across ALL fact tables
INSERT INTO dim_date VALUES
(20260601, '2026-06-01', 2026, 2, 6, 'June', 23, 1, FALSE, FALSE),
(20260602, '2026-06-02', 2026, 2, 6, 'June', 23, 2, FALSE, FALSE);Without conformed dimensions: Sales fact uses sale_date (VARCHAR), inventory fact uses inv_month (INT) — joining across processes is impossible or error-prone.
With conformed dimensions: Every fact joins to dim_date.date_key. You can compare sales vs inventory trends in a single query.
Bridge Tables and Junk Dimensions
Bridge Tables
Solve many-to-many relationships between facts and dimensions.
-- Bridge table: an order can have multiple promotions
CREATE TABLE bridge_order_promotion (
order_id BIGINT REFERENCES fact_order(order_id),
promotion_key INT REFERENCES dim_promotion(promotion_key),
discount_amount DECIMAL(10,2),
PRIMARY KEY (order_id, promotion_key)
);
-- An order used both "Summer Sale" and "Loyalty Bonus"
INSERT INTO bridge_order_promotion VALUES
(5001, 1, 25.00),
(5001, 2, 10.00);Junk Dimensions
A junk dimension groups low-cardinality flags and indicators into a single dimension table instead of adding 10 tiny dimensions.
-- Instead of 10 separate dimensions for flags...
CREATE TABLE dim_junk_order (
junk_key INT PRIMARY KEY,
is_new_customer BOOLEAN,
is_verified BOOLEAN,
is_expedited_shipping BOOLEAN,
payment_method VARCHAR(20),
order_source VARCHAR(20)
);
-- Pre-generate all combinations
INSERT INTO dim_junk_order VALUES
(1, TRUE, TRUE, FALSE, 'credit_card', 'web'),
(2, TRUE, TRUE, TRUE, 'paypal', 'mobile'),
(3, FALSE, TRUE, FALSE, 'credit_card', 'web'),
(4, FALSE, FALSE, FALSE, 'debit', 'in_store');
-- Fact table just references a single junk_key
ALTER TABLE fact_order ADD COLUMN junk_key INT REFERENCES dim_junk_order(junk_key);Advanced Data Modeling Architecture
flowchart TB
subgraph "Source Systems"
OLT1[(OLTP
Orders DB)]
OLT2[(OLTP
Inventory)]
CRM[(CRM System)]
end
subgraph "Staging"
STG1[stg_orders]
STG2[stg_products]
STG3[stg_customers]
end
subgraph "Dimensional Model"
subgraph "Dimensions"
DP[dim_product
SCD Type 2]
DC[dim_customer
SCD Type 2]
DD[dim_date
Conformed]
DJ[dim_junk]
end
subgraph "Fact Tables"
FT[fact_transaction
Sales]
FS[fact_snapshot
Inventory]
FA[fact_accumulating
Fulfillment]
end
BT[bridge_order_promo]
end
OLT1 --> STG1
OLT1 --> STG2
CRM --> STG3
STG1 --> FT
STG1 --> FA
STG2 --> DP
STG3 --> DC
FT --> DD
FS --> DD
FA --> DD
FT --> DP
FT --> DC
FS --> DP
FA --> DC
FT --> BT
FT --> DJ
Common Data Modeling Mistakes
1. Deeply Nested Snowflake Schemas
Star schemas are fast because they’re denormalized. Adding multiple levels of dimension references (e.g., product -> category -> department -> division) creates snowflake schemas that require 5+ joins. Flatten into a single dimension table or use bridge tables.
2. Ignoring SCD Strategy Until Migration
Choosing SCD Type 1, then realizing you need historical reporting means rewriting the entire dimension. Decide SCD strategy per column during design. For audit-critical attributes (salary, address), use Type 2 from the start.
3. Using SCD Type 2 for Everything
Type 2 on all 30 columns of a customer dimension creates exponential row growth. A customer who changes their phone number 5 times generates 5 rows, even though only the phone changed. Split volatile attributes into a separate dimension or use Type 1 for low-importance fields.
4. Fact Tables Without Granularity Defined
“Sum of sales” is meaningless if you don’t know what one row represents. One row per line item? Per invoice? Per store per day? Document grain in the table comment: COMMENT ON TABLE fact_sales IS 'Grain: one row per line item per order'.
5. Over-Normalizing Dimensions
Storing city_id in dim_customer pointing to dim_city, which points to dim_state, which points to dim_country. This is 3NF, not dimensional modeling. Flatten: dim_customer has city, state, country directly.
6. Mixing Fact and Dimension Attributes
Adding order_amount to dim_order violates dimensional modeling principles. Measures go in fact tables, descriptive attributes go in dimensions. Mixing them creates confusion and inconsistent aggregates.
7. Not Testing with Real Query Patterns
A perfect logical model that doesn’t match how users query is useless. Before finalizing, write 5-10 common business queries against your schema. If a query needs 12 joins, reconsider the model.
Practice Questions
1. What is the fundamental difference between Kimball and Inmon methodologies?
Kimball uses bottom-up dimensional modeling (star schemas per business process). Inmon uses top-down normalized EDW (3NF) with departmental data marts on top. Kimball delivers faster initial value, Inmon requires more upfront investment.
2. When would you use SCD Type 2 instead of Type 1?
Use Type 2 when you need historical reporting — e.g., “How much did each customer spend while living in each city?” Type 1 overwrites history, so you lose the ability to attribute facts to historical dimension values.
3. What is the grain of a fact table and why does it matter?
Grain describes what one row represents. fact_sales grain might be “one row per line item per order.” Grain determines what queries are possible. Without defined grain, aggregations produce wrong results.
4. What is a conformed dimension and how does it enable cross-process analysis?
A conformed dimension (like dim_date) has identical meaning and content across all fact tables. It lets you join sales, inventory, and HR facts on the same date dimension to answer cross-functional questions.
5. Challenge: Design a dimensional model for a hospital system tracking patient visits, diagnoses, procedures, and billing. Include SCD for patient demographics, conformed dimensions for date and doctor, and a fact table for encounters.
Use dim_patient (SCD Type 2 for address, Type 1 for blood type), dim_date (conformed), dim_doctor (SCD Type 2 for department), dim_diagnosis (junk dimension for ICD codes), fact_encounter (grain: one row per patient visit), fact_billing (grain: one row per billed procedure), bridge_encounter_diagnosis (many-to-many: an encounter can have multiple diagnoses).
Mini Project: SCD Type 2 Implementation
# scd_type2.py
# Implement SCD Type 2 in Python
from datetime import date, datetime
import csv
class SCD2Dimension:
"""Manage a Type 2 slowly changing dimension."""
def __init__(self, key_columns: list):
self.key_columns = key_columns
self.records = {} # natural_key -> list of versions
def insert(self, row: dict):
"""Insert a new record (first version)."""
natural_key = tuple(row[c] for c in self.key_columns)
version = {
**row,
"effective_date": date(2025, 1, 1),
"end_date": date(9999, 12, 31),
"is_current": True
}
self.records[natural_key] = [version]
def update(self, natural_key: tuple, new_values: dict, change_date: date = None):
"""Apply an SCD Type 2 update."""
if natural_key not in self.records:
raise ValueError(f"Key {natural_key} not found")
change_date = change_date or date.today()
current = self.records[natural_key][-1]
# Check if anything actually changed
changed = any(current.get(k) != v for k, v in new_values.items())
if not changed:
return current
# Expire the current version
current["end_date"] = change_date
current["is_current"] = False
# Create new version
new_version = {**current, **new_values}
new_version["effective_date"] = change_date
new_version["end_date"] = date(9999, 12, 31)
new_version["is_current"] = True
self.records[natural_key].append(new_version)
return new_version
def get_as_of(self, natural_key: tuple, as_of_date: date):
"""Get the record as of a specific date."""
for version in reversed(self.records.get(natural_key, [])):
if version["effective_date"] <= as_of_date < version["end_date"]:
return version
return None
# Usage
dim = SCD2Dimension(key_columns=["customer_id"])
dim.insert({"customer_id": 101, "name": "Alice", "city": "New York", "email": "alice@email.com"})
dim.insert({"customer_id": 102, "name": "Bob", "city": "Chicago", "email": "bob@email.com"})
print("After initial insert:")
print(dim.records[(101,)][-1])
# Alice moves to LA
dim.update((101,), {"city": "Los Angeles"}, change_date=date(2026, 6, 15))
print("\nAfter Alice moves (current):")
print(dim.records[(101,)][-1])
print("\nAlice's history:")
for v in dim.records[(101,)]:
print(f" {v['city']}: {v['effective_date']} to {v['end_date']} (current={v['is_current']})")
print(f"\nAlice's city on 2025-06-01: {dim.get_as_of((101,), date(2025, 6, 1))['city']}")
print(f"Alice's city on 2026-07-01: {dim.get_as_of((101,), date(2026, 7, 1))['city']}")Expected output:
After initial insert:
{'customer_id': 101, 'name': 'Alice', 'city': 'New York', 'email': 'alice@email.com', 'effective_date': datetime.date(2025, 1, 1), 'end_date': datetime.date(9999, 12, 31), 'is_current': True}
After Alice moves (current):
{'customer_id': 101, 'name': 'Alice', 'city': 'Los Angeles', 'email': 'alice@email.com', 'effective_date': datetime.date(2026, 6, 15), 'end_date': datetime.date(9999, 12, 31), 'is_current': True}
Alice's history:
New York: 2025-01-01 to 2026-06-15 (current=False)
Los Angeles: 2026-06-15 to 9999-12-31 (current=True)
Alice's city on 2025-06-01: New York
Alice's city on 2026-07-01: Los AngelesRelated Concepts
FAQ
What’s Next
You now understand advanced data modeling patterns including Kimball vs Inmon, SCD types, fact table types, conformed dimensions, bridge tables, and junk dimensions. Next, explore dbt for implementing these patterns in SQL and data pipeline orchestration for productionizing your models.
- Practice daily — Take a flat dataset and model it as a star schema with one fact and 3 dimensions
- Build a project — Implement an SCD Type 2 dimension in dbt using snapshot strategies
- Explore patterns — Map your current warehouse schema and identify where bridge tables or junk dimensions would improve it
Remember: every expert was once a beginner. Keep modeling!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro