dbt Explained — SQL-First Data Transformations with dbt Core
dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouse using SQL — treating transformations as code with testing, documentation, and version control.
What You’ll Learn
By the end of this tutorial, you’ll understand dbt’s core concepts: models, sources, tests, documentation, materializations (table, view, incremental), and how to build a complete dbt project for transforming raw orders.
Why dbt Matters
In the modern ELT workflow, you load raw data into the warehouse, then transform it there. dbt lets analysts write SELECT statements and handles the CREATE TABLE, INSERT, and incremental logic automatically. It also runs data tests, generates documentation, and tracks lineage. DodaTech uses dbt to transform raw Doda Browser event data into analytics-ready tables.
dbt Learning Path
flowchart LR
A[Data Warehousing] --> B[dbt]
A --> C[ETL Pipelines]
C --> D[Apache Airflow]
D --> B
B --> E{You Are Here}
E --> F[Models]
E --> G[Tests]
E --> H[Docs]
E --> I[Materializations]
What Is dbt?
Think of dbt like a cookbook for your data warehouse. Instead of manually writing CREATE TABLE, INSERT, and ALTER statements every time you want a new dataset, you define a recipe (a SQL model), and dbt executes it, manages dependencies, and serves the result.
dbt is often called “the T in ELT” — it handles the transformation stage after raw data has been loaded into the warehouse by tools like Fivetran or Airbyte.
Core dbt Concepts
Models
A model is a single .sql file with a SELECT statement. dbt wraps it in a CREATE VIEW or CREATE TABLE.
-- models/stg_orders.sql
-- Stage model: clean and standardize raw orders
with source as (
select * from {{ source('ecommerce', 'orders') }}
),
cleaned as (
select
order_id,
customer_id,
cast(order_date as date) as order_date,
cast(amount as decimal(10,2)) as amount,
lower(status) as status,
case
when status = 'completed' then 'delivered'
when status = 'shipped' then 'in_transit'
else status
end as delivery_status
from source
where order_id is not null
)
select * from cleanedSources
Sources tell dbt about the raw tables in your warehouse. They enable lineage tracking and freshness checks.
# models/sources.yml
version: 2
sources:
- name: ecommerce
database: raw_db
schema: public
tables:
- name: orders
description: "Raw orders from the e-commerce platform"
columns:
- name: order_id
description: "Primary key"
tests:
- unique
- not_null
- name: customers
description: "Raw customer records"
- name: web_analytics
database: raw_db
schema: web
tables:
- name: page_views
loaded_at_field: viewed_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}Materializations
| Type | What It Creates | When to Use |
|---|---|---|
| view | CREATE VIEW | Lightweight transformations, no performance concern |
| table | CREATE TABLE AS | Simple aggregations, rebuilt each run |
| incremental | MERGE / INSERT | Large tables, only new/changed records |
| ephemeral | CTE (common table expression) | Intermediate logic, don’t need persisted |
| materialized_view | Database materialized view | When platform supports auto-refresh |
Tests
dbt tests validate data quality. Built-in tests: unique, not_null, accepted_values, relationships. You can also write custom tests.
-- tests/assert_positive_order_amount.sql
-- Custom test: fail if any order has negative amount
select * from {{ ref('stg_orders') }}
where amount < 0Complete dbt Project Example
-- models/stg_customers.sql
-- Stage: clean customer data
with source as (
select * from {{ source('ecommerce', 'customers') }}
)
select
customer_id,
trim(full_name) as customer_name,
lower(email) as email,
city,
state,
country,
signup_date,
case
when total_orders >= 10 then 'vip'
when total_orders >= 1 then 'active'
else 'new'
end as customer_tier
from source-- models/dim_customers.sql
-- Dimension: customer details
select
customer_id,
customer_name,
email,
city,
state,
country,
customer_tier,
datediff('day', signup_date, current_date) as tenure_days
from {{ ref('stg_customers') }}-- models/fct_orders.sql
-- Fact: order transactions
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
)
select
o.order_id,
o.customer_id,
o.order_date,
o.amount,
o.delivery_status,
c.customer_tier,
c.country,
date_trunc('month', o.order_date) as order_month
from orders o
left join customers c on o.customer_id = c.customer_id# dbt_project.yml
name: dodatech_analytics
version: '1.0'
config-version: 2
profile: dodatech_profile
model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
models:
dodatech_analytics:
staging:
+materialized: view
+schema: staging
marts:
+materialized: table
+schema: analytics
dims:
+materialized: table
facts:
+materialized: incremental
+unique_key: order_id# dbt_demo.py
# Simulate dbt model compilation and execution
import json
class dbtModel:
def __init__(self, name, sql, materialization='view', refs=None):
self.name = name
self.sql = sql
self.materialization = materialization
self.refs = refs or []
def compile(self):
"""Simulate dbt's SQL compilation (ref resolution)."""
compiled = self.sql
for ref in self.refs:
placeholder = "{{ ref('" + ref + "') }}"
compiled = compiled.replace(placeholder, f"dodatech_analytics.{ref}")
return compiled
def execute(self, data):
"""Simulate dbt model execution."""
print(f"[dbt] Running model: {self.name}")
print(f"[dbt] Materialization: {self.materialization}")
print(f"[dbt] Compiled SQL ({len(self.compile())} chars):")
print(f" {self.compile()[:100]}...")
print(f"[dbt] Result: {len(data)} rows {'inserted' if self.materialization == 'incremental' else 'created'}")
return data
# Define models programmatically
stg_orders_sql = """
select order_id, customer_id, order_date,
cast(amount as decimal(10,2)) as amount, status
from raw_orders
where order_id is not null
"""
fct_orders_sql = """
select o.order_id, o.customer_id, o.order_date,
o.amount, o.status, c.customer_tier
from {{ ref('stg_orders') }} o
left join {{ ref('dim_customers') }} c
on o.customer_id = c.customer_id
"""
models = {
"stg_orders": dbtModel("stg_orders", stg_orders_sql, "view"),
"dim_customers": dbtModel("dim_customers", "select * from raw_customers", "table"),
"fct_orders": dbtModel("fct_orders", fct_orders_sql, "incremental", refs=["stg_orders", "dim_customers"]),
}
# Simulate dbt run
print("=== dbt Run ===\n")
sample_data = [{"order_id": i, "amount": i * 100.0} for i in range(1, 6)]
for name, model in models.items():
model.execute(sample_data)
print("\n=== dbt Lineage ===")
print("""
raw_orders ──> stg_orders ──┐
├──> fct_orders
raw_customers ─> dim_customers ┘
""")Expected output:
=== dbt Run ===
[dbt] Running model: stg_orders
[dbt] Materialization: view
[dbt] Compiled SQL (125 chars):
select order_id, customer_id, order_date,
cast(amount as decimal(10,2)) as amount, status
from raw_orders...
[dbt] Result: 5 rows created
[dbt] Running model: dim_customers
[dbt] Materialization: table
[dbt] Compiled SQL (24 chars):
select * from raw_customers...
[dbt] Result: 5 rows created
[dbt] Running model: fct_orders
[dbt] Materialization: incremental
[dbt] Compiled SQL (249 chars):
select o.order_id, o.customer_id, o.order_date,
o.amount, o.status, c.customer_tier
from dodatech_analytics.stg_orders o
left join...
[dbt] Result: 5 rows inserted
=== dbt Lineage ===
raw_orders ──> stg_orders ──┐
├──> fct_orders
raw_customers ─> dim_customers ┘dbt in Production
Structure
dodatech-analytics/
├── dbt_project.yml
├── profiles.yml # Warehouse connection config
├── models/
│ ├── staging/ # Raw → clean
│ │ ├── stg_orders.sql
│ │ └── sources.yml
│ ├── marts/
│ │ ├── dim_customers.sql
│ │ └── fct_orders.sql
│ └── docs/
│ └── overview.md
├── tests/
│ └── assert_positive_amount.sql
└── analyses/
└── revenue_trends.sqlCommands
# Run all models
dbt run
# Run specific model with dependencies
dbt run --model fct_orders+
# Run tests
dbt test
# Generate documentation
dbt docs generate
dbt docs serve # Opens docs in browser
# Run with full refresh (ignore incremental)
dbt run --full-refresh
# Debug connection
dbt debugCommon dbt Mistakes
1. Not Using Sources
Hard-coding table references (FROM raw_db.orders) breaks when databases change. Always use {{ source('ecommerce', 'orders') }} for raw data.
2. Over-Complicated Models
A 200-line SQL model is hard to debug. Break complex logic into smaller intermediate models. Each model should do one thing well.
3. Ignoring Tests
Without tests, bad data silently flows downstream. Add not_null, unique, and accepted_values tests to every source and model column.
4. Forgetting --full-refresh for Incremental Models
When you change the schema of an incremental model, existing rows don’t change. Run dbt run --full-refresh to rebuild.
5. Not Using ref for Dependencies
Using direct table names instead of {{ ref('model_name') }} breaks dbt’s lineage tracking and dependency resolution.
6. No Documentation
Skipping docs.md and column descriptions means no one knows what a column means. Document as you model.
Practice Questions
1. What is the difference between a dbt model materialized as ‘view’ vs ’table’?
A view creates a virtual table (runs the query each time), uses no storage, and always reflects latest data. A table persists results to disk, is faster for reads, but must be rebuilt.
2. What is the purpose of dbt sources?
Sources tell dbt about raw tables in the warehouse. They enable lineage tracking, freshness checks, and testing on upstream data.
3. How does incremental materialization work in dbt?
dbt first creates the table, then on subsequent runs only inserts/merges new records using the unique_key and a WHERE filter (typically on updated_at > (select max(updated_at) from target_table)).
4. What is the dbt ref function and why is it important?
{{ ref('model_name') }} resolves to the actual table/view name of another dbt model. It automatically handles dependency ordering, schema prefixes, and enables the lineage graph.
5. Challenge: Design a dbt project for a subscription analytics pipeline with raw tables for subscriptions, payments, and users, and output monthly churn and MRR models.
Models: stg_subscriptions, stg_payments, dim_users, fct_monthly_recurring_revenue (incremental, aggregated by month), int_churn_analysis. Tests on unique keys and foreign key relationships.
Mini Project: dbt Source Freshness Checker
# freshness_check.py
# Simulate dbt source freshness checking
from datetime import datetime, timedelta
import json
class SourceFreshness:
def __init__(self, max_hours):
self.max_hours = max_hours
def check(self, source_name, table_name, last_loaded, current_time=None):
if current_time is None:
current_time = datetime.now()
elapsed = (current_time - last_loaded).total_seconds() / 3600
status = "PASS" if elapsed <= self.max_hours else "FAIL"
return {
"source": source_name,
"table": table_name,
"last_loaded": last_loaded.isoformat(),
"elapsed_hours": round(elapsed, 1),
"max_hours": self.max_hours,
"status": status,
}
checker = SourceFreshness(max_hours=24)
tables = [
("ecommerce", "orders", datetime.now() - timedelta(hours=2)),
("ecommerce", "customers", datetime.now() - timedelta(hours=18)),
("web_analytics", "page_views", datetime.now() - timedelta(hours=30)),
]
print(f"{'Source':<20} {'Table':<15} {'Last Loaded':<25} {'Elapsed':<10} {'Max':<10} {'Status'}")
print("-" * 90)
for source, table, last_loaded in tables:
result = checker.check(source, table, last_loaded)
print(f"{result['source']:<20} {result['table']:<15} {result['last_loaded']:<25} {result['elapsed_hours']:<10}h {result['max_hours']:<10}h {result['status']}")Expected output:
Source Table Last Loaded Elapsed Max Status
---------------------------------------------------------------------------------------------
ecommerce orders 2026-06-14T08:00:00 2.0h 24h PASS
ecommerce customers 2026-06-13T16:00:00 18.0h 24h PASS
web_analytics page_views 2026-06-12T04:00:00 30.0h 24h FAILRelated Concepts
What’s Next
You now understand dbt fundamentals! Next, explore data modeling to design better warehouse schemas, and learn Apache Airflow for orchestrating dbt runs in production.
- Practice daily — Convert existing SQL transforms into dbt models
- Build a project — Run dbt Core with DuckDB locally on a CSV dataset
- Explore related topics — Check out dbt exposures, metrics, and docs generation
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro