Learn dbt Explained — SQL-First Data Transformations with dbt Core

dbt Explained — SQL-First Data Transformations with dbt Core

DodaTech Updated Jun 15, 2026 8 min read

dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouse using SQL — treating transformations as code with testing, documentation, and version control.

What You’ll Learn

By the end of this tutorial, you’ll understand dbt’s core concepts: models, sources, tests, documentation, materializations (table, view, incremental), and how to build a complete dbt project for transforming raw orders.

Why dbt Matters

In the modern ELT workflow, you load raw data into the warehouse, then transform it there. dbt lets analysts write SELECT statements and handles the CREATE TABLE, INSERT, and incremental logic automatically. It also runs data tests, generates documentation, and tracks lineage. DodaTech uses dbt to transform raw Doda Browser event data into analytics-ready tables.

dbt Learning Path


flowchart LR
  A[Data Warehousing] --> B[dbt]
  A --> C[ETL Pipelines]
  C --> D[Apache Airflow]
  D --> B
  B --> E{You Are Here}
  E --> F[Models]
  E --> G[Tests]
  E --> H[Docs]
  E --> I[Materializations]

Prerequisites: Strong SQL skills, understanding of ETL and data warehousing. Familiarity with YAML configuration helps.

What Is dbt?

Think of dbt like a cookbook for your data warehouse. Instead of manually writing CREATE TABLE, INSERT, and ALTER statements every time you want a new dataset, you define a recipe (a SQL model), and dbt executes it, manages dependencies, and serves the result.

dbt is often called “the T in ELT” — it handles the transformation stage after raw data has been loaded into the warehouse by tools like Fivetran or Airbyte.

Core dbt Concepts

Models

A model is a single .sql file with a SELECT statement. dbt wraps it in a CREATE VIEW or CREATE TABLE.

-- models/stg_orders.sql
-- Stage model: clean and standardize raw orders
with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

cleaned as (
    select
        order_id,
        customer_id,
        cast(order_date as date) as order_date,
        cast(amount as decimal(10,2)) as amount,
        lower(status) as status,
        case
            when status = 'completed' then 'delivered'
            when status = 'shipped' then 'in_transit'
            else status
        end as delivery_status
    from source
    where order_id is not null
)

select * from cleaned

Sources

Sources tell dbt about the raw tables in your warehouse. They enable lineage tracking and freshness checks.

# models/sources.yml
version: 2

sources:
  - name: ecommerce
    database: raw_db
    schema: public
    tables:
      - name: orders
        description: "Raw orders from the e-commerce platform"
        columns:
          - name: order_id
            description: "Primary key"
            tests:
              - unique
              - not_null
      - name: customers
        description: "Raw customer records"
  - name: web_analytics
    database: raw_db
    schema: web
    tables:
      - name: page_views
        loaded_at_field: viewed_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Materializations

Type	What It Creates	When to Use
view	CREATE VIEW	Lightweight transformations, no performance concern
table	CREATE TABLE AS	Simple aggregations, rebuilt each run
incremental	MERGE / INSERT	Large tables, only new/changed records
ephemeral	CTE (common table expression)	Intermediate logic, don’t need persisted
materialized_view	Database materialized view	When platform supports auto-refresh

Tests

dbt tests validate data quality. Built-in tests: unique, not_null, accepted_values, relationships. You can also write custom tests.

-- tests/assert_positive_order_amount.sql
-- Custom test: fail if any order has negative amount
select * from {{ ref('stg_orders') }}
where amount < 0

Complete dbt Project Example

-- models/stg_customers.sql
-- Stage: clean customer data
with source as (
    select * from {{ source('ecommerce', 'customers') }}
)

select
    customer_id,
    trim(full_name) as customer_name,
    lower(email) as email,
    city,
    state,
    country,
    signup_date,
    case
        when total_orders >= 10 then 'vip'
        when total_orders >= 1 then 'active'
        else 'new'
    end as customer_tier
from source

-- models/dim_customers.sql
-- Dimension: customer details
select
    customer_id,
    customer_name,
    email,
    city,
    state,
    country,
    customer_tier,
    datediff('day', signup_date, current_date) as tenure_days
from {{ ref('stg_customers') }}

-- models/fct_orders.sql
-- Fact: order transactions
with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('dim_customers') }}
)

select
    o.order_id,
    o.customer_id,
    o.order_date,
    o.amount,
    o.delivery_status,
    c.customer_tier,
    c.country,
    date_trunc('month', o.order_date) as order_month
from orders o
left join customers c on o.customer_id = c.customer_id

# dbt_project.yml
name: dodatech_analytics
version: '1.0'
config-version: 2
profile: dodatech_profile

model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]

models:
  dodatech_analytics:
    staging:
      +materialized: view
      +schema: staging
    marts:
      +materialized: table
      +schema: analytics
      dims:
        +materialized: table
      facts:
        +materialized: incremental
        +unique_key: order_id

# dbt_demo.py
# Simulate dbt model compilation and execution
import json

class dbtModel:
    def __init__(self, name, sql, materialization='view', refs=None):
        self.name = name
        self.sql = sql
        self.materialization = materialization
        self.refs = refs or []

    def compile(self):
        """Simulate dbt's SQL compilation (ref resolution)."""
        compiled = self.sql
        for ref in self.refs:
            placeholder = "{{ ref('" + ref + "') }}"
            compiled = compiled.replace(placeholder, f"dodatech_analytics.{ref}")
        return compiled

    def execute(self, data):
        """Simulate dbt model execution."""
        print(f"[dbt] Running model: {self.name}")
        print(f"[dbt] Materialization: {self.materialization}")
        print(f"[dbt] Compiled SQL ({len(self.compile())} chars):")
        print(f"  {self.compile()[:100]}...")
        print(f"[dbt] Result: {len(data)} rows {'inserted' if self.materialization == 'incremental' else 'created'}")
        return data

# Define models programmatically
stg_orders_sql = """
select order_id, customer_id, order_date,
       cast(amount as decimal(10,2)) as amount, status
from raw_orders
where order_id is not null
"""

fct_orders_sql = """
select o.order_id, o.customer_id, o.order_date,
       o.amount, o.status, c.customer_tier
from {{ ref('stg_orders') }} o
left join {{ ref('dim_customers') }} c
on o.customer_id = c.customer_id
"""

models = {
    "stg_orders": dbtModel("stg_orders", stg_orders_sql, "view"),
    "dim_customers": dbtModel("dim_customers", "select * from raw_customers", "table"),
    "fct_orders": dbtModel("fct_orders", fct_orders_sql, "incremental", refs=["stg_orders", "dim_customers"]),
}

# Simulate dbt run
print("=== dbt Run ===\n")
sample_data = [{"order_id": i, "amount": i * 100.0} for i in range(1, 6)]

for name, model in models.items():
    model.execute(sample_data)

print("\n=== dbt Lineage ===")
print("""
  raw_orders ──> stg_orders ──┐
                              ├──> fct_orders
  raw_customers ─> dim_customers ┘
""")

Expected output:

=== dbt Run ===

[dbt] Running model: stg_orders
[dbt] Materialization: view
[dbt] Compiled SQL (125 chars):
  select order_id, customer_id, order_date,
       cast(amount as decimal(10,2)) as amount, status
from raw_orders...
[dbt] Result: 5 rows created

[dbt] Running model: dim_customers
[dbt] Materialization: table
[dbt] Compiled SQL (24 chars):
  select * from raw_customers...
[dbt] Result: 5 rows created

[dbt] Running model: fct_orders
[dbt] Materialization: incremental
[dbt] Compiled SQL (249 chars):
  select o.order_id, o.customer_id, o.order_date,
       o.amount, o.status, c.customer_tier
from dodatech_analytics.stg_orders o
left join...
[dbt] Result: 5 rows inserted

=== dbt Lineage ===

  raw_orders ──> stg_orders ──┐
                              ├──> fct_orders
  raw_customers ─> dim_customers ┘

dbt in Production

Structure

dodatech-analytics/
├── dbt_project.yml
├── profiles.yml          # Warehouse connection config
├── models/
│   ├── staging/          # Raw → clean
│   │   ├── stg_orders.sql
│   │   └── sources.yml
│   ├── marts/
│   │   ├── dim_customers.sql
│   │   └── fct_orders.sql
│   └── docs/
│       └── overview.md
├── tests/
│   └── assert_positive_amount.sql
└── analyses/
    └── revenue_trends.sql

Commands

# Run all models
dbt run

# Run specific model with dependencies
dbt run --model fct_orders+

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve  # Opens docs in browser

# Run with full refresh (ignore incremental)
dbt run --full-refresh

# Debug connection
dbt debug

Common dbt Mistakes

1. Not Using Sources

Hard-coding table references (FROM raw_db.orders) breaks when databases change. Always use {{ source('ecommerce', 'orders') }} for raw data.

2. Over-Complicated Models

A 200-line SQL model is hard to debug. Break complex logic into smaller intermediate models. Each model should do one thing well.

3. Ignoring Tests

Without tests, bad data silently flows downstream. Add not_null, unique, and accepted_values tests to every source and model column.

4. Forgetting `--full-refresh` for Incremental Models

When you change the schema of an incremental model, existing rows don’t change. Run dbt run --full-refresh to rebuild.

5. Not Using `ref` for Dependencies

Using direct table names instead of {{ ref('model_name') }} breaks dbt’s lineage tracking and dependency resolution.

6. No Documentation

Skipping docs.md and column descriptions means no one knows what a column means. Document as you model.

Practice Questions

1. What is the difference between a dbt model materialized as ‘view’ vs ’table’?

A view creates a virtual table (runs the query each time), uses no storage, and always reflects latest data. A table persists results to disk, is faster for reads, but must be rebuilt.

2. What is the purpose of dbt sources?

Sources tell dbt about raw tables in the warehouse. They enable lineage tracking, freshness checks, and testing on upstream data.

3. How does incremental materialization work in dbt?

dbt first creates the table, then on subsequent runs only inserts/merges new records using the unique_key and a WHERE filter (typically on updated_at > (select max(updated_at) from target_table)).

4. What is the dbt ref function and why is it important?

{{ ref('model_name') }} resolves to the actual table/view name of another dbt model. It automatically handles dependency ordering, schema prefixes, and enables the lineage graph.

5. Challenge: Design a dbt project for a subscription analytics pipeline with raw tables for subscriptions, payments, and users, and output monthly churn and MRR models.

Models: stg_subscriptions, stg_payments, dim_users, fct_monthly_recurring_revenue (incremental, aggregated by month), int_churn_analysis. Tests on unique keys and foreign key relationships.

Mini Project: dbt Source Freshness Checker

# freshness_check.py
# Simulate dbt source freshness checking
from datetime import datetime, timedelta
import json

class SourceFreshness:
    def __init__(self, max_hours):
        self.max_hours = max_hours

    def check(self, source_name, table_name, last_loaded, current_time=None):
        if current_time is None:
            current_time = datetime.now()
        elapsed = (current_time - last_loaded).total_seconds() / 3600
        status = "PASS" if elapsed <= self.max_hours else "FAIL"
        return {
            "source": source_name,
            "table": table_name,
            "last_loaded": last_loaded.isoformat(),
            "elapsed_hours": round(elapsed, 1),
            "max_hours": self.max_hours,
            "status": status,
        }

checker = SourceFreshness(max_hours=24)

tables = [
    ("ecommerce", "orders", datetime.now() - timedelta(hours=2)),
    ("ecommerce", "customers", datetime.now() - timedelta(hours=18)),
    ("web_analytics", "page_views", datetime.now() - timedelta(hours=30)),
]

print(f"{'Source':<20} {'Table':<15} {'Last Loaded':<25} {'Elapsed':<10} {'Max':<10} {'Status'}")
print("-" * 90)
for source, table, last_loaded in tables:
    result = checker.check(source, table, last_loaded)
    print(f"{result['source']:<20} {result['table']:<15} {result['last_loaded']:<25} {result['elapsed_hours']:<10}h {result['max_hours']:<10}h {result['status']}")

Expected output:

Source               Table           Last Loaded               Elapsed    Max        Status
---------------------------------------------------------------------------------------------
ecommerce            orders          2026-06-14T08:00:00       2.0h       24h        PASS
ecommerce            customers       2026-06-13T16:00:00       18.0h      24h        PASS
web_analytics        page_views      2026-06-12T04:00:00       30.0h      24h        FAIL

Related Concepts

Apache Airflow

Data Modeling

ETL Pipelines

What’s Next

You now understand dbt fundamentals! Next, explore data modeling to design better warehouse schemas, and learn Apache Airflow for orchestrating dbt runs in production.

Practice daily — Convert existing SQL transforms into dbt models
Build a project — Run dbt Core with DuckDB locally on a CSV dataset
Explore related topics — Check out dbt exposures, metrics, and docs generation

Remember: every expert was once a beginner. Keep coding!

Previous Apache Airflow Guide — DAGs, Operators, and ETL Orchestration Next Apache Spark Guide — RDDs, DataFrames, and PySpark Examples

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering