Analytics Engineering with Dbt & SQL -- Complete Guide to Modern Data Transformation

DodaTech Updated 2026-06-23 10 min read

In this tutorial, you'll learn about Analytics Engineering with Dbt & SQL. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Analytics engineering applies software engineering best practices -- version control, testing, documentation, and CI/CD -- to data transformation, bridging the gap between raw data ingestion and business-ready analytics datasets.

What You'll Learn

In this tutorial, you will learn how to use Dbt (data build tool) with SQL to transform raw data into staging, intermediate, and mart models, implement data quality tests with Dbt tests, generate automated documentation, manage incremental models for performance, and deploy Data Pipelines with CI/CD workflows using dbt Cloud or open-source Dbt Core.

Why It Matters

Most data teams spend 60-80% of their time on manual, untested, undocumented SQL transformations that break silently and produce inconsistent reports. Analytics engineering solves this by treating data transformations as code: version-controlled, tested, documented, and deployed through CI/CD. Companies adopting analytics engineering with Dbt reduce data inconsistency incidents by 70%, cut model development time by 50%, and enable analysts to contribute production data models without engineering support.

Real-World Use

Doda Browser's data team used Dbt to rebuild their analytics Stack from 200 untested SQL scripts into 45 modular, tested, and documented models. The Dbt project generates 120 automated tests that run on every deployment, catching data quality issues before they reach dashboards. Documentation is auto-generated and hosted on Dbt Docs, reducing onboarding time for new team members from 3 weeks to 3 days.

Dbt Data Transformation Architecture

flowchart TD
    A[Raw Data Sources] --> B[Staging Models]
    B --> C[Source Freshness Tests]
    B --> D[Intermediate Models]
    D --> E[Business Logic Transformations]
    E --> F[Fact Tables]
    E --> G[Dimension Tables]
    F --> H[Mart Models]
    G --> H
    H --> I[BI Tools]
    H --> J[Data Exports]
    B --> K[Dbt Tests]
    E --> K
    F --> K
    H --> K
    K --> L[Dbt Docs]
    K --> M[CI/CD Pipeline]

Dbt Project Structure

A well-organized Dbt project follows a Layered Architecture:

-- models/staging/stg_orders.SQL
-- Stage 1: Clean raw data, rename columns, cast types

WITH source AS (
    SELECT * FROM {{ source('ecommerce', 'raw_orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        CAST(order_total AS DECIMAL(10,2)) AS order_amount,
        status AS order_status,
        payment_method,
        shipping_address,
        created_at::TIMESTAMP AS ingested_at
    FROM source
    WHERE order_date IS NOT NULL
)

SELECT * FROM renamed

Expected behavior: The staging model stg_orders creates a clean, typed view of raw order data. It renames columns for consistency, casts to appropriate data types, and filters out invalid records. The {{ source() }} function references a source defined in sources.yml for lineage tracking.

Intermediate and Mart Models

Build business logic in intermediate layers and final marts:

-- models/intermediate/int_customer_orders.SQL
-- Intermediate: Aggregate customer-level order metrics

WITH customer_orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customer_order_metrics AS (
    SELECT
        customer_id,
        COUNT(DISTINCT order_id) AS total_orders,
        COUNT(DISTINCT CASE
            WHEN order_status = 'completed' THEN order_id
        END) AS completed_orders,
        SUM(CASE
            WHEN order_status = 'completed' THEN order_amount
            ELSE 0
        END) AS total_revenue,
        AVG(CASE
            WHEN order_status = 'completed' THEN order_amount
        END) AS avg_order_value,
        MIN(order_date) AS first_order_date,
        MAX(order_date) AS last_order_date,
        CURRENT_DATE - MAX(order_date) AS days_since_last_order,
        DATEDIFF('day', MIN(order_date), MAX(order_date)) AS customer_lifetime_days
    FROM customer_orders
    GROUP BY customer_id
),

customer_status AS (
    SELECT
        *,
        CASE
            WHEN days_since_last_order <= 30 THEN 'Active'
            WHEN days_since_last_order <= 90 THEN 'Recent'
            WHEN days_since_last_order <= 365 THEN 'At Risk'
            ELSE 'Lost'
        END AS customer_segment,
        CASE
            WHEN total_revenue >= 1000 THEN 'High Value'
            WHEN total_revenue >= 250 THEN 'Medium Value'
            ELSE 'Low Value'
        END AS value_tier
    FROM customer_order_metrics
)

SELECT * FROM customer_status

-- models/marts/dim_customers.sql
-- Mart: Final customer dimension for BI tools

WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

customer_metrics AS (
    SELECT * FROM {{ ref('int_customer_orders') }}
)

SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    c.signup_date,
    cm.total_orders,
    cm.completed_orders,
    cm.total_revenue,
    cm.avg_order_value,
    cm.first_order_date,
    cm.last_order_date,
    cm.days_since_last_order,
    cm.customer_segment,
    cm.value_tier,
    cm.customer_lifetime_days,
    CURRENT_TIMESTAMP AS refreshed_at
FROM customers c
LEFT JOIN customer_metrics cm ON c.customer_id = cm.customer_id

Expected output: The mart model dim_customers provides a complete customer dimension for BI tools. It combines customer profile data with computed behavioral metrics. BI analysts query this model directly without writing complex joins or transformations.

Data Testing with Dbt

Define and run automated data quality tests:

# tests/schema.yml
version: 2

models:
  - name: stg_orders
    description: Cleaned order data from raw source
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: order_amount
        tests:
          - not_null
          - accepted_values:
              values: ["> 0"]

  - name: dim_customers
    description: Customer dimension with behavioral metrics
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: total_orders
        tests:
          - not_null
      - name: total_revenue
        tests:
          - Dbt_utils.expression_is_true:
              expression: ">= 0"

sources:
  - name: ecommerce
    description: Raw ecommerce database
    tables:
      - name: raw_orders
        freshness:
          warn_after: { count: 6, period: hour }
          error_after: { count: 24, period: hour }
        loaded_at_field: created_at

Expected behavior: Running dbt test executes all defined tests. unique and not_null ensure data integrity. relationships tests verify foreign key consistency. freshness tests alert if raw source data stops loading. Tests that fail break the CI/CD pipeline, preventing bad data from reaching production.

Incremental Models for Performance

Optimize large-volume transformations with incremental materialization:

-- models/marts/fct_orders.SQL
-- Incremental model: Only Process new/changed records

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_Strategy='Merge',
    on_schema_change='sync_all_columns'
) }}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
    {% if is_incremental() %}
    WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }}) - INTERVAL '3 days'
    {% endif %}
),

payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
    {% if is_incremental() %}
    WHERE payment_date >= (SELECT MAX(order_date) FROM {{ this }}) - INTERVAL '3 days'
    {% endif %}
)

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    o.order_amount,
    o.order_status,
    o.payment_method,
    COALESCE(p.payment_status, 'pending') AS payment_status,
    COALESCE(p.payment_amount, 0) AS payment_amount,
    CURRENT_TIMESTAMP AS model_refreshed_at
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id

Expected behavior: On first run, the model processes all historical data. On subsequent runs, it only processes the last 3 days of orders (plus a 3-day lookback for late-arriving data). This reduces build time from hours to minutes for large datasets. The Merge Strategy updates existing records and inserts new ones.

Dbt Documentation and Lineage

Generate automated documentation with column-level lineage:

# Generate Dbt docs and serve locally
Dbt docs generate
Dbt docs serve

# Deploy to Dbt Cloud or hosting
Dbt docs generate --target prod
cp -R target/* docs/

# models/marts/dim_customers.yml
version: 2

models:
  - name: dim_customers
    description: |
      Customer dimension table with behavioral metrics and segmentation.
      Refreshed daily via incremental pipeline.

    columns:
      - name: customer_id
        description: Unique customer identifier from source system
        tests: [unique, not_null]

      - name: customer_segment
        description: |
          Behavioral segment based on recency of last order:
          - Active: ordered within 30 days
          - Recent: ordered within 90 days
          - At Risk: ordered within 365 days
          - Lost: no order in over 365 days

      - name: value_tier
        description: |
          Customer value based on total lifetime revenue:
          - High Value: over $1,000
          - Medium Value: $250-$1,000
          - Low Value: under $250

Expected output: Running dbt docs generate produces a static site with lineage graphs (DAG), column-level documentation, test results, and source freshness reports. Stakeholders can explore the data model without reading SQL. The lineage Graph shows dependencies between models, enabling impact analysis before changes.

CI/CD Pipeline Configuration

Automate testing and deployment with GitHub Actions:

# .github/workflows/Dbt_ci.yml
name: Dbt CI Pipeline

on:
  pull_request:
    branches: [main]
    paths:
      - "models/**"
      - "tests/**"
      - "Dbt_project.yml"

jobs:
  Dbt-ci:
    runs-on: Ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-Python@v4
        with:
          Python-version: "3.10"

      - name: Install Dbt
        run: Pip install Dbt-postgres

      - name: Dbt deps
        run: Dbt deps

      - name: Dbt seed
        run: Dbt seed --target ci

      - name: Dbt run
        run: Dbt run --target ci

      - name: Dbt test
        run: Dbt test --target ci

      - name: Dbt docs generate
        run: Dbt docs generate --target ci

Expected behavior: Every Pull Request triggers the CI pipeline. It installs Dbt, runs seed data, builds all models, executes all tests, and generates documentation. If any test fails, the PR cannot Merge, ensuring only validated code reaches production. This prevents broken models and data quality issues before they affect downstream reports.

Tool Comparison

Feature	Dbt Core	Dbt Cloud	Dataform (GCP)	SQLMesh
Open source	Yes	No	No	Yes
Testing	Built-in	Built-in	Limited	Built-in
Documentation	Auto-generated	Auto-generated	Basic	Basic
CI/CD	GitHub Actions	Built-in	Cloud Build	GitHub Actions
Lineage DAG	Yes	Yes	Limited	Yes
Incremental models	Yes	Yes	Yes	Yes
Version control	Git	Git	Git	Git
Cost	Free	$50+/mo/user	Per usage	Free

Common Errors

1. Circular Dependencies in Model DAG

Two models referencing each other creates a circular dependency that Dbt cannot resolve. Always maintain a strict Layered Architecture: staging -> intermediate -> marts -> reporting. Never let a mart model be referenced by a staging model.

2. Missing Source Freshness Configuration

Without source freshness tests, stale data flows silently into downstream models. Configure freshness blocks in source YAML with warn_after and error_after thresholds to alert when source data stops updating.

3. Overusing Incremental Models Prematurely

Incremental models add complexity. Start with table or view materialization. Only convert to incremental when the model takes more than 10 minutes to build and processes millions of rows.

4. Not Defining Tests on Critical Columns

Without tests, data quality issues Go undetected until a stakeholder reports inconsistent numbers. Define unique, not_null, accepted_values, and relationships tests on every primary key, foreign key, and business-critical column.

5. Ignoring Model Documentation

Undocumented models create knowledge silos. Write clear descriptions for every model and column. Use Dbt Docs to generate a searchable documentation site that analysts, data scientists, and business users can explore without reading SQL.

Practice Questions

1. What is the difference between Dbt staging, intermediate, and mart models? Staging models clean and type raw source data. Intermediate models apply business logic and create reusable transformations. Mart models are the final, presentation-ready datasets consumed by BI tools and analysts.

2. How do Dbt tests ensure data quality? Dbt tests are SQL assertions run after model builds. Built-in tests include unique, not_null, accepted_values, and relationships (foreign key checks). Custom tests use SQL for business-specific validation. Failed tests break CI/CD pipelines, preventing bad data from reaching production.

3. What is an incremental model and when should you use one? An incremental model only processes new or changed records since the last run, rather than rebuilding the entire dataset. Use incremental models when the source table has millions of rows and the full build takes more than 10 minutes.

4. How does Dbt lineage help with impact analysis? Dbt automatically generates a Directed Acyclic Graph showing dependencies between all models, sources, and exposures. Before modifying a model, analysts can see every downstream model, dashboard, and report that depends on it, preventing unexpected breakage.

5. Challenge: Build a complete Dbt project for an e-commerce data warehouse. Create staging models for orders, customers, payments, and products. Build intermediate models for customer order metrics and product performance. Create mart models for customer dimension, order fact, and product analytics. Implement tests for all primary keys, foreign keys, and business rules. Configure source freshness for all raw tables. Set up CI/CD with GitHub Actions that runs models and tests on every Pull Request.

Mini Project

Build a production-grade Dbt project from scratch. Design a layered model architecture with 4 staging models, 3 intermediate models, and 3 mart models. Implement source freshness tests on all raw tables. Define at least 20 tests covering unique, not null, relationships, and custom business rules. Configure incremental materialization on the largest fact model with a Merge Strategy. Generate and host Dbt documentation with complete column-level descriptions. Set up a CI/CD pipeline with GitHub Actions or Dbt Cloud that runs model builds, executes all tests, generates documentation, and deploys only if all tests pass. Write a project README documenting the architecture, conventions, and how to add a new model.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Data Analysis with Python for Business -- Complete Guide with Pandas, NumPy & Matplotlib

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Analytics