Analytics Engineering with Dbt & SQL -- Complete Guide to Modern Data Transformation
In this tutorial, you'll learn about Analytics Engineering with Dbt & SQL. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Analytics engineering applies software engineering best practices -- version control, testing, documentation, and CI/CD -- to data transformation, bridging the gap between raw data ingestion and business-ready analytics datasets.
What You'll Learn
In this tutorial, you will learn how to use Dbt (data build tool) with SQL to transform raw data into staging, intermediate, and mart models, implement data quality tests with Dbt tests, generate automated documentation, manage incremental models for performance, and deploy Data Pipelines with CI/CD workflows using dbt Cloud or open-source Dbt Core.
Why It Matters
Most data teams spend 60-80% of their time on manual, untested, undocumented SQL transformations that break silently and produce inconsistent reports. Analytics engineering solves this by treating data transformations as code: version-controlled, tested, documented, and deployed through CI/CD. Companies adopting analytics engineering with Dbt reduce data inconsistency incidents by 70%, cut model development time by 50%, and enable analysts to contribute production data models without engineering support.
Real-World Use
Doda Browser's data team used Dbt to rebuild their analytics Stack from 200 untested SQL scripts into 45 modular, tested, and documented models. The Dbt project generates 120 automated tests that run on every deployment, catching data quality issues before they reach dashboards. Documentation is auto-generated and hosted on Dbt Docs, reducing onboarding time for new team members from 3 weeks to 3 days.
Dbt Data Transformation Architecture
flowchart TD
A[Raw Data Sources] --> B[Staging Models]
B --> C[Source Freshness Tests]
B --> D[Intermediate Models]
D --> E[Business Logic Transformations]
E --> F[Fact Tables]
E --> G[Dimension Tables]
F --> H[Mart Models]
G --> H
H --> I[BI Tools]
H --> J[Data Exports]
B --> K[Dbt Tests]
E --> K
F --> K
H --> K
K --> L[Dbt Docs]
K --> M[CI/CD Pipeline]
Dbt Project Structure
A well-organized Dbt project follows a Layered Architecture:
-- models/staging/stg_orders.SQL
-- Stage 1: Clean raw data, rename columns, cast types
WITH source AS (
SELECT * FROM {{ source('ecommerce', 'raw_orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
CAST(order_date AS DATE) AS order_date,
CAST(order_total AS DECIMAL(10,2)) AS order_amount,
status AS order_status,
payment_method,
shipping_address,
created_at::TIMESTAMP AS ingested_at
FROM source
WHERE order_date IS NOT NULL
)
SELECT * FROM renamed
Expected behavior: The staging model stg_orders creates a clean, typed view of raw order data. It renames columns for consistency, casts to appropriate data types, and filters out invalid records. The {{ source() }} function references a source defined in sources.yml for lineage tracking.
Intermediate and Mart Models
Build business logic in intermediate layers and final marts:
-- models/intermediate/int_customer_orders.SQL
-- Intermediate: Aggregate customer-level order metrics
WITH customer_orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customer_order_metrics AS (
SELECT
customer_id,
COUNT(DISTINCT order_id) AS total_orders,
COUNT(DISTINCT CASE
WHEN order_status = 'completed' THEN order_id
END) AS completed_orders,
SUM(CASE
WHEN order_status = 'completed' THEN order_amount
ELSE 0
END) AS total_revenue,
AVG(CASE
WHEN order_status = 'completed' THEN order_amount
END) AS avg_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
CURRENT_DATE - MAX(order_date) AS days_since_last_order,
DATEDIFF('day', MIN(order_date), MAX(order_date)) AS customer_lifetime_days
FROM customer_orders
GROUP BY customer_id
),
customer_status AS (
SELECT
*,
CASE
WHEN days_since_last_order <= 30 THEN 'Active'
WHEN days_since_last_order <= 90 THEN 'Recent'
WHEN days_since_last_order <= 365 THEN 'At Risk'
ELSE 'Lost'
END AS customer_segment,
CASE
WHEN total_revenue >= 1000 THEN 'High Value'
WHEN total_revenue >= 250 THEN 'Medium Value'
ELSE 'Low Value'
END AS value_tier
FROM customer_order_metrics
)
SELECT * FROM customer_status
-- models/marts/dim_customers.sql
-- Mart: Final customer dimension for BI tools
WITH customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
customer_metrics AS (
SELECT * FROM {{ ref('int_customer_orders') }}
)
SELECT
c.customer_id,
c.customer_name,
c.email,
c.signup_date,
cm.total_orders,
cm.completed_orders,
cm.total_revenue,
cm.avg_order_value,
cm.first_order_date,
cm.last_order_date,
cm.days_since_last_order,
cm.customer_segment,
cm.value_tier,
cm.customer_lifetime_days,
CURRENT_TIMESTAMP AS refreshed_at
FROM customers c
LEFT JOIN customer_metrics cm ON c.customer_id = cm.customer_id
Expected output: The mart model dim_customers provides a complete customer dimension for BI tools. It combines customer profile data with computed behavioral metrics. BI analysts query this model directly without writing complex joins or transformations.
Data Testing with Dbt
Define and run automated data quality tests:
# tests/schema.yml
version: 2
models:
- name: stg_orders
description: Cleaned order data from raw source
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: order_amount
tests:
- not_null
- accepted_values:
values: ["> 0"]
- name: dim_customers
description: Customer dimension with behavioral metrics
columns:
- name: customer_id
tests:
- unique
- not_null
- name: total_orders
tests:
- not_null
- name: total_revenue
tests:
- Dbt_utils.expression_is_true:
expression: ">= 0"
sources:
- name: ecommerce
description: Raw ecommerce database
tables:
- name: raw_orders
freshness:
warn_after: { count: 6, period: hour }
error_after: { count: 24, period: hour }
loaded_at_field: created_at
Expected behavior: Running dbt test executes all defined tests. unique and not_null ensure data integrity. relationships tests verify foreign key consistency. freshness tests alert if raw source data stops loading. Tests that fail break the CI/CD pipeline, preventing bad data from reaching production.
Incremental Models for Performance
Optimize large-volume transformations with incremental materialization:
-- models/marts/fct_orders.SQL
-- Incremental model: Only Process new/changed records
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_Strategy='Merge',
on_schema_change='sync_all_columns'
) }}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }}) - INTERVAL '3 days'
{% endif %}
),
payments AS (
SELECT * FROM {{ ref('stg_payments') }}
{% if is_incremental() %}
WHERE payment_date >= (SELECT MAX(order_date) FROM {{ this }}) - INTERVAL '3 days'
{% endif %}
)
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.order_amount,
o.order_status,
o.payment_method,
COALESCE(p.payment_status, 'pending') AS payment_status,
COALESCE(p.payment_amount, 0) AS payment_amount,
CURRENT_TIMESTAMP AS model_refreshed_at
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id
Expected behavior: On first run, the model processes all historical data. On subsequent runs, it only processes the last 3 days of orders (plus a 3-day lookback for late-arriving data). This reduces build time from hours to minutes for large datasets. The Merge Strategy updates existing records and inserts new ones.
Dbt Documentation and Lineage
Generate automated documentation with column-level lineage:
# Generate Dbt docs and serve locally
Dbt docs generate
Dbt docs serve
# Deploy to Dbt Cloud or hosting
Dbt docs generate --target prod
cp -R target/* docs/
# models/marts/dim_customers.yml
version: 2
models:
- name: dim_customers
description: |
Customer dimension table with behavioral metrics and segmentation.
Refreshed daily via incremental pipeline.
columns:
- name: customer_id
description: Unique customer identifier from source system
tests: [unique, not_null]
- name: customer_segment
description: |
Behavioral segment based on recency of last order:
- Active: ordered within 30 days
- Recent: ordered within 90 days
- At Risk: ordered within 365 days
- Lost: no order in over 365 days
- name: value_tier
description: |
Customer value based on total lifetime revenue:
- High Value: over $1,000
- Medium Value: $250-$1,000
- Low Value: under $250
Expected output: Running dbt docs generate produces a static site with lineage graphs (DAG), column-level documentation, test results, and source freshness reports. Stakeholders can explore the data model without reading SQL. The lineage Graph shows dependencies between models, enabling impact analysis before changes.
CI/CD Pipeline Configuration
Automate testing and deployment with GitHub Actions:
# .github/workflows/Dbt_ci.yml
name: Dbt CI Pipeline
on:
pull_request:
branches: [main]
paths:
- "models/**"
- "tests/**"
- "Dbt_project.yml"
jobs:
Dbt-ci:
runs-on: Ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-Python@v4
with:
Python-version: "3.10"
- name: Install Dbt
run: Pip install Dbt-postgres
- name: Dbt deps
run: Dbt deps
- name: Dbt seed
run: Dbt seed --target ci
- name: Dbt run
run: Dbt run --target ci
- name: Dbt test
run: Dbt test --target ci
- name: Dbt docs generate
run: Dbt docs generate --target ci
Expected behavior: Every Pull Request triggers the CI pipeline. It installs Dbt, runs seed data, builds all models, executes all tests, and generates documentation. If any test fails, the PR cannot Merge, ensuring only validated code reaches production. This prevents broken models and data quality issues before they affect downstream reports.
Tool Comparison
| Feature | Dbt Core | Dbt Cloud | Dataform (GCP) | SQLMesh |
|---|---|---|---|---|
| Open source | Yes | No | No | Yes |
| Testing | Built-in | Built-in | Limited | Built-in |
| Documentation | Auto-generated | Auto-generated | Basic | Basic |
| CI/CD | GitHub Actions | Built-in | Cloud Build | GitHub Actions |
| Lineage DAG | Yes | Yes | Limited | Yes |
| Incremental models | Yes | Yes | Yes | Yes |
| Version control | Git | Git | Git | Git |
| Cost | Free | $50+/mo/user | Per usage | Free |
Common Errors
1. Circular Dependencies in Model DAG
Two models referencing each other creates a circular dependency that Dbt cannot resolve. Always maintain a strict Layered Architecture: staging -> intermediate -> marts -> reporting. Never let a mart model be referenced by a staging model.
2. Missing Source Freshness Configuration
Without source freshness tests, stale data flows silently into downstream models. Configure freshness blocks in source YAML with warn_after and error_after thresholds to alert when source data stops updating.
3. Overusing Incremental Models Prematurely
Incremental models add complexity. Start with table or view materialization. Only convert to incremental when the model takes more than 10 minutes to build and processes millions of rows.
4. Not Defining Tests on Critical Columns
Without tests, data quality issues Go undetected until a stakeholder reports inconsistent numbers. Define unique, not_null, accepted_values, and relationships tests on every primary key, foreign key, and business-critical column.
5. Ignoring Model Documentation
Undocumented models create knowledge silos. Write clear descriptions for every model and column. Use Dbt Docs to generate a searchable documentation site that analysts, data scientists, and business users can explore without reading SQL.
Practice Questions
1. What is the difference between Dbt staging, intermediate, and mart models? Staging models clean and type raw source data. Intermediate models apply business logic and create reusable transformations. Mart models are the final, presentation-ready datasets consumed by BI tools and analysts.
2. How do Dbt tests ensure data quality?
Dbt tests are SQL assertions run after model builds. Built-in tests include unique, not_null, accepted_values, and relationships (foreign key checks). Custom tests use SQL for business-specific validation. Failed tests break CI/CD pipelines, preventing bad data from reaching production.
3. What is an incremental model and when should you use one? An incremental model only processes new or changed records since the last run, rather than rebuilding the entire dataset. Use incremental models when the source table has millions of rows and the full build takes more than 10 minutes.
4. How does Dbt lineage help with impact analysis? Dbt automatically generates a Directed Acyclic Graph showing dependencies between all models, sources, and exposures. Before modifying a model, analysts can see every downstream model, dashboard, and report that depends on it, preventing unexpected breakage.
5. Challenge: Build a complete Dbt project for an e-commerce data warehouse. Create staging models for orders, customers, payments, and products. Build intermediate models for customer order metrics and product performance. Create mart models for customer dimension, order fact, and product analytics. Implement tests for all primary keys, foreign keys, and business rules. Configure source freshness for all raw tables. Set up CI/CD with GitHub Actions that runs models and tests on every Pull Request.
Mini Project
Build a production-grade Dbt project from scratch. Design a layered model architecture with 4 staging models, 3 intermediate models, and 3 mart models. Implement source freshness tests on all raw tables. Define at least 20 tests covering unique, not null, relationships, and custom business rules. Configure incremental materialization on the largest fact model with a Merge Strategy. Generate and host Dbt documentation with complete column-level descriptions. Set up a CI/CD pipeline with GitHub Actions or Dbt Cloud that runs model builds, executes all tests, generates documentation, and deploys only if all tests pass. Write a project README documenting the architecture, conventions, and how to add a new model.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro