Learn Data Lakehouse Architecture — Delta Lake, Iceberg, and Hudi Explained

Data Lakehouse Architecture — Delta Lake, Iceberg, and Hudi Explained

DodaTech Updated Jun 20, 2026 11 min read

A data lakehouse combines the schema flexibility and cheap storage of a data lake with the ACID transactions and performance of a data warehouse — no separate warehouse needed.

What You’ll Learn

By the end of this tutorial, you’ll understand how Delta Lake, Apache Iceberg, and Apache Hudi bring ACID transactions, schema evolution, and time travel to data lakes, and how to choose between them.

Why the Lakehouse Matters

Traditional data warehouses (Snowflake, Redshift) give you ACID but cost a fortune for petabyte-scale storage. Data lakes (S3 + Spark) are cheap but lack transactions — you get corrupted data from concurrent writes, no rollbacks, and no schema enforcement. The lakehouse solves this. Doda Browser uses a Delta Lake-based lakehouse to process 50TB+ of analytics data while keeping storage costs 80% lower than a pure warehouse.

Lakehouse Learning Path


flowchart LR
  A[Data Lakes] --> B[Data Lakehouse]
  B --> C{You Are Here}
  C --> D[Delta Lake]
  C --> E[Apache Iceberg]
  C --> F[Apache Hudi]
  D --> G[ACID + Time Travel]
  E --> H[Table Formats]
  F --> I[Upsert/Merge]

Prerequisites: Data Lakes, Apache Spark basics, Data Warehousing concepts.

What Is a Data Lakehouse?

Picture a library. A data lake is like dumping all books on the floor — cheap, flexible, but impossible to find anything reliably. A data warehouse is a librarian who organizes every book on specific shelves — fast to find, but expensive to maintain. A lakehouse is a self-organizing library where books stay on the floor (cheap storage) but a magical catalog tracks exactly where each book is, who’s reading it, and what changed.

The three leading lakehouse storage formats are:

Format	Creator	Key Innovation	Best For
Delta Lake	Databricks	Transaction log + Spark-native	Databricks ecosystem
Apache Iceberg	Netflix	Open table format, wide engine support	Multi-engine (Spark, Flink, Trino)
Apache Hudi	Uber	Incremental processing, upserts	Streaming ingestion

Delta Lake Deep Dive

Delta Lake adds a transaction log (stored as JSON files in _delta_log/) on top of Parquet data. Every operation — insert, update, delete — writes a new entry to this log. Readers use the latest log entry to see the current state.

Delta Lake Setup

# delta_setup.py
# Initialize Delta Lake with Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeTutorial") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write data as Delta table
data = [
    ("order_1", "pending", 250.0, "2026-06-01"),
    ("order_2", "completed", 100.0, "2026-06-01"),
    ("order_3", "cancelled", 75.0, "2026-06-02"),
]
columns = ["order_id", "status", "amount", "date"]
df = spark.createDataFrame(data, columns)

df.write.format("delta").mode("overwrite").save("/tmp/delta_orders")
print("Delta table created at /tmp/delta_orders")

# Verify
read_df = spark.read.format("delta").load("/tmp/delta_orders")
read_df.show()

Expected output:

Delta table created at /tmp/delta_orders
+--------+---------+------+----------+
|order_id|   status|amount|      date|
+--------+---------+------+----------+
| order_1|  pending| 250.0|2026-06-01|
| order_2|completed| 100.0|2026-06-01|
| order_3|cancelled|  75.0|2026-06-02|
+--------+---------+------+----------+

ACID Transactions on a Data Lake

Delta Lake ensures ACID using optimistic concurrency control:

# delta_acid.py
# Demonstrate ACID transaction with concurrent writes
from pyspark.sql import SparkSession
import threading
import time

spark = SparkSession.builder \
    .appName("DeltaACID") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create initial table
data = [("user_1", 100), ("user_2", 200)]
df = spark.createDataFrame(data, ["user", "balance"])
df.write.format("delta").mode("overwrite").save("/tmp/delta_balances")
print("Initial balance table created")

# Simulate two concurrent transfers
def transfer_from_user1():
    spark_session = SparkSession.builder \
        .appName("Transfer1") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    
    for _ in range(10):
        df = spark_session.read.format("delta").load("/tmp/delta_balances")
        df.createOrReplaceTempView("balances")
        spark_session.sql("""
            MERGE INTO delta.`/tmp/delta_balances` t
            USING (SELECT 'user_1' as user, -10 as delta) s
            ON t.user = s.user
            WHEN MATCHED THEN UPDATE SET t.balance = t.balance + s.delta
        """)
        time.sleep(0.1)
    print("Transfer from user_1 complete")

def transfer_from_user2():
    spark_session = SparkSession.builder \
        .appName("Transfer2") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    
    for _ in range(10):
        df = spark_session.read.format("delta").load("/tmp/delta_balances")
        df.createOrReplaceTempView("balances")
        spark_session.sql("""
            MERGE INTO delta.`/tmp/delta_balances` t
            USING (SELECT 'user_2' as user, -10 as delta) s
            ON t.user = s.user
            WHEN MATCHED THEN UPDATE SET t.balance = t.balance + s.delta
        """)
        time.sleep(0.1)
    print("Transfer from user_2 complete")

t1 = threading.Thread(target=transfer_from_user1)
t2 = threading.Thread(target=transfer_from_user2)
t1.start()
t2.start()
t1.join()
t2.join()

# Verify final balances
final = spark.read.format("delta").load("/tmp/delta_balances")
final.show()

Expected output:

Initial balance table created
Transfer from user_1 complete
Transfer from user_2 complete
+------+-------+
|  user|balance|
+------+-------+
|user_1|      0|
|user_2|    100|
+------+-------+

Schema Evolution and Time Travel

# delta_evolution.py
# Schema evolution + time travel in Delta Lake
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaEvolution") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Version 1: Original schema
df1 = spark.createDataFrame([(1, "alice")], ["id", "name"])
df1.write.format("delta").mode("overwrite").save("/tmp/delta_evolution")
print("Version 1 written")

# Version 2: Add column (schema evolution)
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
df2 = spark.createDataFrame([(2, "bob", "admin")], ["id", "name", "role"])
df2.write.format("delta").mode("append").save("/tmp/delta_evolution")
print("Version 2 written (schema evolved)")

# Version 3: Update data
spark.sql("""
    UPDATE delta.`/tmp/delta_evolution`
    SET role = 'super_admin'
    WHERE id = 2
""")
print("Version 3 written (update)")

# Time travel: read version 1
v1 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_evolution")
print("Version 1 (time travel):")
v1.show()

# Time travel: read version 2
v2 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta_evolution")
print("Version 2 (time travel):")
v2.show()

# Current version
current = spark.read.format("delta").load("/tmp/delta_evolution")
print("Current version:")
current.show()

# Describe history
history = spark.sql("DESCRIBE HISTORY delta.`/tmp/delta_evolution`")
history.select("version", "operation", "operationParameters").show(truncate=False)

Expected output:

Version 1 written
Version 2 written (schema evolved)
Version 3 written (update)
Version 1 (time travel):
+---+-----+
| id| name|
+---+-----+
|  1|alice|
+---+-----+

Version 2 (time travel):
+---+-----+-----+
| id| name| role|
+---+-----+-----+
|  1|alice| null|
|  2|  bob|admin|
+---+-----+-----+

Current version:
+---+-----+-----------+
| id| name|       role|
+---+-----+-----------+
|  1|alice|       null|
|  2|  bob|super_admin|
+---+-----+-----------+

+-------+---------+----------------------------------------------------+
|version|operation|operationParameters                                  |
+-------+---------+----------------------------------------------------+
|2      |WRITE    |{mode -> Append, partitionBy -> []}                  |
|1      |WRITE    |{mode -> Append, partitionBy -> []}                  |
|0      |WRITE    |{mode -> Overwrite, partitionBy -> []}               |
+-------+---------+----------------------------------------------------+

Apache Iceberg Overview

Iceberg treats an entire table as a logical unit with a manifest-based architecture. It catalogs data files in a tree of manifest files, making it efficient for engines to prune partitions without reading metadata.

-- Iceberg SQL example (Trino)
CREATE TABLE iceberg.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
);

INSERT INTO iceberg.analytics.orders VALUES
    (1001, 501, 250.00, DATE '2026-06-01'),
    (1002, 502, 45.50, DATE '2026-06-02');

-- Time travel snapshot
SELECT * FROM iceberg.analytics.orders
    FOR SYSTEM_TIME AS OF TIMESTAMP '2026-06-01 12:00:00';

-- Schema evolution: add column
ALTER TABLE iceberg.analytics.orders ADD COLUMN status VARCHAR;

Apache Hudi Overview

Hudi (Hadoop Upserts Deletes and Incrementals) specializes in streaming ingestion with upserts and incremental queries.

# hudi_setup.py
# Apache Hudi basic write with upserts
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HudiTutorial") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .getOrCreate()

# Initial batch
df = spark.createDataFrame([
    ("order_1", "pending", 250.0, "2026-06-01"),
    ("order_2", "completed", 100.0, "2026-06-01"),
], ["order_id", "status", "amount", "date"])

df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.table.name", "hudi_orders") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("overwrite") \
    .save("/tmp/hudi_orders")
print("Hudi table created")

# Upsert: change order_1 status and add order_3
upsert_df = spark.createDataFrame([
    ("order_1", "completed", 250.0, "2026-06-02"),
    ("order_3", "pending", 300.0, "2026-06-02"),
], ["order_id", "status", "amount", "date"])

upsert_df.write.format("hudi") \
    .option("hoodie.table.name", "hudi_orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.table.name", "hudi_orders") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("/tmp/hudi_orders")

# Read final state
result = spark.read.format("hudi").load("/tmp/hudi_orders")
result.show()

Expected output:

Hudi table created
+--------+---------+------+----------+
|order_id|   status|amount|      date|
+--------+---------+------+----------+
| order_1|completed| 250.0|2026-06-02|
| order_2|completed| 100.0|2026-06-01|
| order_3|  pending| 300.0|2026-06-02|
+--------+---------+------+----------+

Lakehouse Architecture


flowchart TB
  subgraph "Storage Layer"
    S3[Object Store
S3 / ADLS / GCS]
    PARQ[Parquet Files]
  end
  subgraph "Table Format Layer"
    DL[Delta Lake
Transaction Log]
    ICE[Iceberg
Manifest Tree]
    HUD[Hudi
Timeline Service]
  end
  subgraph "Compute Engines"
    SP[Apache Spark]
    FL[Apache Flink]
    TR[Trino / Presto]
    HS[Hive / Spark SQL]
  end
  S3 --> PARQ
  PARQ --> DL
  PARQ --> ICE
  PARQ --> HUD
  DL --> SP
  DL --> FL
  ICE --> SP
  ICE --> TR
  ICE --> FL
  HUD --> SP
  HUD --> FL

Lakehouse vs Warehouse vs Data Lake

Feature	Data Lake	Data Warehouse	Lakehouse
Storage cost	Low ($/TB/mo)	High	Low
ACID transactions	No	Yes	Yes
Schema enforcement	None	Strict	Flexible
BI tool support	Poor	Excellent	Good
ML/AI support	Excellent	Poor	Excellent
Time travel	No	Limited	Yes
Upsert/merge	Manual	Yes	Yes

Common Lakehouse Mistakes

1. Not Compacting Small Files

Streaming jobs create thousands of tiny Parquet files. Without compaction (OPTIMIZE in Delta, bin-pack in Hudi), query performance degrades. Run OPTIMIZE delta.\/path`` or set write triggers.

2. Ignoring Partitioning Strategy

Without proper partitioning, every query scans all data. Partition on high-cardinality columns like date but avoid over-partitioning (more than 10k partitions slows metadata operations).

3. Using Delta Lake Without Vacuum

Delta Lake retains all transaction logs forever. Old versions consume storage. Run VACUUM delta.\/path` RETAIN 168 HOURS` to clean versions older than 7 days.

4. Missing Schema Evolution on Writes

By default, Delta Lake rejects writes with new columns. Set mergeSchema: true or use autoMerge — otherwise your pipeline breaks when upstream adds a column.

5. Choosing the Wrong Format for the Workload

Use Delta Lake for Databricks-heavy stacks, Iceberg for multi-engine environments (Spark + Flink + Trino), and Hudi for streaming ingest with frequent upserts. Picking wrong means rewriting pipelines later.

6. Not Cleaning Up Orphan Files

Failed writes leave orphan files. Iceberg has remove_orphan_files, Delta has FSCK REPAIR TABLE. Run these weekly to avoid storage waste.

7. Overusing Time Travel for Long Periods

Time travel keeps all historical versions. If you retain 90 days of history for a 10TB table, you pay for 900TB of storage. Use VACUUM to enforce retention limits.

Practice Questions

1. What problem does a lakehouse solve that a data lake doesn’t?

A data lake lacks ACID transactions, so concurrent writes can corrupt data, partial failures leave garbage, and rollbacks are impossible. The lakehouse adds a transactional metadata layer that ensures atomicity, consistency, isolation, and durability on cheap object storage.

2. What are the three main lakehouse formats and their creators?

Delta Lake (Databricks), Apache Iceberg (Netflix), and Apache Hudi (Uber). Delta uses a transaction log, Iceberg uses a manifest tree, and Hudi uses a timeline service.

3. How does time travel work in Delta Lake?

Each write creates a new version in the transaction log. Readers can query any version using versionAsOf or timestampAsOf options. Old versions are retained until VACUUM removes them.

4. When would you choose Apache Iceberg over Delta Lake?

Iceberg is format-agnostic (works with Parquet, Avro, ORC) and supports more query engines (Spark, Flink, Trino, Presto, Hive). Choose Iceberg when you need engine flexibility or are not on Databricks.

5. Challenge: Design a lakehouse pipeline that ingests 100GB/hour of IoT sensor data with late-arriving updates.

Use Hudi for its upsert performance on streaming data. Partition by hour(date) and use MERGE INTO for late arrivals. Compact every 100 commits. Set hoodie.cleaner.policy to keep 7 days of history. Store raw data in S3, query with Trino.

Mini Project: Delta Lake CDC Pipeline

# cdc_pipeline.py
# Simulate a CDC pipeline using Delta Lake's MERGE
from pyspark.sql import SparkSession
from datetime import datetime

spark = SparkSession.builder \
    .appName("CDCPipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Target table (existing records)
target_data = [
    ("user_1", "alice@email.com", "active", "2026-01-01"),
    ("user_2", "bob@email.com", "active", "2026-01-01"),
]
target = spark.createDataFrame(target_data, ["user_id", "email", "status", "created_at"])
target.write.format("delta").mode("overwrite").save("/tmp/delta_users")
print("Target table initialized")

# CDC stream (changes from source)
cdc_data = [
    ("user_1", "alice@newdomain.com", "active"),   # email change
    ("user_3", "carol@email.com", "active"),        # new user
    ("user_2", "bob@email.com", "inactive"),        # status change
]
cdc = spark.createDataFrame(cdc_data, ["user_id", "email", "status"])

cdc.createOrReplaceTempView("cdc_changes")

# MERGE: insert new, update existing
spark.sql("""
    MERGE INTO delta.`/tmp/delta_users` t
    USING cdc_changes s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET
        t.email = s.email,
        t.status = s.status
    WHEN NOT MATCHED THEN INSERT
        (user_id, email, status, created_at)
        VALUES (s.user_id, s.email, s.status, '2026-06-20')
""")

print("CDC merge complete:")
result = spark.read.format("delta").load("/tmp/delta_users")
result.show()

# Inspect transaction log
history = spark.sql("DESCRIBE HISTORY delta.`/tmp/delta_users`")
history.select("version", "operation", "operationParameters").show(truncate=False)

Expected output:

Target table initialized
CDC merge complete:
+-------+------------------+--------+----------+
|user_id|             email|  status|created_at|
+-------+------------------+--------+----------+
| user_1|alice@newdomain...|  active|2026-01-01|
| user_2|   bob@email.com|inactive|2026-01-01|
| user_3|  carol@email.com|  active|2026-06-20|
+-------+------------------+--------+----------+

Related Concepts

Data Lakes Guide

Data Warehousing Guide

Apache Spark Guide

Modern Data Warehousing

FAQ

Can a lakehouse replace a data warehouse?

For most analytics workloads, yes. The lakehouse provides ACID, SQL access, and BI tool compatibility. However, for workloads requiring sub-second query latency on star schemas, a dedicated warehouse like Snowflake still performs better.

What storage formats does a lakehouse use?

Parquet is the default for all three formats (Delta, Iceberg, Hudi). Iceberg also supports Avro and ORC. Data is stored in object storage (S3, ADLS, GCS) while metadata is maintained separately.

Do I need Apache Spark for a lakehouse?

Spark is the most common compute engine, but Iceberg and Hudi support Trino, Presto, Flink, and Hive. You can query Iceberg tables with DuckDB or use Athena on top of any format.

How does schema evolution work in practice?

Delta Lake auto-merges schemas when mergeSchema is true. Iceberg supports ADD COLUMN, DROP COLUMN, and RENAME COLUMN without rewriting data. Hudi handles schema changes through its timeline service.

What’s Next

You now understand lakehouse architecture and the three main formats. Next, explore modern data warehousing with Snowflake and BigQuery, then learn how to stream process data alongside your lakehouse.

Practice daily — Create a Delta Lake table from a CSV, evolve the schema, and time-travel to an older version
Build a project — Set up a lakehouse on S3 with Apache Iceberg and query it with Trino
Compare formats — Run the same workload on Delta, Iceberg, and Hudi to see performance differences

Remember: every expert was once a beginner. Keep building!

Previous Data Modeling Guide — Kimball, Inmon, Star Schema, and Slowly Changing Dimensions Next Data Pipeline Orchestration — Airflow, Prefect, and Dagster Guide

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering