Data Lakes Explained — Lakehouse Architecture and Schema-on-Read


DodaTech | Updated Jun 15, 2026

A data lake is a centralized repository that stores all data — structured, semi-structured, and unstructured — in its raw, original format, using object storage and schema-on-read for flexible analysis.

What You’ll Learn

By the end of this tutorial, you’ll understand what data lakes are, how lakehouse architecture combines lake and warehouse benefits, how schema-on-read differs from schema-on-write, and when to use a data lake instead of a data warehouse.

Why Data Lakes Matter

Modern organizations collect data from dozens of sources — databases, APIs, IoT sensors, logs, images, social media. A data warehouse requires you to define schemas and transform data before loading. Data lakes store everything raw, giving data scientists and analysts flexibility to explore without upfront modeling. DodaTech’s Durga Antivirus Pro uses a data lake to store raw malware samples and threat intelligence feeds.

Data Lakes Learning Path


flowchart LR
  A[Data Warehousing] --> B[Data Lakes]
  B --> C{You Are Here}
  C --> D[Object Storage]
  C --> E[Schema-on-Read]
  C --> F[Lakehouse]
  D --> G[S3]
  D --> H[ADLS]
  F --> I[Delta Lake]
  F --> J[Apache Iceberg]

Prerequisites: Understanding of data warehousing and ETL pipelines. Familiarity with cloud object storage helps.

What Is a Data Lake?

Think of a data lake like a giant warehouse where you store every box as-is. You don’t sort, label, or organize boxes when they arrive. When someone needs to find something, they search relevant boxes and extract what they need at that moment.

A data warehouse is like a library — everything is cataloged, organized on shelves, and easy to find. But preparing items for the library takes time and you can’t add things that don’t fit the cataloging system.

| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw, native format | Processed, structured |
| Schema | Schema-on-read (applied at query time) | Schema-on-write (defined before load) |
| Storage cost | Low (object storage) | Higher (compute-optimized) |
| Data types | All types (text, JSON, images, video) | Structured and semi-structured |
| Users | Data scientists, data engineers | Analysts, business users |
| Agility | High (explore without modeling) | Lower (schema changes are expensive) |

Schema-on-Read vs Schema-on-Write

Schema-on-Write (Data Warehouse)

You define the schema before loading data. If a column is added in the source, you must alter the table before loading.

-- Schema-on-write: the table must exist before loading
CREATE TABLE sales (id INT, amount DECIMAL(10, 2), date DATE);
INSERT INTO sales VALUES (1, 100.50, '2026-06-01');
-- A new column 'region' requires altering the table before loading:
ALTER TABLE sales ADD COLUMN region TEXT;
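
To see schema-on-write reject data in practice, here is a minimal sketch using Python's built-in `sqlite3` module. The in-memory database merely stands in for a warehouse, and the `load_with_region` helper and sample row are illustrative assumptions, not part of any real pipeline.

```python
import sqlite3

# In-memory database standing in for a warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, date TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 100.50, '2026-06-01')")

def load_with_region(connection):
    """Try to load a row that carries a 'region' value the table may not have."""
    try:
        connection.execute(
            "INSERT INTO sales (id, amount, date, region) VALUES (?, ?, ?, ?)",
            (2, 75.00, "2026-06-02", "EMEA"),
        )
        return "loaded"
    except sqlite3.OperationalError as exc:
        return f"rejected: {exc}"

before = load_with_region(conn)   # fails: the schema has no 'region' column yet
conn.execute("ALTER TABLE sales ADD COLUMN region TEXT")
after = load_with_region(conn)    # succeeds once the schema has been altered

rows = conn.execute("SELECT id, region FROM sales ORDER BY id").fetchall()
print(before)
print(after)
print(rows)
```

The first load is rejected because the table definition is the gatekeeper. Only after the `ALTER TABLE` does the same row go in, and the older row simply reports `None` for the new column.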

Schema-on-Read (Data Lake)

You store raw data as-is (JSON, Parquet, CSV). The schema is applied when you query it.

# Schema-on-read: store raw, interpret at query time
# Raw data in S3: events/2026/06/01/events.json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()
df = spark.read.json("s3://datalake/events/2026/06/01/")
df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()
# The schema is inferred from the data itself at read time.
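
If you do not have a Spark cluster handy, the idea of inferring a schema at read time can be sketched in plain Python. This is a toy illustration only: the `raw_lines` sample and the `infer_schema` helper are hypothetical, and real engines such as Spark use more sophisticated type merging than the simple set of type names collected here.

```python
import json

# Hypothetical raw lines, as they might sit in a JSON-lines file in the lake.
raw_lines = [
    '{"event_type": "click", "user_id": "u1", "value": 1}',
    '{"event_type": "purchase", "user_id": "u2", "value": 19.99, "coupon": "SUMMER"}',
    '{"event_type": "click", "user_id": "u3"}',
]

def infer_schema(lines):
    """Union the fields seen across records and collect each field's type names."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

schema = infer_schema(raw_lines)
for field, types in schema.items():
    print(f"{field}: {types}")
```

Nothing was declared before the data was stored. The optional `coupon` field and the mixed `int`/`float` values for `value` only surface when the reader scans the records, which is both the flexibility and the risk of schema-on-read.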

Lakehouse Architecture

The lakehouse combines the flexibility of data lakes with the reliability and performance of data warehouses.


flowchart TB
  subgraph "Lakehouse Architecture"
    direction TB
    RAW[Raw Zone<br/>S3/ADLS] --> STAGE[Stage Zone<br/>Cleaned Parquet]
    STAGE --> CURATED[Curated Zone<br/>Delta Tables]
    CURATED --> ML[ML Pipelines]
    CURATED --> BI[BI Dashboards]
    CURATED --> SQL[SQL Analytics]
  end
  style RAW fill:#ff6b6b,color:#fff
  style STAGE fill:#feca57,color:#333
  style CURATED fill:#48dbfb,color:#333

Key Lakehouse Technologies

| Technology | Company | Key Feature |
|---|---|---|
| Delta Lake | Databricks | ACID transactions on Parquet, time travel, schema enforcement |
| Apache Iceberg | Netflix/Apple | Table format for huge datasets, partition evolution |
| Apache Hudi | Uber | Incremental processing, upserts on data lakes |
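
To build intuition for what "ACID transactions" and "time travel" mean on a lake, here is a toy sketch in plain Python. It is not how Delta Lake, Iceberg, or Hudi are implemented (those track changes through metadata and transaction logs rather than full copies); the `VersionedTable` class and `has_amount` check are hypothetical names used only for illustration.

```python
import copy

class VersionedTable:
    """Toy table that keeps every committed snapshot (not a real table format)."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, new_rows, validate=lambda row: True):
        """Append rows atomically: every row passes validation or nothing is written."""
        if not all(validate(row) for row in new_rows):
            raise ValueError("commit rejected: a row failed validation")
        snapshot = copy.deepcopy(self._versions[-1]) + copy.deepcopy(new_rows)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or an older one for 'time travel'."""
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])

def has_amount(row):
    """Stand-in for schema enforcement: every row needs an 'amount' field."""
    return "amount" in row

table = VersionedTable()
v1 = table.commit([{"id": 1, "amount": 10.0}], validate=has_amount)
v2 = table.commit([{"id": 2, "amount": 25.0}], validate=has_amount)

try:
    # The second row is invalid, so the whole batch is rejected.
    table.commit([{"id": 3, "amount": 5.0}, {"id": 4}], validate=has_amount)
except ValueError as exc:
    print(exc)

print(len(table.read()))    # rows in the latest version
print(len(table.read(v1)))  # rows as of version 1
```

The rejected batch leaves no partial rows behind, and earlier versions stay readable after later commits. Those two properties are what the table formats above add on top of plain files in object storage.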

Storing Raw Data in a Data Lake

# datalake_operations.py
# Simulate data lake storage and schema-on-read
import json
import os
from datetime import datetime

class DataLake:
    def __init__(self, base_path="datalake"):
        self.base_path = base_path
        os.makedirs(f"{base_path}/raw", exist_ok=True)
        os.makedirs(f"{base_path}/curated", exist_ok=True)

    def ingest_raw(self, source, data):
        """Store data in raw format, partitioned by ingestion date."""
        now = datetime.now()  # capture once so the path and filename agree
        path = f"{self.base_path}/raw/{source}/{now.strftime('%Y/%m/%d')}"
        os.makedirs(path, exist_ok=True)
        filename = f"{now.strftime('%H%M%S')}.json"
        with open(f"{path}/{filename}", 'w') as f:
            json.dump(data, f)
        print(f"[INGEST] Raw data -> {path}/{filename}")
        return f"{path}/{filename}"

    def read_raw(self, path):
        """Schema-on-read: apply interpretation at read time."""
        with open(path) as f:
            raw = json.load(f)
        # Schema is applied at read time — not on ingest
        schema = {
            "event": str,
            "user_id": str,
            "value": float,
            "tags": list,
        }
        validated = {}
        for field, expected_type in schema.items():
            value = raw.get(field)
            if value is not None:
                try:
                    validated[field] = expected_type(value)
                except (ValueError, TypeError):
                    validated[field] = None
            else:
                validated[field] = None
        print(f"[READ] Schema-on-read applied: {validated}")
        return validated

# Simulate ingestion of different data types
lake = DataLake()

# Ingest clickstream event
click_event = {
    "event": "page_view",
    "user_id": "user_12345",
    "value": 1.0,
    "tags": ["homepage", "organic"],
    "browser": "Chrome",  # Extra field — not in schema, stored anyway
    "ip": "192.168.1.1",  # Another extra field
}
path = lake.ingest_raw("clickstream", click_event)

# Ingest sales event
sale_event = {
    "event": "purchase",
    "user_id": "user_12345",
    "value": 149.99,
    "tags": ["checkout", "promo_summer"],
    "order_id": "ORD-2026-0001",
}
lake.ingest_raw("sales", sale_event)

print("\n=== Schema-on-Read for Clickstream ===")
lake.read_raw(path)

print("\n=== Exploring Raw Data Lake ===")
for root, dirs, files in os.walk(lake.base_path):
    level = root.replace(lake.base_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    if files:
        sub_indent = ' ' * 2 * (level + 1)
        for file in files[:3]:
            size = os.path.getsize(os.path.join(root, file))
            print(f"{sub_indent}{file} ({size} bytes)")

Expected output (dates and file names reflect when the script runs, and the order of raw/ and curated/ can vary by file system):

[INGEST] Raw data -> datalake/raw/clickstream/2026/06/15/100000.json
[INGEST] Raw data -> datalake/raw/sales/2026/06/15/100001.json

=== Schema-on-Read for Clickstream ===
[READ] Schema-on-read applied: {'event': 'page_view', 'user_id': 'user_12345', 'value': 1.0, 'tags': ['homepage', 'organic']}

=== Exploring Raw Data Lake ===
datalake/
  raw/
    clickstream/
      2026/
        06/
          15/
            100000.json (136 bytes)
    sales/
      2026/
        06/
          15/
            100001.json (130 bytes)
  curated/

Data Lake vs Data Warehouse — When to Use Which


flowchart TD
  Q[What are you doing with the data?]
  Q --> A[Analysis / Reporting]
  Q --> B[ML / Exploration]
  Q --> C[Real-time / Streaming]
  A --> W[Data Warehouse]
  B --> L[Data Lake]
  C --> L
  subgraph "Hybrid: Lakehouse"
    H[Both — use Delta/Iceberg]
  end

| Scenario | Use Data Lake | Use Data Warehouse |
|---|---|---|
| Exploratory analysis on new data | Yes | |
| ML model training with raw features | Yes | |
| Structured BI reports for executives | | Yes |
| Schema flexibility (unknown requirements) | Yes | |
| Low storage cost for petabytes | Yes | |
| Fast, consistent SQL for analysts | | Yes |
| ACID transactions on data | With Delta Lake | Yes |

Common Data Lake Mistakes

1. Creating a “Data Swamp”

A data lake with no organization, no metadata, and no governance becomes a data swamp where nothing can be found. Implement partitioning, cataloging, and naming conventions from day one.
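
As a rough sketch of what "partitioning, cataloging, and naming conventions" can look like, the snippet below builds predictable partition prefixes and keeps a tiny in-memory catalog. The `key=value` path style, the `partition_path` and `register` helpers, and the catalog fields are all assumptions chosen for illustration; production lakes usually rely on a dedicated catalog service.

```python
from datetime import date

def partition_path(zone, source, day):
    """Build a predictable, date-partitioned prefix for a dataset."""
    return f"{zone}/{source}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"

# A tiny in-memory catalog: dataset name -> descriptive metadata.
catalog = {}

def register(name, zone, source, owner, description):
    """Record where a dataset lives and who owns it so it can be found later."""
    catalog[name] = {
        "zone": zone,
        "source": source,
        "owner": owner,
        "description": description,
    }

register("clickstream_raw", "raw", "clickstream", "web-team", "Unmodified page events")
entry = catalog["clickstream_raw"]
path = partition_path(entry["zone"], entry["source"], date(2026, 6, 15))
print(path)
```

Because every writer goes through the same path builder and every dataset has a catalog entry with an owner, a newcomer can answer "what is this and where does it live?" without opening files at random.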

2. Not Managing Permissions

Raw data may contain PII, financial details, or credentials. Apply access controls at the storage level (IAM policies, bucket policies) and restrict who can read raw zones.

3. Ignoring the Small Files Problem

Storing millions of tiny CSV files kills query performance because every file adds listing and open overhead. Coalesce small files into larger Parquet files (100 MB to 1 GB) using Spark or scheduled compaction jobs.
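
The compaction idea can be sketched with nothing but the standard library. This version merges JSON-lines files rather than writing Parquet, and the `compact` function, the file naming, and the tiny 300-byte target are assumptions made so the example runs quickly; a real job would use Spark or a table format's built-in compaction and a far larger target size.

```python
import json
import tempfile
from pathlib import Path

def compact(src_dir, dest_dir, target_bytes):
    """Merge small JSON-lines files into fewer files of roughly target_bytes each."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    outputs, buffer, size = [], [], 0

    def flush():
        nonlocal buffer, size
        if buffer:
            out = dest_dir / f"part-{len(outputs):05d}.jsonl"
            out.write_text("".join(buffer))
            outputs.append(out)
            buffer, size = [], 0

    for small in sorted(src_dir.glob("*.jsonl")):
        text = small.read_text()
        buffer.append(text)
        size += len(text.encode("utf-8"))
        if size >= target_bytes:
            flush()
    flush()  # write whatever is left over
    return outputs

with tempfile.TemporaryDirectory() as tmp:
    src, dest = Path(tmp) / "small", Path(tmp) / "compacted"
    src.mkdir()
    for i in range(100):  # 100 tiny files, one record each
        (src / f"event-{i:03d}.jsonl").write_text(json.dumps({"id": i}) + "\n")
    merged = compact(src, dest, target_bytes=300)
    small_count = len(list(src.iterdir()))
    total_records = sum(len(p.read_text().splitlines()) for p in merged)
    print(f"{small_count} small files -> {len(merged)} compacted files")
    print(f"records preserved: {total_records}")
```

The record count is unchanged while the file count drops sharply, which is the whole point: query engines spend their time reading data instead of opening files.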

4. Writing Without Schema Validation

With no schema-on-write, bad data can silently enter the lake. Write validation scripts or use tools like Great Expectations to catch corrupted records.
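
A validation script of the kind mentioned above might look like the following minimal sketch. It does not use Great Expectations; the `REQUIRED_FIELDS` contract and the `validate` and `route` helpers are hypothetical, and the quarantine-instead-of-drop behavior is one design choice among several.

```python
# Hypothetical data contract: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"event": str, "user_id": str, "value": (int, float)}

def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected) or isinstance(record[field], bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

def route(records):
    """Split records into accepted ones and quarantined ones with their reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined

incoming = [
    {"event": "purchase", "user_id": "user_12345", "value": 149.99},
    {"event": "purchase", "user_id": "user_67890", "value": "N/A"},
    {"event": "page_view", "value": 1.0},
]
accepted, quarantined = route(incoming)
print(f"accepted={len(accepted)} quarantined={len(quarantined)}")
for item in quarantined:
    print(item["problems"])
```

Quarantining bad records with their reasons, rather than silently discarding them, keeps the raw zone complete while preventing corrupted rows from flowing into cleaned layers.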

5. No Data Retention Policies

Raw data accumulates fast. Without lifecycle policies, storage costs explode. Move cold data to cheaper tiers (S3 Glacier, Azure Archive) and delete duplicates.
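
The decision logic behind a retention policy can be sketched as a small function. The 90-day and 7-year thresholds, the `lifecycle_action` helper, and the sample object keys are assumptions for illustration; in practice you would usually configure the storage service's own lifecycle rules rather than run custom code.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real values depend on access patterns and compliance rules.
ARCHIVE_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=365 * 7)

def lifecycle_action(last_modified, now):
    """Decide what a retention job should do with an object of a given age."""
    age = now - last_modified
    if age >= DELETE_AFTER:
        return "delete"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"

now = datetime(2026, 6, 15)
objects = {
    "raw/clickstream/2026/06/01/events.json": datetime(2026, 6, 1),
    "raw/clickstream/2025/12/01/events.json": datetime(2025, 12, 1),
    "raw/clickstream/2018/01/01/events.json": datetime(2018, 1, 1),
}
plan = {key: lifecycle_action(modified, now) for key, modified in objects.items()}
for key, action in plan.items():
    print(f"{action:<8} {key}")
```

Writing the thresholds down as named constants makes the policy reviewable, which matters when retention periods are driven by legal or compliance requirements.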

6. Using Only a Raw Zone

A lake with only raw data is hard to use. Implement a medallion architecture: Bronze (raw), Silver (cleaned), Gold (aggregated/curated).

Practice Questions

1. What is schema-on-read and how is it different from schema-on-write?

Schema-on-read applies structure when data is queried, not when it’s stored. Schema-on-write defines structure before loading. Data lakes use schema-on-read; warehouses use schema-on-write.

2. What is a data lakehouse?

A lakehouse combines data lake flexibility with warehouse reliability by adding ACID transactions, schema enforcement, and performance optimizations (via Delta Lake, Iceberg, or Hudi) on top of object storage.

3. When would you choose a data lake over a data warehouse?

When data types are diverse (text, images, JSON), schemas are unknown/unstable, storage cost is a priority, or data scientists need raw data for ML exploration.

4. What is the medallion architecture?

A layered approach: Bronze (raw ingested data), Silver (cleaned/deduplicated), Gold (aggregated, business-ready). Each layer increases quality and reduces volume.

5. Challenge: Design a data lake strategy for a healthcare company collecting patient vitals from IoT devices, lab results as PDFs, appointment logs, and insurance claims.

Partition by source/year/month/day. Store IoT data as Parquet (time-series optimized), PDFs as raw objects, appointment logs as JSON, claims as Delta tables for ACID compliance. Bronze = all raw; Silver = parsed/validated; Gold = joined patient records with PII restricted.

Mini Project: Medallion Architecture Simulator

# medallion_architecture.py
# Simulate Bronze → Silver → Gold transformation
import json
import hashlib

raw_events = [
    {"event": "login", "user": "alice", "ts": "2026-06-15T08:00:00", "ip": "192.168.1.1"},
    {"event": "purchase", "user": "alice", "ts": "2026-06-15T08:30:00", "amount": 49.99},
    {"event": "login", "user": "bob", "ts": "2026-06-15T09:00:00", "ip": "10.0.0.1"},
    {"event": "error", "user": None, "ts": "2026-06-15T09:15:00", "error": "null_pointer"},
    {"event": "purchase", "user": "bob", "ts": "2026-06-15T09:30:00", "amount": 199.99},
    {"event": "login", "user": "alice", "ts": "2026-06-15T10:00:00", "ip": "192.168.1.1"},
]

def bronze_zone(events):
    """Bronze: Raw ingestion with audit fields."""
    bronze = []
    for e in events:
        bronze.append({
            **e,
            "_ingested_at": "2026-06-15T10:00:00",
            "_source": "clickstream_api",
            "_row_hash": hashlib.md5(json.dumps(e, sort_keys=True).encode()).hexdigest()[:8],
        })
    return bronze

def silver_zone(bronze):
    """Silver: Cleaned, deduplicated, validated."""
    seen = set()
    silver = []
    for row in bronze:
        if row["_row_hash"] in seen:
            continue
        seen.add(row["_row_hash"])
        if row["user"] is None:
            continue
        silver.append(row)
    return silver

def gold_zone(silver):
    """Gold: Aggregated, business-ready."""
    user_metrics = {}
    for row in silver:
        user = row["user"]
        if user not in user_metrics:
            user_metrics[user] = {"logins": 0, "purchases": 0, "total_spent": 0.0}
        if row["event"] == "login":
            user_metrics[user]["logins"] += 1
        elif row["event"] == "purchase":
            user_metrics[user]["purchases"] += 1
            user_metrics[user]["total_spent"] += row["amount"]
    return user_metrics

b = bronze_zone(raw_events)
s = silver_zone(b)
g = gold_zone(s)

print("=== Bronze Zone ===")
print(f"Events: {len(b)}")
print(json.dumps(b, indent=2))

print("\n=== Silver Zone ===")
print(f"Events: {len(s)} (deduped + null user filtered)")
for row in s:
    print(f"  {row['event']:<12} {row['user']:<8} {row.get('amount', '-')}")

print("\n=== Gold Zone (User Metrics) ===")
for user, metrics in g.items():
    print(f"  {user}: {metrics['logins']} logins, {metrics['purchases']} purchases, ${metrics['total_spent']:.2f} spent")

Expected output (Bronze listing abbreviated; the row hash shown is a placeholder):

=== Bronze Zone ===
Events: 6
[
  {"event": "login", "user": "alice", "ts": "2026-06-15T08:00:00", "_ingested_at": "...", "_row_hash": "a1b2c3d4"},
  ...
]

=== Silver Zone ===
Events: 5 (deduped + null user filtered)
  login        alice    -
  purchase     alice    49.99
  login        bob      -
  purchase     bob      199.99
  login        alice    -

=== Gold Zone (User Metrics) ===
  alice: 2 logins, 1 purchases, $49.99 spent
  bob: 1 logins, 1 purchases, $199.99 spent

What’s Next

You now understand data lakes and lakehouse architecture! Next, learn how Apache Spark processes data lake data at scale, and explore stream processing for real-time data ingestion into lakes.

  • Practice daily — Set up Bronze/Silver/Gold folders for your personal data
  • Build a project — Use AWS S3 or MinIO (local) to create a small data lake
  • Explore related topics — Check out Delta Lake documentation for ACID on lakes

Remember: every expert was once a beginner. Keep coding!
