Data Lakes Explained — Lakehouse Architecture and Schema-on-Read
A data lake is a centralized repository that stores all data — structured, semi-structured, and unstructured — in its raw, original format, using object storage and schema-on-read for flexible analysis.
What You’ll Learn
By the end of this tutorial, you’ll understand what data lakes are, how lakehouse architecture combines lake and warehouse benefits, schema-on-read vs schema-on-write, and when to use a data lake vs a data warehouse.
Why Data Lakes Matter
Modern organizations collect data from dozens of sources — databases, APIs, IoT sensors, logs, images, social media. A data warehouse requires you to define schemas and transform data before loading. Data lakes store everything raw, giving data scientists and analysts flexibility to explore without upfront modeling. DodaTech’s Durga Antivirus Pro uses a data lake to store raw malware samples and threat intelligence feeds.
Data Lakes Learning Path
flowchart LR
A[Data Warehousing] --> B[Data Lakes]
B --> C{You Are Here}
C --> D[Object Storage]
C --> E[Schema-on-Read]
C --> F[Lakehouse]
D --> G[S3]
D --> H[ADLS]
F --> I[Delta Lake]
F --> J[Apache Iceberg]
What Is a Data Lake?
Think of a data lake as a giant storage depot where every box is kept as-is. You don’t sort, label, or organize boxes when they arrive. When someone needs something, they search the relevant boxes and extract what they need at that moment.
A data warehouse is like a library — everything is cataloged, organized on shelves, and easy to find. But preparing items for the library takes time and you can’t add things that don’t fit the cataloging system.
| | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw, native format | Processed, structured |
| Schema | Schema-on-read (apply at query time) | Schema-on-write (defined before load) |
| Storage cost | Low (object storage) | Higher (compute-optimized) |
| Data types | All types (text, JSON, images, video) | Structured and semi-structured |
| Users | Data scientists, data engineers | Analysts, business users |
| Agility | High — explore without modeling | Lower — schema changes are expensive |
Schema-on-Read vs Schema-on-Write
Schema-on-Write (Data Warehouse)
You define the schema before loading data. If a column is added in the source, you must alter the table before loading.
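To watch that enforcement happen, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. That substitution is an assumption made for illustration: SQLite is loosely typed and is not a warehouse engine, but it does reject a row whose shape does not match the declared table.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, date TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 100.50, '2026-06-01')")

# A row carrying a new 'region' value does not fit the declared schema.
try:
    conn.execute("INSERT INTO sales VALUES (2, 75.00, '2026-06-02', 'EU')")
    rejected = False
except sqlite3.OperationalError as exc:
    rejected = True
    print(f"Rejected at write time: {exc}")

# The schema has to change before the new shape can be loaded.
conn.execute("ALTER TABLE sales ADD COLUMN region TEXT")
conn.execute("INSERT INTO sales VALUES (2, 75.00, '2026-06-02', 'EU')")

rows = conn.execute(
    "SELECT id, amount, date, region FROM sales ORDER BY id"
).fetchall()
print(rows)
```

The first row ends up with no region at all, which is the usual cost of evolving a schema-on-write table: existing rows have to be backfilled or left empty.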
-- Schema-on-write: the table must exist before loading
CREATE TABLE sales (id INT, amount DECIMAL, date DATE);
INSERT INTO sales VALUES (1, 100.50, '2026-06-01');
-- A new column 'region' requires altering the table first:
ALTER TABLE sales ADD COLUMN region TEXT;
Schema-on-Read (Data Lake)
You store raw data as-is (JSON, Parquet, CSV). The schema is applied when you query it.
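The idea does not depend on any particular engine. As a rough sketch in plain Python, two consumers can read the same raw lines with different schemas. The sample records and the `read_with_schema` helper are hypothetical, made up for this illustration.

```python
import json

# Raw events exactly as they arrived; no schema was declared at write time.
raw_lines = [
    '{"event_type": "click", "user_id": "u1", "ms": "120"}',
    '{"event_type": "view", "user_id": "u2", "ms": "45", "browser": "Firefox"}',
    '{"event_type": "click", "user_id": "u1"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema` (field -> type) at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers read the same bytes with different schemas.
latency_view = read_with_schema(raw_lines, {"event_type": str, "ms": int})
browser_view = read_with_schema(raw_lines, {"user_id": str, "browser": str})
print(latency_view)
print(browser_view)
```

Fields a consumer does not ask for are ignored, and fields missing from a record come back as None, so neither reader needed the data to be reshaped before it was stored.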
# Schema-on-read: store raw, interpret at query time
# Raw data in S3: events/2026/06/01/events.json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()
df = spark.read.json("s3://datalake/events/2026/06/01/")
df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()
# The schema is inferred from the data itself at read time.
Lakehouse Architecture
The lakehouse combines the flexibility of data lakes with the reliability and performance of data warehouses.
flowchart TB
subgraph "Lakehouse Architecture"
direction TB
RAW[Raw Zone<br/>S3/ADLS] --> STAGE[Stage Zone<br/>Cleaned Parquet]
STAGE --> CURATED[Curated Zone<br/>Delta Tables]
CURATED --> ML[ML Pipelines]
CURATED --> BI[BI Dashboards]
CURATED --> SQL[SQL Analytics]
end
style RAW fill:#ff6b6b,color:#fff
style STAGE fill:#feca57,color:#333
style CURATED fill:#48dbfb,color:#333
Key Lakehouse Technologies
| Technology | Company | Key Feature |
|---|---|---|
| Delta Lake | Databricks | ACID transactions on Parquet, time travel, schema enforcement |
| Apache Iceberg | Netflix/Apple | Table format for huge datasets, partition evolution |
| Apache Hudi | Uber | Incremental processing, upserts on data lakes |
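Table formats such as Delta Lake track the state of a table through a transaction log kept next to the data files. The toy below sketches that general idea only, assuming a simplified log of numbered JSON commits. `TinyTableLog`, the `_log` folder, and the commit layout are hypothetical and are not the on-disk format of any real project.

```python
import json
import os
import tempfile


class TinyTableLog:
    """Toy commit log: each commit is a numbered JSON file of added/removed data files."""

    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def commit(self, add=(), remove=()):
        """Record a new table version; mode 'x' fails if that version already exists."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        path = os.path.join(self.log_dir, f"{version:08d}.json")
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def snapshot(self, version=None):
        """Replay commits up to `version` to list the live data files (time travel)."""
        live = set()
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                entry = json.load(f)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


with tempfile.TemporaryDirectory() as tmp:
    log = TinyTableLog(tmp)
    v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
    v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
    latest = log.snapshot()
    as_of_v0 = log.snapshot(version=0)
    print("latest:", latest)
    print("as of v0:", as_of_v0)
```

Readers only ever see files named by a completed commit, which is what makes a rewrite look atomic, and replaying fewer commits gives you the table as it was at an earlier version.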
Data Lake Storage: Raw Data
# datalake_operations.py
# Simulate data lake storage and schema-on-read
import json
import os
from datetime import datetime


class DataLake:
    def __init__(self, base_path="datalake"):
        self.base_path = base_path
        os.makedirs(f"{base_path}/raw", exist_ok=True)
        os.makedirs(f"{base_path}/curated", exist_ok=True)

    def ingest_raw(self, source, data, partition_by="date"):
        """Store data in raw format with partitioning."""
        partition = datetime.now().strftime("%Y/%m/%d")
        path = f"{self.base_path}/raw/{source}/{partition}"
        os.makedirs(path, exist_ok=True)
        filename = f"{datetime.now().strftime('%H%M%S')}.json"
        with open(f"{path}/{filename}", 'w') as f:
            json.dump(data, f)
        print(f"[INGEST] Raw data -> {path}/{filename}")
        return f"{path}/{filename}"

    def read_raw(self, path):
        """Schema-on-read: apply interpretation at read time."""
        with open(path) as f:
            raw = json.load(f)
        # Schema is applied at read time, not on ingest
        schema = {
            "event": str,
            "user_id": str,
            "value": float,
            "tags": list,
        }
        validated = {}
        for field, expected_type in schema.items():
            value = raw.get(field)
            if value is not None:
                try:
                    validated[field] = expected_type(value)
                except (ValueError, TypeError):
                    validated[field] = None
            else:
                validated[field] = None
        print(f"[READ] Schema-on-read applied: {validated}")
        return validated


# Simulate ingestion of different data types
lake = DataLake()

# Ingest clickstream event
click_event = {
    "event": "page_view",
    "user_id": "user_12345",
    "value": 1.0,
    "tags": ["homepage", "organic"],
    "browser": "Chrome",  # Extra field: not in schema, stored anyway
    "ip": "192.168.1.1",  # Another extra field
}
path = lake.ingest_raw("clickstream", click_event)

# Ingest sales event
sale_event = {
    "event": "purchase",
    "user_id": "user_12345",
    "value": 149.99,
    "tags": ["checkout", "promo_summer"],
    "order_id": "ORD-2026-0001",
}
lake.ingest_raw("sales", sale_event)

print("\n=== Schema-on-Read for Clickstream ===")
lake.read_raw(path)

print("\n=== Exploring Raw Data Lake ===")
for root, dirs, files in os.walk(lake.base_path):
    level = root.replace(lake.base_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    if files:
        sub_indent = ' ' * 2 * (level + 1)
        for file in files[:3]:
            size = os.path.getsize(os.path.join(root, file))
            print(f"{sub_indent}{file} ({size} bytes)")
Expected output:
[INGEST] Raw data -> datalake/raw/clickstream/2026/06/15/100000.json
[INGEST] Raw data -> datalake/raw/sales/2026/06/15/100001.json

=== Schema-on-Read for Clickstream ===
[READ] Schema-on-read applied: {'event': 'page_view', 'user_id': 'user_12345', 'value': 1.0, 'tags': ['homepage', 'organic']}

=== Exploring Raw Data Lake ===
datalake/
  raw/
    clickstream/
      2026/
        06/
          15/
            100000.json (136 bytes)
    sales/
      2026/
        06/
          15/
            100001.json (130 bytes)
  curated/
The date and time parts of the paths reflect when you run the script, and the order in which os.walk visits raw/ and curated/ can vary by platform.
Data Lake vs Data Warehouse: When to Use Which
flowchart TD
Q[What are you doing with the data?]
Q --> A[Analysis / Reporting]
Q --> B[ML / Exploration]
Q --> C[Real-time / Streaming]
A --> W[Data Warehouse]
B --> L[Data Lake]
C --> L
subgraph "Hybrid: Lakehouse"
H[Both — use Delta/Iceberg]
end
| Scenario | Use Data Lake | Use Data Warehouse |
|---|---|---|
| Exploratory analysis on new data | ✓ | ✗ |
| ML model training with raw features | ✓ | ✗ |
| Structured BI reports for executives | ✗ | ✓ |
| Schema flexibility (unknown requirements) | ✓ | ✗ |
| Low storage cost for petabytes | ✓ | ✗ |
| Fast, consistent SQL for analysts | ✗ | ✓ |
| ACID transactions on data | Only with a table format (Delta Lake, Iceberg, Hudi) | ✓ |
Common Data Lake Mistakes
1. Creating a “Data Swamp”
A data lake with no organization, no metadata, and no governance becomes a data swamp where nothing can be found. Implement partitioning, cataloging, and naming conventions from day one.
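A catalog does not have to start as a heavyweight product. As a minimal sketch, assuming the zone/source/year/month/day folder layout used by the simulator in this tutorial, you can walk the lake and record what each partition holds. The `build_catalog` helper is hypothetical.

```python
import os
import tempfile


def build_catalog(base_path):
    """Walk <base>/<zone>/<source>/<yyyy>/<mm>/<dd>/ and summarize each partition."""
    catalog = {}
    for root, _dirs, files in os.walk(base_path):
        parts = os.path.relpath(root, base_path).split(os.sep)
        if len(parts) != 5 or not files:
            continue  # not a leaf partition directory
        zone, source, year, month, day = parts
        key = (zone, source, f"{year}-{month}-{day}")
        size = sum(os.path.getsize(os.path.join(root, name)) for name in files)
        catalog[key] = {"files": len(files), "bytes": size}
    return catalog


with tempfile.TemporaryDirectory() as lake:
    partition = os.path.join(lake, "raw", "clickstream", "2026", "06", "15")
    os.makedirs(partition)
    for name in ("100000.json", "100001.json"):
        with open(os.path.join(partition, name), "wb") as f:
            f.write(b'{"event": "page_view"}')
    catalog = build_catalog(lake)

for key, stats in sorted(catalog.items()):
    print(key, stats)
```

Anything that does not match the naming convention is skipped, so the same walk doubles as a quick check for files that landed in the wrong place.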
2. Not Managing Permissions
Raw data may contain PII, financial details, or credentials. Apply access controls at the storage level (IAM policies, bucket policies) and restrict who can read raw zones.
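For an S3-based lake, one way to express "read-only, raw zone only" is an IAM policy document scoped to the raw prefix. The sketch below just builds that JSON. The bucket name, the `raw/` prefix, and the `raw_zone_read_policy` helper are placeholders, so treat it as a starting point and review it against your own security requirements.

```python
import json


def raw_zone_read_policy(bucket, raw_prefix="raw/"):
    """Build an IAM policy document granting read-only access to the raw zone."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListRawZoneOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{raw_prefix}*"}},
            },
            {
                "Sid": "ReadRawObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{raw_prefix}*",
            },
        ],
    }


# "example-datalake" is a placeholder bucket name.
policy = raw_zone_read_policy("example-datalake")
print(json.dumps(policy, indent=2))
```

Attach a policy like this only to the roles that genuinely need raw data. Everyone else can work from cleaned zones where sensitive fields have already been removed or masked.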
3. Ignoring Small Files Problem
Storing millions of tiny CSV files kills query performance, because every file adds listing and open overhead before any data is read. Coalesce small files into larger Parquet files (roughly 100 MB to 1 GB each) using Spark or scheduled compaction jobs.
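The mechanics of compaction are simple even though production jobs usually rewrite to Parquet with Spark. Here is a stdlib-only sketch that merges small newline-delimited JSON files into fewer, larger ones. The `compact` helper, the file names, and the tiny 40-byte target are assumptions chosen so the demo stays readable.

```python
import os
import tempfile


def compact(src_dir, dst_dir, target_bytes):
    """Concatenate small newline-delimited files into outputs of about target_bytes."""
    os.makedirs(dst_dir, exist_ok=True)
    outputs, buffer, buffered = [], [], 0

    def flush():
        nonlocal buffer, buffered
        if not buffer:
            return
        out_path = os.path.join(dst_dir, f"compacted-{len(outputs):05d}.jsonl")
        with open(out_path, "wb") as out:
            out.writelines(buffer)
        outputs.append(out_path)
        buffer, buffered = [], 0

    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as f:
            data = f.read()
        if not data.endswith(b"\n"):
            data += b"\n"  # keep records on separate lines after merging
        buffer.append(data)
        buffered += len(data)
        if buffered >= target_bytes:
            flush()
    flush()  # write whatever is left
    return outputs


with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "small"), os.path.join(tmp, "compacted")
    os.makedirs(src)
    for i in range(10):
        with open(os.path.join(src, f"{i:03d}.jsonl"), "wb") as f:
            f.write(b'{"id": %d}\n' % i)
    outputs = compact(src, dst, target_bytes=40)
    sizes = [os.path.getsize(p) for p in outputs]
    print(len(outputs), "files:", sizes)
```

Ten 10-byte inputs become three outputs here. In a real lake you would also delete or archive the small originals once the compacted files are verified.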
4. Writing Without Schema Validation
With no schema-on-write, bad data can silently enter the lake. Write validation scripts or use tools like Great Expectations to catch corrupted records.
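A validation step can be as small as a function that checks required fields and routes failures to a quarantine list. The sketch below is plain Python, not the Great Expectations API. The `REQUIRED` rules and both helper functions are hypothetical and reuse the event fields from the simulator in this tutorial.

```python
# Field name -> accepted type(s). These rules are illustrative assumptions.
REQUIRED = {"event": str, "user_id": str, "value": (int, float)}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in REQUIRED.items():
        if field not in record or record[field] is None:
            problems.append(f"missing {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            problems.append(f"bad type for {field}")
    return problems


def split_batch(records):
    """Route clean records onward and keep rejected ones with their reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined


batch = [
    {"event": "purchase", "user_id": "user_12345", "value": 149.99},
    {"event": "purchase", "user_id": None, "value": 20.0},
    {"event": "page_view", "user_id": "user_777", "value": "n/a"},
]
accepted, quarantined = split_batch(batch)
print(len(accepted), "accepted")
for item in quarantined:
    print("quarantined:", item["problems"])
```

Quarantining keeps the bad record and the reason together, so you can fix the source and replay it later without losing anything.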
5. No Data Retention Policies
Raw data accumulates fast. Without lifecycle policies, storage costs explode. Move cold data to cheaper tiers (S3 Glacier, Azure Archive) and delete duplicates.
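Cloud providers apply lifecycle rules for you, but it helps to reason about the policy before configuring it. This sketch assigns each object a tier by age. The 30-day and 365-day thresholds, the tier names, and the object keys are made-up assumptions, not recommendations.

```python
from datetime import date

# Hypothetical thresholds; tune them to your access patterns and compliance rules.
TIERS = [(30, "hot"), (365, "cold")]  # (max age in days, tier)


def tier_for(last_modified, today):
    """Pick a tier from object age; anything older than every threshold expires."""
    age_days = (today - last_modified).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return "expire"


today = date(2026, 6, 15)
objects = {
    "raw/clickstream/2026/06/01/100000.json": date(2026, 6, 1),
    "raw/clickstream/2025/12/01/100000.json": date(2025, 12, 1),
    "raw/clickstream/2024/06/01/100000.json": date(2024, 6, 1),
}
plan = {key: tier_for(modified, today) for key, modified in objects.items()}
for key, tier in plan.items():
    print(f"{tier:<7} {key}")
```

Running a dry-run plan like this against a listing of your bucket shows how much data each rule would move before you turn the rule on.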
6. Using Only Raw Zone
A lake with only raw data is hard to use. Implement a medallion architecture: Bronze (raw), Silver (cleaned), Gold (aggregated/curated).
Practice Questions
1. What is schema-on-read and how is it different from schema-on-write?
Schema-on-read applies structure when data is queried, not when it’s stored. Schema-on-write defines structure before loading. Data lakes use schema-on-read; warehouses use schema-on-write.
2. What is a data lakehouse?
A lakehouse combines data lake flexibility with warehouse reliability by adding ACID transactions, schema enforcement, and performance optimizations (via Delta Lake, Iceberg, or Hudi) on top of object storage.
3. When would you choose a data lake over a data warehouse?
When data types are diverse (text, images, JSON), schemas are unknown/unstable, storage cost is a priority, or data scientists need raw data for ML exploration.
4. What is the medallion architecture?
A layered approach: Bronze (raw ingested data), Silver (cleaned/deduplicated), Gold (aggregated, business-ready). Each layer increases quality and reduces volume.
5. Challenge: Design a data lake strategy for a healthcare company collecting patient vitals from IoT devices, lab results as PDFs, appointment logs, and insurance claims.
Partition by source/year/month/day. Store IoT data as Parquet (time-series optimized), PDFs as raw objects, appointment logs as JSON, claims as Delta tables for ACID compliance. Bronze = all raw; Silver = parsed/validated; Gold = joined patient records with PII restricted.
Mini Project: Medallion Architecture Simulator
# medallion_architecture.py
# Simulate Bronze → Silver → Gold transformation
import json
import hashlib

raw_events = [
    {"event": "login", "user": "alice", "ts": "2026-06-15T08:00:00", "ip": "192.168.1.1"},
    {"event": "purchase", "user": "alice", "ts": "2026-06-15T08:30:00", "amount": 49.99},
    {"event": "login", "user": "bob", "ts": "2026-06-15T09:00:00", "ip": "10.0.0.1"},
    {"event": "error", "user": None, "ts": "2026-06-15T09:15:00", "error": "null_pointer"},
    {"event": "purchase", "user": "bob", "ts": "2026-06-15T09:30:00", "amount": 199.99},
    {"event": "login", "user": "alice", "ts": "2026-06-15T10:00:00", "ip": "192.168.1.1"},
]


def bronze_zone(events):
    """Bronze: Raw ingestion with audit fields."""
    bronze = []
    for e in events:
        bronze.append({
            **e,
            "_ingested_at": "2026-06-15T10:00:00",
            "_source": "clickstream_api",
            "_row_hash": hashlib.md5(json.dumps(e, sort_keys=True).encode()).hexdigest()[:8],
        })
    return bronze


def silver_zone(bronze):
    """Silver: Cleaned, deduplicated, validated."""
    seen = set()
    silver = []
    for row in bronze:
        if row["_row_hash"] in seen:
            continue
        seen.add(row["_row_hash"])
        if row["user"] is None:
            continue
        silver.append(row)
    return silver


def gold_zone(silver):
    """Gold: Aggregated, business-ready."""
    user_metrics = {}
    for row in silver:
        user = row["user"]
        if user not in user_metrics:
            user_metrics[user] = {"logins": 0, "purchases": 0, "total_spent": 0.0}
        if row["event"] == "login":
            user_metrics[user]["logins"] += 1
        elif row["event"] == "purchase":
            user_metrics[user]["purchases"] += 1
            user_metrics[user]["total_spent"] += row["amount"]
    return user_metrics


b = bronze_zone(raw_events)
s = silver_zone(b)
g = gold_zone(s)

print("=== Bronze Zone ===")
print(f"Events: {len(b)}")
print(json.dumps(b, indent=2))

print("\n=== Silver Zone ===")
print(f"Events: {len(s)} (deduped + null user filtered)")
for row in s:
    print(f"  {row['event']:<12} {row['user']:<8} {row.get('amount', '-')}")

print("\n=== Gold Zone (User Metrics) ===")
for user, metrics in g.items():
    print(f"  {user}: {metrics['logins']} logins, {metrics['purchases']} purchases, ${metrics['total_spent']:.2f} spent")
Expected output:
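=== Bronze Zone ===
Events: 6
[
  {"event": "login", "user": "alice", "ts": "2026-06-15T08:00:00", "_ingested_at": "...", "_row_hash": "..."},
  ...
]

=== Silver Zone ===
Events: 5 (deduped + null user filtered)
  login        alice    -
  purchase     alice    49.99
  login        bob      -
  purchase     bob      199.99
  login        alice    -

=== Gold Zone (User Metrics) ===
  alice: 2 logins, 1 purchases, $49.99 spent
  bob: 1 logins, 1 purchases, $199.99 spent
The Bronze listing is abbreviated. Silver keeps 5 of the 6 events: the two alice logins carry different timestamps, so their row hashes differ and neither counts as a duplicate. Only the error event with no user is dropped.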
What’s Next
You now understand data lakes and lakehouse architecture! Next, learn how Apache Spark processes data lake data at scale, and explore stream processing for real-time data ingestion into lakes.
- Practice daily — Set up Bronze/Silver/Gold folders for your personal data
- Build a project — Use AWS S3 or MinIO (local) to create a small data lake
- Explore related topics — Check out Delta Lake documentation for ACID on lakes