Learn Big: Real-Time Analytics Explained — Streaming Architectures, Lambda vs Kappa & Dashboards

Real-Time Analytics Explained — Streaming Architectures, Lambda vs Kappa & Dashboards

DodaTech Updated Jun 15, 2026 6 min read

Real-time analytics processes data as it arrives — providing insights within seconds or milliseconds instead of hours.

What You’ll Learn

In this tutorial, you’ll learn streaming architectures (Lambda vs Kappa), real-time dashboards, alerting and anomaly detection, and how to build a real-time analytics pipeline with Kafka and ClickHouse.

Why It Matters

In a world moving at internet speed, waiting for batch reports means missed opportunities. Real-time analytics powers fraud detection (detect within milliseconds), operational monitoring (CPU spikes → auto-scale), and user-facing dashboards.

Real-World Use

When you check your DoorDash order status, the map updates in real time. This requires processing GPS events, updating order state, and pushing updates to your browser — all within seconds. Durga Antivirus Pro uses real-time analytics to detect threats as they emerge across millions of endpoints.


graph TD
  subgraph "Lambda Architecture"
    A[Data Stream] --> B[Speed Layer
Real-time]
    A --> C[Batch Layer
Historical]
    B --> D[Merged View]
    C --> D
  end
  subgraph "Kappa Architecture"
    E[Data Stream] --> F[Stream Processor]
    F --> G[Real-time View]
    F --> H[Long-term Storage]
  end
  style B fill:#4f46e5,color:#fff
  style F fill:#4f46e5,color:#fff

Lambda vs Kappa Architecture

Lambda Architecture

Two parallel pipelines: a speed layer for low-latency (seconds) and a batch layer for accurate, comprehensive results (hours). Results are merged in a serving layer.

Layer	Technology	Latency	Accuracy
Speed	Flink, Spark Streaming	Seconds	Approximate
Batch	Hadoop, Spark Batch	Hours	Exact
Serving	Druid, Cassandra	“Real-time” query	Merged

Problem: Maintaining two separate codebases that produce compatible results is complex and error-prone.

Kappa Architecture

A single stream processing pipeline handles everything. Stream processor maintains state for real-time results, and can reprocess historical data by replaying the stream.

Component	Technology
Message broker	Kafka (immutable log)
Stream processor	Flink, ksqlDB
Serving database	ClickHouse, Pinot, Druid
Dashboard	Grafana, Metabase, Superset

# Simulated Kappa architecture pipeline
import time
import random
import json
from datetime import datetime

class StreamEvent:
    def __init__(self, event_type, value):
        self.event_type = event_type
        self.value = value
        self.timestamp = datetime.now()

class KappaPipeline:
    def __init__(self):
        self.state = {}  # Key: metric name, Value: rolling stats
        self.stream = []

    def ingest(self, event):
        self.stream.append(event)
        key = event.event_type
        if key not in self.state:
            self.state[key] = {"count": 0, "sum": 0, "min": float("inf"), "max": float("-inf")}
        s = self.state[key]
        s["count"] += 1
        s["sum"] += event.value
        s["min"] = min(s["min"], event.value)
        s["max"] = max(s["max"], event.value)

    def get_stats(self, event_type):
        s = self.state.get(event_type)
        if not s:
            return None
        return {
            **s,
            "avg": round(s["sum"] / s["count"], 2) if s["count"] else 0,
        }

pipeline = KappaPipeline()

# Simulate streaming events
print("Ingesting events...")
for i in range(1, 101):
    event = StreamEvent(
        random.choice(["page_view", "purchase", "click"]),
        random.gauss(50, 20)
    )
    pipeline.ingest(event)

print("\nReal-time stats:")
for event_type in ["page_view", "purchase", "click"]:
    stats = pipeline.get_stats(event_type)
    print(f"  {event_type}: count={stats['count']}, avg=${stats['avg']}, max=${stats['max']}")

Expected output:

Ingesting events...

Real-time stats:
  page_view: count=35, avg=$49.23, max=$95.12
  purchase: count=28, avg=$52.87, max=$98.44
  click: count=37, avg=$48.15, max=$91.76

Real-Time Dashboard Architecture

A typical real-time dashboard stack:

Data Sources → Kafka → Stream Processor → Serving DB → Dashboard
    (apps)     (buffer)    (Flink/ksqlDB)   (ClickHouse)   (Grafana)

Example: Kafka + ClickHouse + Grafana

-- ClickHouse table for real-time analytics
CREATE TABLE events (
    event_time DateTime,
    event_type String,
    user_id String,
    value Float64
) ENGINE = MergeTree()
ORDER BY (event_time, event_type);

-- Materialized view for per-second aggregations
CREATE MATERIALIZED VIEW events_per_sec
ENGINE = SummingMergeTree()
ORDER BY (event_time, event_type)
AS SELECT
    toStartOfSecond(event_time) as event_time,
    event_type,
    count() as count,
    sum(value) as total_value
FROM events
GROUP BY event_time, event_type;

-- Query for dashboard (last 1 hour, refreshed every 5 seconds)
SELECT
    toStartOfMinute(event_time) as minute,
    event_type,
    sum(count) as events
FROM events_per_sec
WHERE event_time > now() - INTERVAL 1 HOUR
GROUP BY minute, event_type
ORDER BY minute;

Alerting and Anomaly Detection

Real-time alerts trigger when metrics cross thresholds or deviate from expected patterns.

import statistics

class AnomalyDetector:
    def __init__(self, window_size=20, threshold=3):
        self.values = []
        self.window_size = window_size
        self.threshold = threshold

    def check(self, value):
        self.values.append(value)
        if len(self.values) > self.window_size:
            self.values.pop(0)

        if len(self.values) < 5:
            return False  # Not enough data

        mean = statistics.mean(self.values)
        stdev = statistics.stdev(self.values)
        z_score = abs(value - mean) / max(stdev, 0.001)

        if z_score > self.threshold:
            return True, f"Anomaly: {value:.1f} (mean={mean:.1f}, z={z_score:.1f})"
        return False, f"Normal: {value:.1f} (mean={mean:.1f}, z={z_score:.1f})"

detector = AnomalyDetector(window_size=30, threshold=3)

# Simulate normal traffic with a spike
print("Monitoring for anomalies:")
for i in range(40):
    if 25 <= i <= 27:
        value = random.gauss(200, 30)  # Spike
    else:
        value = random.gauss(50, 10)   # Normal
    is_anomaly, msg = detector.check(value)
    if is_anomaly:
        print(f"  ⚠ {msg}")
    else:
        print(f"  {msg}")

Expected output (approximate):

Monitoring for anomalies:
  Normal: 52.3 (mean=52.3, z=0.0)
  Normal: 48.7 (mean=50.5, z=0.2)
  Normal: 55.1 (mean=52.0, z=0.3)
  ...
  ⚠ Anomaly: 215.3 (mean=51.8, z=14.5)
  ...
  Normal: 49.2 (mean=55.0, z=0.5)

Common Mistakes

Using Lambda architecture unnecessarily: Most use cases can be handled with Kappa (single pipeline). Lambda adds complexity.
Not planning for data replay: Real-time pipelines need to reprocess historical data when fixing bugs. Store raw data in Kafka with long retention.
Ignoring backpressure: If your stream processor can’t keep up with data ingestion, you need backpressure handling or auto-scaling.
Serving dashboard queries from the stream processor: Stream processors aren’t query engines. Use a serving database (ClickHouse, Pinot) for dashboard queries.
Sampling data for dashboards: Sampling hides the tail. Real-user monitoring needs full data to catch rare issues affecting individual users.

Practice Questions

What’s the difference between Lambda and Kappa architectures? Lambda uses two pipelines (speed + batch). Kappa uses one stream processing pipeline for everything, simplifying maintenance.
Why use a serving database like ClickHouse for dashboards? Stream processors are optimized for real-time computation, not sub-second query serving. ClickHouse is optimized for fast analytical queries.
What is anomaly detection in real-time analytics? Identifying data points that deviate significantly from historical patterns — used for fraud detection, system monitoring, and quality control.
How do you handle late-arriving data in real-time pipelines? Use event time processing with watermarks (Flink) or upsert semantics in the serving database.
What is backpressure? When a downstream component can’t keep up with the upstream data rate. Solutions include buffering, scaling, or dropping data.

Challenge

Build a real-time pipeline: produce events to Kafka → consume with Flink/Apify → aggregate per minute → write to ClickHouse → query with Grafana. Start with simulated data from a Python script.

Real-World Task

Open Grafana or a monitoring dashboard your team uses. How often does it refresh? Are metrics powered by real-time queries or pre-aggregated data?

Mini Project: Real-Time Website Monitor

Build a real-time dashboard for website traffic. Use Kafka to stream page views, a Python stream processor to aggregate per minute, and display via a WebSocket-based dashboard (or Grafana with a database).

Security angle: Real-time security analytics enable immediate response to threats — detecting brute-force attacks, data exfiltration, or compromised accounts as they happen.

What’s Next

Review: Apache Kafka

Review: Apache Flink

Review: Data Governance

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

What’s Next

Congratulations on completing this Real-Time Analytics tutorial! Here’s where to go from here:

Practice daily — Consistency is more important than long study sessions
Build a project — Apply what you learned by building something real
Explore related topics — Check out other tutorials in the same category
Join the community — Discuss with other learners and share your progress

Remember: every expert was once a beginner. Keep coding!

Previous Data Governance Explained — Catalogs, Lineage, Quality & GDPR Compliance Next Data Lake Architecture — Medallion, Delta Lake, Iceberg & Hudi

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Big Data & Analytics