Real-Time Data Pipelines -- Kafka, Flink, Spark Streaming & Analytics

DodaTech Updated 2026-06-22 5 min read

In this tutorial, you'll learn about Real. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Real-time Data Pipelines Process and analyze events as they arrive using Apache Kafka for ingestion, Apache Flink or Spark Streaming for Stream Processing, and ClickHouse for real-time analytics queries.

What You'll Learn

In this tutorial, you will learn how to design and deploy a real-time analytics pipeline that ingests events from web applications via Kafka, processes them with Flink or Spark Streaming, and serves low-latency queries through ClickHouse for live dashboards.

Why It Matters

Batch processing with daily exports is too slow for modern applications. Fraud detection, recommendation engines, monitoring alerts, and live dashboards all require sub-second latency from event to insight. Real-time pipelines turn data into decisions instantly.

Real-World Use

Doda Browser sends anonymized usage events to a Kafka cluster at 50K events per second. Flink processes the stream to update session State, calculate aggregations, and write to ClickHouse. The operations team sees live browser performance metrics with LESS than 2 seconds of end-to-end latency.

Real-Time Pipeline Architecture

flowchart LR
    A[Web App Events] --> B[Kafka Producer]
    B --> C[Kafka Cluster]
    C --> D[Flink / Spark Streaming]
    D --> E[Stream Processing]
    E --> F[ClickHouse]
    E --> G[Alerting]
    F --> H[Real-Time Dashboard]
    C --> I[Kafka Connect]
    I --> J[(Data Lake)]

Kafka Producer for Analytics Events

Send events from a web application to Kafka:

import org.Apache.Kafka.clients.producer.*;
import Java.util.Properties;

public class AnalyticsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("Bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.Apache.Kafka.common.Serialization.StringSerializer");
        props.put("value.serializer", "org.Apache.Kafka.common.Serialization.StringSerializer");
        props.put("acks", "all");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        String event = "{\"event\":\"page_view\",\"user_id\":\"abc123\",\"page\":\"/docs\",\"timestamp\":\"2026-06-22T10:00:00Z\"}";
        ProducerRecord<String, String> record = new ProducerRecord<>("analytics-events", event);

        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                System.err.println("Send failed: " + exception.getMessage());
            } else {
                System.out.println("Sent to partition " + metadata.partition() + " offset " + metadata.offset());
            }
        });
        producer.close();
    }
}

Expected output:

Sent to partition 3 offset 142857

Stream Processing with Flink

Process the event stream with Apache Flink:

import org.Apache.flink.streaming.API.environment.StreamExecutionEnvironment;
import org.Apache.flink.streaming.API.datastream.DataStream;
import org.Apache.flink.API.common.functions.MapFunction;

public class AnalyticsStreamProcessor {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env
            .addSource(new FlinkKafkaConsumer<>("analytics-events", 
                new SimpleStringSchema(), properties));

        DataStream<PageViewAgg> pageViews = events
            .map(event -> parseEvent(event))
            .keyBy(PageView::getPage)
            .timeWindow(Time.minutes(1))
            .aggregate(new PageViewAggregator());

        pageViews
            .addSink(new ClickHouseSink<>("JDBC:clickhouse://localhost:8123/analytics"));

        env.execute("Analytics Stream Processor");
    }
}

Expected behavior: Flink reads from the analytics-events Kafka topic, aggregates page views per page in 1-minute tumbling Windows, and writes the results to ClickHouse for dashboard queries.

Real-Time Aggregation with Spark Streaming

Equivalent processing with Spark Structured Streaming:

from pyspark.SQL import SparkSession
from pyspark.SQL.functions import window, count

Spark = SparkSession.Builder.appName("AnalyticsStream").getOrCreate()

events = Spark \
    .readStream \
    .format("Kafka") \
    .option("Kafka.Bootstrap.servers", "localhost:9092") \
    .option("subscribe", "analytics-events") \
    .load()

page_views = events \
    .selectExpr("CAST(value AS STRING) as JSON") \
    .selectExpr("JSON_tuple(JSON, 'page', 'timestamp') as (page, ts)") \
    .withWatermark("ts", "10 seconds") \
    .groupBy(window("ts", "1 minute"), "page") \
    .agg(count("*").alias("views"))

query = page_views \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Expected output:

Batch: 1
+--------------------+-------------+-----+
|window              |page         |views|
+--------------------+-------------+-----+
|{2026-06-22 10:01..}|/docs        |142  |
|{2026-06-22 10:01..}|/pricing     |89   |
|{2026-06-22 10:01..}|/signup      |56   |
+--------------------+-------------+-----+

Serving Real-Time Queries with ClickHouse

ClickHouse provides sub-second aggregation queries on streaming data:

-- Real-time dashboard query: page views in last 5 minutes by page
SELECT
  page,
  count() AS views,
  uniqExact(user_id) AS unique_visitors
FROM analytics.page_views
WHERE timestamp >= now() - INTERVAL 5 MINUTE
GROUP BY page
ORDER BY views DESC
LIMIT 10;

Expected output:

page	views	unique_visitors
/docs	1230	890
/pricing	540	410
/signup	320	290

Tool Comparison

Feature	Apache Kafka	Apache Flink	Spark Streaming	ClickHouse
Role	Event ingestion	Stream Processing	Stream Processing	Real-time OLAP
Latency	< 10ms	< 100ms	< 1s	< 10ms queries
State management	None	Managed State	Checkpointing	Materialized views
Exactly-once	Yes	Yes	Yes	N/A (query)
Learning curve	Medium	Steep	Medium	Low
Deployment	JVM cluster	JVM cluster	JVM cluster	Single binary

Common Errors

1. Kafka Producer Without acks=all

Without acks=all, the producer considers a message sent when the leader acknowledges, not when all replicas have it. This risks data loss on leader failure.

2. Flink Checkpointing Not Configured

Without checkpointing, Flink loses all State on failure. Enable checkpointing with env.enableCheckpointing(60000) for exactly-once guarantees.

3. Spark Watermark Mismatch

Spark Streaming uses watermarks to handle late data. If the watermark delay is shorter than actual event lateness, late events are silently dropped. Set watermarks based on observed latency.

4. ClickHouse Not Using MergeTree Engine

The default TinyLog engine does not support efficient aggregations. Always use the MergeTree family with an appropriate ordering key for analytics workloads.

5. Ignoring Kafka Partition Count

Processing throughput is limited by the number of partitions. A topic with 3 partitions can only be consumed by 3 Flink/Spark tasks in parallel. Plan partition count based on expected throughput.

Practice Questions

1. What role does Kafka play in a real-time pipeline? Kafka acts as the durable event log and Message Broker. Producers write events to topics, and consumers (Flink, Spark) read from them. Kafka decouples data production from consumption.

2. How does Flink manage State across failures? Flink takes periodic checkpoints of operator State to a durable backend (HDFS, S3). On failure, it restores from the latest checkpoint and replays events from the Kafka offset.

3. Why use ClickHouse instead of PostgreSQL for real-time analytics? ClickHouse uses columnar storage and vectorized query execution, enabling sub-second aggregations on billions of rows. PostgreSQL is row-oriented and slower for analytical queries.

4. What is the difference between tumbling and sliding Windows? Tumbling Windows are fixed non-overlapping intervals (e.g., every minute). Sliding Windows overlap and emit results more frequently (e.g., window of 5 minutes sliding every 10 seconds).

5. Challenge: Deploy a real-time analytics pipeline with Kafka for ingestion, Flink for Stream Processing (1-minute aggregations), and ClickHouse for serving. Generate 1000 events per second and verify end-to-end latency is under 3 seconds.

Mini Project

Build a real-time anomaly detection pipeline for website traffic. Stream page view events through Kafka, use Flink to calculate a rolling average of requests per second over 10-minute Windows, and alert if the current rate deviates by more than 3 standard deviations from the rolling average. Store anomalies in ClickHouse and display on a real-time Grafana dashboard.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Web Analytics Best Practices -- Metrics, Dashboards & Reporting Next → A/B Testing & Experimentation for Developers -- Statistical Methods & Implementation

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Analytics