Real-Time Data Pipelines -- Kafka, Flink, Spark Streaming & Analytics
In this tutorial, you'll learn about Real. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Real-time Data Pipelines Process and analyze events as they arrive using Apache Kafka for ingestion, Apache Flink or Spark Streaming for Stream Processing, and ClickHouse for real-time analytics queries.
What You'll Learn
In this tutorial, you will learn how to design and deploy a real-time analytics pipeline that ingests events from web applications via Kafka, processes them with Flink or Spark Streaming, and serves low-latency queries through ClickHouse for live dashboards.
Why It Matters
Batch processing with daily exports is too slow for modern applications. Fraud detection, recommendation engines, monitoring alerts, and live dashboards all require sub-second latency from event to insight. Real-time pipelines turn data into decisions instantly.
Real-World Use
Doda Browser sends anonymized usage events to a Kafka cluster at 50K events per second. Flink processes the stream to update session State, calculate aggregations, and write to ClickHouse. The operations team sees live browser performance metrics with LESS than 2 seconds of end-to-end latency.
Real-Time Pipeline Architecture
flowchart LR
A[Web App Events] --> B[Kafka Producer]
B --> C[Kafka Cluster]
C --> D[Flink / Spark Streaming]
D --> E[Stream Processing]
E --> F[ClickHouse]
E --> G[Alerting]
F --> H[Real-Time Dashboard]
C --> I[Kafka Connect]
I --> J[(Data Lake)]
Kafka Producer for Analytics Events
Send events from a web application to Kafka:
import org.Apache.Kafka.clients.producer.*;
import Java.util.Properties;
public class AnalyticsProducer {
public static void main(String[] args) {
Properties props = new Properties();
props.put("Bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.Apache.Kafka.common.Serialization.StringSerializer");
props.put("value.serializer", "org.Apache.Kafka.common.Serialization.StringSerializer");
props.put("acks", "all");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
String event = "{\"event\":\"page_view\",\"user_id\":\"abc123\",\"page\":\"/docs\",\"timestamp\":\"2026-06-22T10:00:00Z\"}";
ProducerRecord<String, String> record = new ProducerRecord<>("analytics-events", event);
producer.send(record, (metadata, exception) -> {
if (exception != null) {
System.err.println("Send failed: " + exception.getMessage());
} else {
System.out.println("Sent to partition " + metadata.partition() + " offset " + metadata.offset());
}
});
producer.close();
}
}
Expected output:
Sent to partition 3 offset 142857
Stream Processing with Flink
Process the event stream with Apache Flink:
import org.Apache.flink.streaming.API.environment.StreamExecutionEnvironment;
import org.Apache.flink.streaming.API.datastream.DataStream;
import org.Apache.flink.API.common.functions.MapFunction;
public class AnalyticsStreamProcessor {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> events = env
.addSource(new FlinkKafkaConsumer<>("analytics-events",
new SimpleStringSchema(), properties));
DataStream<PageViewAgg> pageViews = events
.map(event -> parseEvent(event))
.keyBy(PageView::getPage)
.timeWindow(Time.minutes(1))
.aggregate(new PageViewAggregator());
pageViews
.addSink(new ClickHouseSink<>("JDBC:clickhouse://localhost:8123/analytics"));
env.execute("Analytics Stream Processor");
}
}
Expected behavior: Flink reads from the analytics-events Kafka topic, aggregates page views per page in 1-minute tumbling Windows, and writes the results to ClickHouse for dashboard queries.
Real-Time Aggregation with Spark Streaming
Equivalent processing with Spark Structured Streaming:
from pyspark.SQL import SparkSession
from pyspark.SQL.functions import window, count
Spark = SparkSession.Builder.appName("AnalyticsStream").getOrCreate()
events = Spark \
.readStream \
.format("Kafka") \
.option("Kafka.Bootstrap.servers", "localhost:9092") \
.option("subscribe", "analytics-events") \
.load()
page_views = events \
.selectExpr("CAST(value AS STRING) as JSON") \
.selectExpr("JSON_tuple(JSON, 'page', 'timestamp') as (page, ts)") \
.withWatermark("ts", "10 seconds") \
.groupBy(window("ts", "1 minute"), "page") \
.agg(count("*").alias("views"))
query = page_views \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
query.awaitTermination()
Expected output:
Batch: 1
+--------------------+-------------+-----+
|window |page |views|
+--------------------+-------------+-----+
|{2026-06-22 10:01..}|/docs |142 |
|{2026-06-22 10:01..}|/pricing |89 |
|{2026-06-22 10:01..}|/signup |56 |
+--------------------+-------------+-----+
Serving Real-Time Queries with ClickHouse
ClickHouse provides sub-second aggregation queries on streaming data:
-- Real-time dashboard query: page views in last 5 minutes by page
SELECT
page,
count() AS views,
uniqExact(user_id) AS unique_visitors
FROM analytics.page_views
WHERE timestamp >= now() - INTERVAL 5 MINUTE
GROUP BY page
ORDER BY views DESC
LIMIT 10;
Expected output:
| page | views | unique_visitors |
|---|---|---|
| /docs | 1230 | 890 |
| /pricing | 540 | 410 |
| /signup | 320 | 290 |
Tool Comparison
| Feature | Apache Kafka | Apache Flink | Spark Streaming | ClickHouse |
|---|---|---|---|---|
| Role | Event ingestion | Stream Processing | Stream Processing | Real-time OLAP |
| Latency | < 10ms | < 100ms | < 1s | < 10ms queries |
| State management | None | Managed State | Checkpointing | Materialized views |
| Exactly-once | Yes | Yes | Yes | N/A (query) |
| Learning curve | Medium | Steep | Medium | Low |
| Deployment | JVM cluster | JVM cluster | JVM cluster | Single binary |
Common Errors
1. Kafka Producer Without acks=all
Without acks=all, the producer considers a message sent when the leader acknowledges, not when all replicas have it. This risks data loss on leader failure.
2. Flink Checkpointing Not Configured
Without checkpointing, Flink loses all State on failure. Enable checkpointing with env.enableCheckpointing(60000) for exactly-once guarantees.
3. Spark Watermark Mismatch
Spark Streaming uses watermarks to handle late data. If the watermark delay is shorter than actual event lateness, late events are silently dropped. Set watermarks based on observed latency.
4. ClickHouse Not Using MergeTree Engine
The default TinyLog engine does not support efficient aggregations. Always use the MergeTree family with an appropriate ordering key for analytics workloads.
5. Ignoring Kafka Partition Count
Processing throughput is limited by the number of partitions. A topic with 3 partitions can only be consumed by 3 Flink/Spark tasks in parallel. Plan partition count based on expected throughput.
Practice Questions
1. What role does Kafka play in a real-time pipeline? Kafka acts as the durable event log and Message Broker. Producers write events to topics, and consumers (Flink, Spark) read from them. Kafka decouples data production from consumption.
2. How does Flink manage State across failures? Flink takes periodic checkpoints of operator State to a durable backend (HDFS, S3). On failure, it restores from the latest checkpoint and replays events from the Kafka offset.
3. Why use ClickHouse instead of PostgreSQL for real-time analytics? ClickHouse uses columnar storage and vectorized query execution, enabling sub-second aggregations on billions of rows. PostgreSQL is row-oriented and slower for analytical queries.
4. What is the difference between tumbling and sliding Windows? Tumbling Windows are fixed non-overlapping intervals (e.g., every minute). Sliding Windows overlap and emit results more frequently (e.g., window of 5 minutes sliding every 10 seconds).
5. Challenge: Deploy a real-time analytics pipeline with Kafka for ingestion, Flink for Stream Processing (1-minute aggregations), and ClickHouse for serving. Generate 1000 events per second and verify end-to-end latency is under 3 seconds.
Mini Project
Build a real-time anomaly detection pipeline for website traffic. Stream page view events through Kafka, use Flink to calculate a rolling average of requests per second over 10-minute Windows, and alert if the current rate deviates by more than 3 standard deviations from the rolling average. Store anomalies in ClickHouse and display on a real-time Grafana dashboard.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro