Stream Processing Guide — Kafka, Flink, and Real-Time Data Pipelines
Stream processing is the continuous computation of data as it arrives, enabling real-time analytics, alerts, and actions on unbounded data streams within sub-second latency.
What You’ll Learn
By the end of this tutorial, you’ll understand stream processing fundamentals — Kafka Streams, Apache Flink, Spark Streaming — the difference between event time and processing time, windowing strategies, exactly-once semantics, and how to build a Flink word count example.
Why Stream Processing Matters
Batch processing has hours of latency. In 2026, businesses need real-time responses — fraud detection within milliseconds, live dashboards updating every second, personalized recommendations as users browse. Stream processing makes this possible. DodaTech uses stream processing in Durga Antivirus Pro to detect malware patterns in real time as files are uploaded by millions of users.
Stream Processing Learning Path
flowchart LR
A[Apache Spark] --> B[Stream Processing]
B --> C{You Are Here}
C --> D[Kafka Streams]
C --> E[Apache Flink]
C --> F[Spark Streaming]
D --> G[Windowing]
E --> H[Event Time]
F --> I[Micro-batching]
What Is Stream Processing?
Think of stream processing like a sushi conveyor belt. Plates (data events) flow continuously. Multiple chefs (processors) watch the belt. One chef looks for salmon orders (filter), another counts total plates (aggregate), a third tracks if plates pass by too fast (alerting).
In batch processing, you’d wait until the restaurant closes, count all plates at once, and prepare a report. In stream processing, you act on every plate as it passes.
Batch vs Stream
| Batch | Stream | |
|---|---|---|
| Data | Bounded (finite, known size) | Unbounded (infinite, ever-growing) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Trigger | Schedule or manual | Continuous arrival |
| Processing | Full dataset each run | Per-event or micro-batch |
| State | No state between runs | Stateful (counters, windows, joins) |
| Tools | Airflow, dbt, Spark batch | Flink, Kafka Streams, Spark Streaming |
Event Time vs Processing Time
| Event Time | Processing Time | |
|---|---|---|
| Definition | When the event actually occurred | When the system processes the event |
| Determinism | Deterministic (won’t change) | Non-deterministic (depends on system speed) |
| Accuracy | Reflects reality | May be delayed or reordered |
| Use | Correct analytics | Monitoring system health |
| Challenge | Handling late arrivals | No late data concerns |
Windowing
Windowing groups events into finite buckets for aggregation.
flowchart LR
subgraph "Tumbling Window"
direction LR
W1[1:00-1:05] --> W2[1:05-1:10] --> W3[1:10-1:15]
end
subgraph "Sliding Window"
direction LR
S1[1:00-1:10] -.-> S2[1:05-1:15] -.-> S3[1:10-1:20]
end
subgraph "Session Window"
direction LR
SW1[Active 1:00-1:08] --- GP1[Gap 5 min]
GP1 --- SW2[Active 1:13-1:20]
end
| Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Page views per hour |
| Sliding | Fixed, overlapping intervals | Rolling 10-minute average |
| Session | Gaps in activity define boundaries | User session tracking |
Apache Flink Word Count Example
# flink_word_count.py
# Simulated Apache Flink streaming word count
from collections import defaultdict
from datetime import datetime, timedelta
import random
import time
class FlinkStreamJob:
def __init__(self, window_size_seconds=10):
self.window_size = window_size_seconds
self.window_counts = defaultdict(int)
self.window_start = datetime.now()
def process_event(self, word, event_time=None):
"""Process a single event (simulated Flink DataStream API)."""
if event_time is None:
event_time = datetime.now()
current_time = datetime.now()
elapsed = (current_time - self.window_start).total_seconds()
if elapsed >= self.window_size:
self.emit_window()
self.window_start = current_time
self.window_counts.clear()
self.window_counts[word] += 1
def emit_window(self):
"""Emit the current window result."""
if self.window_counts:
total = sum(self.window_counts.values())
print(f"\n=== Window [{self.window_start.strftime('%H:%M:%S')} - "
f"{(self.window_start + timedelta(seconds=self.window_size)).strftime('%H:%M:%S')}] ===")
for word, count in sorted(self.window_counts.items(), key=lambda x: -x[1])[:5]:
print(f" {word:<15} {count}")
print(f" Total events: {total}")
# Simulate a stream of words
words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape", "honeydew"]
job = FlinkStreamJob(window_size_seconds=15)
print("=== Flink Streaming Word Count ===")
print("Processing events... (window: 15 seconds)\n")
for i in range(50):
word = random.choice(words)
job.process_event(word)
time.sleep(0.3)
# Force emit final window
job.emit_window()Expected output:
=== Flink Streaming Word Count ===
Processing events... (window: 15 seconds)
=== Window [10:00:00 - 10:00:15] ===
banana 5
apple 4
grape 4
cherry 3
elderberry 3
Total events: 24
=== Window [10:00:15 - 10:00:30] ===
honeydew 5
grape 5
fig 4
date 3
apple 3
Total events: 26Kafka Streams Stateful Processing
Kafka Streams is a client library for building stream processing applications on top of Apache Kafka.
# kafka_streams_sim.py
# Simulate Kafka Streams stateful aggregation
from collections import defaultdict
from datetime import datetime
import json
class KafkaStreamsApp:
def __init__(self, app_id="dodatech-processor"):
self.app_id = app_id
self.state_store = defaultdict(lambda: {"count": 0, "total": 0.0, "last_seen": None})
def process_order(self, order):
"""Process an order event (simulated KStream)."""
user_id = order["user_id"]
amount = order["amount"]
store = self.state_store[user_id]
store["count"] += 1
store["total"] += amount
store["last_seen"] = datetime.now().isoformat()
# Alert if user exceeds threshold in a window
if store["total"] > 500:
print(f"[ALERT] User {user_id}: total ${store['total']:.2f} exceeds threshold")
else:
print(f"[PROCESS] User {user_id}: +${amount:.2f} (total: ${store['total']:.2f})")
def get_snapshot(self):
return dict(self.state_store)
# Simulate order stream
orders = [
{"order_id": "1001", "user_id": "alice", "amount": 50.00},
{"order_id": "1002", "user_id": "bob", "amount": 300.00},
{"order_id": "1003", "user_id": "alice", "amount": 200.00},
{"order_id": "1004", "user_id": "charlie", "amount": 75.00},
{"order_id": "1005", "user_id": "alice", "amount": 400.00},
{"order_id": "1006", "user_id": "bob", "amount": 150.00},
]
app = KafkaStreamsApp()
print("=== Kafka Streams Order Processing ===")
for order in orders:
app.process_order(order)
print("\n=== State Store Snapshot ===")
for user, state in app.get_snapshot().items():
print(f" {user}: {state['count']} orders, ${state['total']:.2f} total")Expected output:
=== Kafka Streams Order Processing ===
[PROCESS] User alice: +$50.00 (total: $50.00)
[PROCESS] User bob: +$300.00 (total: $300.00)
[PROCESS] User alice: +$200.00 (total: $250.00)
[PROCESS] User charlie: +$75.00 (total: $75.00)
[ALERT] User alice: total $650.00 exceeds threshold
[PROCESS] User bob: +$150.00 (total: $450.00)
=== State Store Snapshot ===
alice: 3 orders, $650.00 total
bob: 2 orders, $450.00 total
charlie: 1 orders, $75.00 totalExactly-Once Semantics
| Semantics | Description | Risk |
|---|---|---|
| At-most-once | Event processed at most once | Data loss on failure |
| At-least-once | Event processed at least once | Duplicate events |
| Exactly-once | Event processed exactly once | Highest overhead |
Flink achieves exactly-once through distributed checkpoints and two-phase commit (transactional sinks). Kafka Streams uses transactional producer with idempotent writes.
Stream Processing Systems Comparison
| Feature | Kafka Streams | Apache Flink | Spark Streaming |
|---|---|---|---|
| Model | Event-at-a-time | Event-at-a-time | Micro-batch |
| Latency | Milliseconds | Milliseconds | Seconds |
| State | RocksDB state store | RocksDB / FsState | Checkpointed RDDs |
| Exactly-once | Yes (transactional) | Yes (checkpoints) | Yes (WAL) |
| Language | Java/Scala only | Java/Scala/Python | Scala/Python/SQL |
| Watermarks | No native support | Advanced | Late data handling |
| Best for | Kafka-centric apps | Complex event processing | Migration from batch |
Common Stream Processing Mistakes
1. Assuming Events Arrive in Order
Events can be delayed by network issues, mobile offline periods, or batch uploads. Always use event time with appropriate allowed lateness.
2. Ignoring Backpressure
When producers outpace consumers, systems crash. Implement backpressure handling — Kafka Streams and Flink handle this automatically; Spark Streaming needs buffer tuning.
3. Not Handling Late Data
Late events arrive after the window closes. Use allowed lateness (Flink) or side output for late events rather than dropping them silently.
4. State Growth Without Cleanup
Stateful operations (joins, aggregations) accumulate state. Use state TTL, session windows, or compaction to prevent unbounded state growth.
5. Running Streaming and Batch Separately
Maintaining separate code for batch and streaming doubles work. Use unified APIs (Flink’s DataStream + Table API, Spark’s Structured Streaming) that work for both.
6. Not Testing Failure Scenarios
Stream processing must survive worker crashes, network partitions, and Kafka broker failures. Test exactly-once guarantees with chaos engineering.
Practice Questions
1. What is the difference between event time and processing time?
Event time is when the event actually occurred (timestamp from source). Processing time is when the system processes it. Event time handles late arrivals correctly; processing time is simpler but inaccurate.
2. What is windowing in stream processing?
Windows group unbounded streams into finite buckets for aggregation. Types include tumbling (fixed non-overlapping), sliding (fixed overlapping), and session (activity-based gaps).
3. What does exactly-once semantics mean?
Each event is processed exactly one time, despite failures, retries, or restarts. It prevents both data loss and duplicates, typically achieved through distributed checkpoints and transactional writes.
4. How does Flink handle late-arriving data?
Flink uses watermarks (a threshold indicating “no more events before this time”) with allowed lateness configuration. Late events within the allowed period trigger recomputation; beyond it, events go to a side output.
5. Challenge: Design a stream processing system for a ride-sharing app that needs real-time driver location updates (every 2 seconds), surge pricing calculations (every minute), and daily driver earnings reports.
Kafka for location events (partitioned by driver_id). Flink for surge pricing (sliding window, 1-min length, 30-sec slide). Kafka Streams for per-driver earnings (session window, 24-hour gap). Batch pipeline (Airflow + Spark) for daily reconciliation and reports.
Mini Project: Real-Time Alert System
# realtime_alert.py
# Simulate a stream processing alert system
import time
import random
from datetime import datetime
from collections import deque
class AlertSystem:
def __init__(self, error_threshold=3, window_seconds=30):
self.error_log = deque()
self.threshold = error_threshold
self.window = window_seconds
def process_log(self, log_entry):
timestamp = datetime.now()
self.error_log.append((timestamp, log_entry))
self._clean_old(timestamp)
if log_entry["severity"] == "ERROR":
recent_errors = [e for t, e in self.error_log if e["severity"] == "ERROR"]
if len(recent_errors) >= self.threshold:
print(f"[ALERT] {self.threshold}+ errors in last {self.window}s! "
f"Current: {len(recent_errors)} errors")
return True
return False
def _clean_old(self, now):
while self.error_log and (now - self.error_log[0][0]).total_seconds() > self.window:
self.error_log.popleft()
def get_stats(self):
total = len(self.error_log)
errors = sum(1 for _, e in self.error_log if e["severity"] == "ERROR")
return {"total_logs": total, "errors_in_window": errors}
system = AlertSystem(error_threshold=3, window_seconds=15)
print("=== Real-Time Error Alert System ===")
severities = ["INFO", "WARN", "ERROR"]
for i in range(20):
entry = {
"id": i,
"service": random.choice(["api", "worker", "db"]),
"severity": random.choices(severities, weights=[0.5, 0.3, 0.2])[0],
"message": f"Log entry #{i}",
}
system.process_log(entry)
time.sleep(0.2)
print(f"\nFinal stats: {system.get_stats()}")Expected output:
=== Real-Time Error Alert System ===
[ALERT] 3+ errors in last 15s! Current: 3 errors
[ALERT] 3+ errors in last 15s! Current: 4 errors
Final stats: {'total_logs': 20, 'errors_in_window': 2}Related Concepts
What’s Next
You now understand stream processing fundamentals! Next, learn building data pipelines end-to-end, combining batch and streaming with monitoring and data quality checks. Also explore Apache Kafka deeply for event streaming.
- Practice daily — Run the Flink word count simulation with real data
- Build a project — Create a real-time dashboard using Kafka + Flink + WebSocket
- Explore related topics — Check out Kafka vs Pulsar comparisons
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro