Learn Stream Processing Guide — Kafka, Flink, and Real-Time Data Pipelines

Stream Processing Guide — Kafka, Flink, and Real-Time Data Pipelines

DodaTech Updated Jun 15, 2026 8 min read

Stream processing is the continuous computation of data as it arrives, enabling real-time analytics, alerts, and actions on unbounded data streams within sub-second latency.

What You’ll Learn

By the end of this tutorial, you’ll understand stream processing fundamentals — Kafka Streams, Apache Flink, Spark Streaming — the difference between event time and processing time, windowing strategies, exactly-once semantics, and how to build a Flink word count example.

Why Stream Processing Matters

Batch processing has hours of latency. In 2026, businesses need real-time responses — fraud detection within milliseconds, live dashboards updating every second, personalized recommendations as users browse. Stream processing makes this possible. DodaTech uses stream processing in Durga Antivirus Pro to detect malware patterns in real time as files are uploaded by millions of users.

Stream Processing Learning Path


flowchart LR
  A[Apache Spark] --> B[Stream Processing]
  B --> C{You Are Here}
  C --> D[Kafka Streams]
  C --> E[Apache Flink]
  C --> F[Spark Streaming]
  D --> G[Windowing]
  E --> H[Event Time]
  F --> I[Micro-batching]

Prerequisites: Understanding of data engineering basics, ETL pipelines, and Apache Spark. Java/Scala or Python familiarity helps.

What Is Stream Processing?

Think of stream processing like a sushi conveyor belt. Plates (data events) flow continuously. Multiple chefs (processors) watch the belt. One chef looks for salmon orders (filter), another counts total plates (aggregate), a third tracks if plates pass by too fast (alerting).

In batch processing, you’d wait until the restaurant closes, count all plates at once, and prepare a report. In stream processing, you act on every plate as it passes.

Batch vs Stream

	Batch	Stream
Data	Bounded (finite, known size)	Unbounded (infinite, ever-growing)
Latency	Minutes to hours	Milliseconds to seconds
Trigger	Schedule or manual	Continuous arrival
Processing	Full dataset each run	Per-event or micro-batch
State	No state between runs	Stateful (counters, windows, joins)
Tools	Airflow, dbt, Spark batch	Flink, Kafka Streams, Spark Streaming

Event Time vs Processing Time

	Event Time	Processing Time
Definition	When the event actually occurred	When the system processes the event
Determinism	Deterministic (won’t change)	Non-deterministic (depends on system speed)
Accuracy	Reflects reality	May be delayed or reordered
Use	Correct analytics	Monitoring system health
Challenge	Handling late arrivals	No late data concerns

Windowing

Windowing groups events into finite buckets for aggregation.


flowchart LR
  subgraph "Tumbling Window"
    direction LR
    W1[1:00-1:05] --> W2[1:05-1:10] --> W3[1:10-1:15]
  end
  subgraph "Sliding Window"
    direction LR
    S1[1:00-1:10] -.-> S2[1:05-1:15] -.-> S3[1:10-1:20]
  end
  subgraph "Session Window"
    direction LR
    SW1[Active 1:00-1:08] --- GP1[Gap 5 min]
    GP1 --- SW2[Active 1:13-1:20]
  end

Type	Description	Use Case
Tumbling	Fixed, non-overlapping intervals	Page views per hour
Sliding	Fixed, overlapping intervals	Rolling 10-minute average
Session	Gaps in activity define boundaries	User session tracking

Apache Flink Word Count Example

# flink_word_count.py
# Simulated Apache Flink streaming word count
from collections import defaultdict
from datetime import datetime, timedelta
import random
import time

class FlinkStreamJob:
    def __init__(self, window_size_seconds=10):
        self.window_size = window_size_seconds
        self.window_counts = defaultdict(int)
        self.window_start = datetime.now()

    def process_event(self, word, event_time=None):
        """Process a single event (simulated Flink DataStream API)."""
        if event_time is None:
            event_time = datetime.now()

        current_time = datetime.now()
        elapsed = (current_time - self.window_start).total_seconds()

        if elapsed >= self.window_size:
            self.emit_window()
            self.window_start = current_time
            self.window_counts.clear()

        self.window_counts[word] += 1

    def emit_window(self):
        """Emit the current window result."""
        if self.window_counts:
            total = sum(self.window_counts.values())
            print(f"\n=== Window [{self.window_start.strftime('%H:%M:%S')} - "
                  f"{(self.window_start + timedelta(seconds=self.window_size)).strftime('%H:%M:%S')}] ===")
            for word, count in sorted(self.window_counts.items(), key=lambda x: -x[1])[:5]:
                print(f"  {word:<15} {count}")
            print(f"  Total events: {total}")

# Simulate a stream of words
words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape", "honeydew"]

job = FlinkStreamJob(window_size_seconds=15)

print("=== Flink Streaming Word Count ===")
print("Processing events... (window: 15 seconds)\n")

for i in range(50):
    word = random.choice(words)
    job.process_event(word)
    time.sleep(0.3)

# Force emit final window
job.emit_window()

Expected output:

=== Flink Streaming Word Count ===
Processing events... (window: 15 seconds)

=== Window [10:00:00 - 10:00:15] ===
  banana           5
  apple            4
  grape            4
  cherry           3
  elderberry       3
  Total events: 24

=== Window [10:00:15 - 10:00:30] ===
  honeydew         5
  grape            5
  fig              4
  date             3
  apple            3
  Total events: 26

Kafka Streams Stateful Processing

Kafka Streams is a client library for building stream processing applications on top of Apache Kafka.

# kafka_streams_sim.py
# Simulate Kafka Streams stateful aggregation
from collections import defaultdict
from datetime import datetime
import json

class KafkaStreamsApp:
    def __init__(self, app_id="dodatech-processor"):
        self.app_id = app_id
        self.state_store = defaultdict(lambda: {"count": 0, "total": 0.0, "last_seen": None})

    def process_order(self, order):
        """Process an order event (simulated KStream)."""
        user_id = order["user_id"]
        amount = order["amount"]

        store = self.state_store[user_id]
        store["count"] += 1
        store["total"] += amount
        store["last_seen"] = datetime.now().isoformat()

        # Alert if user exceeds threshold in a window
        if store["total"] > 500:
            print(f"[ALERT] User {user_id}: total ${store['total']:.2f} exceeds threshold")
        else:
            print(f"[PROCESS] User {user_id}: +${amount:.2f} (total: ${store['total']:.2f})")

    def get_snapshot(self):
        return dict(self.state_store)

# Simulate order stream
orders = [
    {"order_id": "1001", "user_id": "alice", "amount": 50.00},
    {"order_id": "1002", "user_id": "bob", "amount": 300.00},
    {"order_id": "1003", "user_id": "alice", "amount": 200.00},
    {"order_id": "1004", "user_id": "charlie", "amount": 75.00},
    {"order_id": "1005", "user_id": "alice", "amount": 400.00},
    {"order_id": "1006", "user_id": "bob", "amount": 150.00},
]

app = KafkaStreamsApp()
print("=== Kafka Streams Order Processing ===")
for order in orders:
    app.process_order(order)

print("\n=== State Store Snapshot ===")
for user, state in app.get_snapshot().items():
    print(f"  {user}: {state['count']} orders, ${state['total']:.2f} total")

Expected output:

=== Kafka Streams Order Processing ===
[PROCESS] User alice: +$50.00 (total: $50.00)
[PROCESS] User bob: +$300.00 (total: $300.00)
[PROCESS] User alice: +$200.00 (total: $250.00)
[PROCESS] User charlie: +$75.00 (total: $75.00)
[ALERT] User alice: total $650.00 exceeds threshold
[PROCESS] User bob: +$150.00 (total: $450.00)

=== State Store Snapshot ===
  alice: 3 orders, $650.00 total
  bob: 2 orders, $450.00 total
  charlie: 1 orders, $75.00 total

Exactly-Once Semantics

Semantics	Description	Risk
At-most-once	Event processed at most once	Data loss on failure
At-least-once	Event processed at least once	Duplicate events
Exactly-once	Event processed exactly once	Highest overhead

Flink achieves exactly-once through distributed checkpoints and two-phase commit (transactional sinks). Kafka Streams uses transactional producer with idempotent writes.

Stream Processing Systems Comparison

Feature	Kafka Streams	Apache Flink	Spark Streaming
Model	Event-at-a-time	Event-at-a-time	Micro-batch
Latency	Milliseconds	Milliseconds	Seconds
State	RocksDB state store	RocksDB / FsState	Checkpointed RDDs
Exactly-once	Yes (transactional)	Yes (checkpoints)	Yes (WAL)
Language	Java/Scala only	Java/Scala/Python	Scala/Python/SQL
Watermarks	No native support	Advanced	Late data handling
Best for	Kafka-centric apps	Complex event processing	Migration from batch

Common Stream Processing Mistakes

1. Assuming Events Arrive in Order

Events can be delayed by network issues, mobile offline periods, or batch uploads. Always use event time with appropriate allowed lateness.

2. Ignoring Backpressure

When producers outpace consumers, systems crash. Implement backpressure handling — Kafka Streams and Flink handle this automatically; Spark Streaming needs buffer tuning.

3. Not Handling Late Data

Late events arrive after the window closes. Use allowed lateness (Flink) or side output for late events rather than dropping them silently.

4. State Growth Without Cleanup

Stateful operations (joins, aggregations) accumulate state. Use state TTL, session windows, or compaction to prevent unbounded state growth.

5. Running Streaming and Batch Separately

Maintaining separate code for batch and streaming doubles work. Use unified APIs (Flink’s DataStream + Table API, Spark’s Structured Streaming) that work for both.

6. Not Testing Failure Scenarios

Stream processing must survive worker crashes, network partitions, and Kafka broker failures. Test exactly-once guarantees with chaos engineering.

Practice Questions

1. What is the difference between event time and processing time?

Event time is when the event actually occurred (timestamp from source). Processing time is when the system processes it. Event time handles late arrivals correctly; processing time is simpler but inaccurate.

2. What is windowing in stream processing?

Windows group unbounded streams into finite buckets for aggregation. Types include tumbling (fixed non-overlapping), sliding (fixed overlapping), and session (activity-based gaps).

3. What does exactly-once semantics mean?

Each event is processed exactly one time, despite failures, retries, or restarts. It prevents both data loss and duplicates, typically achieved through distributed checkpoints and transactional writes.

4. How does Flink handle late-arriving data?

Flink uses watermarks (a threshold indicating “no more events before this time”) with allowed lateness configuration. Late events within the allowed period trigger recomputation; beyond it, events go to a side output.

5. Challenge: Design a stream processing system for a ride-sharing app that needs real-time driver location updates (every 2 seconds), surge pricing calculations (every minute), and daily driver earnings reports.

Kafka for location events (partitioned by driver_id). Flink for surge pricing (sliding window, 1-min length, 30-sec slide). Kafka Streams for per-driver earnings (session window, 24-hour gap). Batch pipeline (Airflow + Spark) for daily reconciliation and reports.

Mini Project: Real-Time Alert System

# realtime_alert.py
# Simulate a stream processing alert system
import time
import random
from datetime import datetime
from collections import deque

class AlertSystem:
    def __init__(self, error_threshold=3, window_seconds=30):
        self.error_log = deque()
        self.threshold = error_threshold
        self.window = window_seconds

    def process_log(self, log_entry):
        timestamp = datetime.now()
        self.error_log.append((timestamp, log_entry))
        self._clean_old(timestamp)

        if log_entry["severity"] == "ERROR":
            recent_errors = [e for t, e in self.error_log if e["severity"] == "ERROR"]
            if len(recent_errors) >= self.threshold:
                print(f"[ALERT] {self.threshold}+ errors in last {self.window}s! "
                      f"Current: {len(recent_errors)} errors")
                return True
        return False

    def _clean_old(self, now):
        while self.error_log and (now - self.error_log[0][0]).total_seconds() > self.window:
            self.error_log.popleft()

    def get_stats(self):
        total = len(self.error_log)
        errors = sum(1 for _, e in self.error_log if e["severity"] == "ERROR")
        return {"total_logs": total, "errors_in_window": errors}

system = AlertSystem(error_threshold=3, window_seconds=15)

print("=== Real-Time Error Alert System ===")
severities = ["INFO", "WARN", "ERROR"]
for i in range(20):
    entry = {
        "id": i,
        "service": random.choice(["api", "worker", "db"]),
        "severity": random.choices(severities, weights=[0.5, 0.3, 0.2])[0],
        "message": f"Log entry #{i}",
    }
    system.process_log(entry)
    time.sleep(0.2)

print(f"\nFinal stats: {system.get_stats()}")

Expected output:

=== Real-Time Error Alert System ===
[ALERT] 3+ errors in last 15s! Current: 3 errors
[ALERT] 3+ errors in last 15s! Current: 4 errors

Final stats: {'total_logs': 20, 'errors_in_window': 2}

Related Concepts

Apache Spark

Building Data Pipelines

Data Lakes

What’s Next

You now understand stream processing fundamentals! Next, learn building data pipelines end-to-end, combining batch and streaming with monitoring and data quality checks. Also explore Apache Kafka deeply for event streaming.

Practice daily — Run the Flink word count simulation with real data
Build a project — Create a real-time dashboard using Kafka + Flink + WebSocket
Explore related topics — Check out Kafka vs Pulsar comparisons

Remember: every expert was once a beginner. Keep coding!

Previous Apache Spark Guide — RDDs, DataFrames, and PySpark Examples Next Building Data Pipelines — End-to-End Design, Monitoring, and Orchestration

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Engineering