Skip to content
Stream Processing Guide — Kafka, Flink, and Real-Time Data Pipelines

Stream Processing Guide — Kafka, Flink, and Real-Time Data Pipelines

DodaTech Updated Jun 15, 2026 8 min read

Stream processing is the continuous computation of data as it arrives, enabling real-time analytics, alerts, and actions on unbounded data streams within sub-second latency.

What You’ll Learn

By the end of this tutorial, you’ll understand stream processing fundamentals — Kafka Streams, Apache Flink, Spark Streaming — the difference between event time and processing time, windowing strategies, exactly-once semantics, and how to build a Flink word count example.

Why Stream Processing Matters

Batch processing has hours of latency. In 2026, businesses need real-time responses — fraud detection within milliseconds, live dashboards updating every second, personalized recommendations as users browse. Stream processing makes this possible. DodaTech uses stream processing in Durga Antivirus Pro to detect malware patterns in real time as files are uploaded by millions of users.

Stream Processing Learning Path


flowchart LR
  A[Apache Spark] --> B[Stream Processing]
  B --> C{You Are Here}
  C --> D[Kafka Streams]
  C --> E[Apache Flink]
  C --> F[Spark Streaming]
  D --> G[Windowing]
  E --> H[Event Time]
  F --> I[Micro-batching]

Prerequisites: Understanding of data engineering basics, ETL pipelines, and Apache Spark. Java/Scala or Python familiarity helps.

What Is Stream Processing?

Think of stream processing like a sushi conveyor belt. Plates (data events) flow continuously. Multiple chefs (processors) watch the belt. One chef looks for salmon orders (filter), another counts total plates (aggregate), a third tracks if plates pass by too fast (alerting).

In batch processing, you’d wait until the restaurant closes, count all plates at once, and prepare a report. In stream processing, you act on every plate as it passes.

Batch vs Stream

BatchStream
DataBounded (finite, known size)Unbounded (infinite, ever-growing)
LatencyMinutes to hoursMilliseconds to seconds
TriggerSchedule or manualContinuous arrival
ProcessingFull dataset each runPer-event or micro-batch
StateNo state between runsStateful (counters, windows, joins)
ToolsAirflow, dbt, Spark batchFlink, Kafka Streams, Spark Streaming

Event Time vs Processing Time

Event TimeProcessing Time
DefinitionWhen the event actually occurredWhen the system processes the event
DeterminismDeterministic (won’t change)Non-deterministic (depends on system speed)
AccuracyReflects realityMay be delayed or reordered
UseCorrect analyticsMonitoring system health
ChallengeHandling late arrivalsNo late data concerns

Windowing

Windowing groups events into finite buckets for aggregation.


flowchart LR
  subgraph "Tumbling Window"
    direction LR
    W1[1:00-1:05] --> W2[1:05-1:10] --> W3[1:10-1:15]
  end
  subgraph "Sliding Window"
    direction LR
    S1[1:00-1:10] -.-> S2[1:05-1:15] -.-> S3[1:10-1:20]
  end
  subgraph "Session Window"
    direction LR
    SW1[Active 1:00-1:08] --- GP1[Gap 5 min]
    GP1 --- SW2[Active 1:13-1:20]
  end

TypeDescriptionUse Case
TumblingFixed, non-overlapping intervalsPage views per hour
SlidingFixed, overlapping intervalsRolling 10-minute average
SessionGaps in activity define boundariesUser session tracking

Apache Flink Word Count Example

# flink_word_count.py
# Simulated Apache Flink streaming word count
from collections import defaultdict
from datetime import datetime, timedelta
import random
import time

class FlinkStreamJob:
    def __init__(self, window_size_seconds=10):
        self.window_size = window_size_seconds
        self.window_counts = defaultdict(int)
        self.window_start = datetime.now()

    def process_event(self, word, event_time=None):
        """Process a single event (simulated Flink DataStream API)."""
        if event_time is None:
            event_time = datetime.now()

        current_time = datetime.now()
        elapsed = (current_time - self.window_start).total_seconds()

        if elapsed >= self.window_size:
            self.emit_window()
            self.window_start = current_time
            self.window_counts.clear()

        self.window_counts[word] += 1

    def emit_window(self):
        """Emit the current window result."""
        if self.window_counts:
            total = sum(self.window_counts.values())
            print(f"\n=== Window [{self.window_start.strftime('%H:%M:%S')} - "
                  f"{(self.window_start + timedelta(seconds=self.window_size)).strftime('%H:%M:%S')}] ===")
            for word, count in sorted(self.window_counts.items(), key=lambda x: -x[1])[:5]:
                print(f"  {word:<15} {count}")
            print(f"  Total events: {total}")

# Simulate a stream of words
words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape", "honeydew"]

job = FlinkStreamJob(window_size_seconds=15)

print("=== Flink Streaming Word Count ===")
print("Processing events... (window: 15 seconds)\n")

for i in range(50):
    word = random.choice(words)
    job.process_event(word)
    time.sleep(0.3)

# Force emit final window
job.emit_window()

Expected output:

=== Flink Streaming Word Count ===
Processing events... (window: 15 seconds)

=== Window [10:00:00 - 10:00:15] ===
  banana           5
  apple            4
  grape            4
  cherry           3
  elderberry       3
  Total events: 24

=== Window [10:00:15 - 10:00:30] ===
  honeydew         5
  grape            5
  fig              4
  date             3
  apple            3
  Total events: 26

Kafka Streams Stateful Processing

Kafka Streams is a client library for building stream processing applications on top of Apache Kafka.

# kafka_streams_sim.py
# Simulate Kafka Streams stateful aggregation
from collections import defaultdict
from datetime import datetime
import json

class KafkaStreamsApp:
    def __init__(self, app_id="dodatech-processor"):
        self.app_id = app_id
        self.state_store = defaultdict(lambda: {"count": 0, "total": 0.0, "last_seen": None})

    def process_order(self, order):
        """Process an order event (simulated KStream)."""
        user_id = order["user_id"]
        amount = order["amount"]

        store = self.state_store[user_id]
        store["count"] += 1
        store["total"] += amount
        store["last_seen"] = datetime.now().isoformat()

        # Alert if user exceeds threshold in a window
        if store["total"] > 500:
            print(f"[ALERT] User {user_id}: total ${store['total']:.2f} exceeds threshold")
        else:
            print(f"[PROCESS] User {user_id}: +${amount:.2f} (total: ${store['total']:.2f})")

    def get_snapshot(self):
        return dict(self.state_store)

# Simulate order stream
orders = [
    {"order_id": "1001", "user_id": "alice", "amount": 50.00},
    {"order_id": "1002", "user_id": "bob", "amount": 300.00},
    {"order_id": "1003", "user_id": "alice", "amount": 200.00},
    {"order_id": "1004", "user_id": "charlie", "amount": 75.00},
    {"order_id": "1005", "user_id": "alice", "amount": 400.00},
    {"order_id": "1006", "user_id": "bob", "amount": 150.00},
]

app = KafkaStreamsApp()
print("=== Kafka Streams Order Processing ===")
for order in orders:
    app.process_order(order)

print("\n=== State Store Snapshot ===")
for user, state in app.get_snapshot().items():
    print(f"  {user}: {state['count']} orders, ${state['total']:.2f} total")

Expected output:

=== Kafka Streams Order Processing ===
[PROCESS] User alice: +$50.00 (total: $50.00)
[PROCESS] User bob: +$300.00 (total: $300.00)
[PROCESS] User alice: +$200.00 (total: $250.00)
[PROCESS] User charlie: +$75.00 (total: $75.00)
[ALERT] User alice: total $650.00 exceeds threshold
[PROCESS] User bob: +$150.00 (total: $450.00)

=== State Store Snapshot ===
  alice: 3 orders, $650.00 total
  bob: 2 orders, $450.00 total
  charlie: 1 orders, $75.00 total

Exactly-Once Semantics

SemanticsDescriptionRisk
At-most-onceEvent processed at most onceData loss on failure
At-least-onceEvent processed at least onceDuplicate events
Exactly-onceEvent processed exactly onceHighest overhead

Flink achieves exactly-once through distributed checkpoints and two-phase commit (transactional sinks). Kafka Streams uses transactional producer with idempotent writes.

Stream Processing Systems Comparison

FeatureKafka StreamsApache FlinkSpark Streaming
ModelEvent-at-a-timeEvent-at-a-timeMicro-batch
LatencyMillisecondsMillisecondsSeconds
StateRocksDB state storeRocksDB / FsStateCheckpointed RDDs
Exactly-onceYes (transactional)Yes (checkpoints)Yes (WAL)
LanguageJava/Scala onlyJava/Scala/PythonScala/Python/SQL
WatermarksNo native supportAdvancedLate data handling
Best forKafka-centric appsComplex event processingMigration from batch

Common Stream Processing Mistakes

1. Assuming Events Arrive in Order

Events can be delayed by network issues, mobile offline periods, or batch uploads. Always use event time with appropriate allowed lateness.

2. Ignoring Backpressure

When producers outpace consumers, systems crash. Implement backpressure handling — Kafka Streams and Flink handle this automatically; Spark Streaming needs buffer tuning.

3. Not Handling Late Data

Late events arrive after the window closes. Use allowed lateness (Flink) or side output for late events rather than dropping them silently.

4. State Growth Without Cleanup

Stateful operations (joins, aggregations) accumulate state. Use state TTL, session windows, or compaction to prevent unbounded state growth.

5. Running Streaming and Batch Separately

Maintaining separate code for batch and streaming doubles work. Use unified APIs (Flink’s DataStream + Table API, Spark’s Structured Streaming) that work for both.

6. Not Testing Failure Scenarios

Stream processing must survive worker crashes, network partitions, and Kafka broker failures. Test exactly-once guarantees with chaos engineering.

Practice Questions

1. What is the difference between event time and processing time?

Event time is when the event actually occurred (timestamp from source). Processing time is when the system processes it. Event time handles late arrivals correctly; processing time is simpler but inaccurate.

2. What is windowing in stream processing?

Windows group unbounded streams into finite buckets for aggregation. Types include tumbling (fixed non-overlapping), sliding (fixed overlapping), and session (activity-based gaps).

3. What does exactly-once semantics mean?

Each event is processed exactly one time, despite failures, retries, or restarts. It prevents both data loss and duplicates, typically achieved through distributed checkpoints and transactional writes.

4. How does Flink handle late-arriving data?

Flink uses watermarks (a threshold indicating “no more events before this time”) with allowed lateness configuration. Late events within the allowed period trigger recomputation; beyond it, events go to a side output.

5. Challenge: Design a stream processing system for a ride-sharing app that needs real-time driver location updates (every 2 seconds), surge pricing calculations (every minute), and daily driver earnings reports.

Kafka for location events (partitioned by driver_id). Flink for surge pricing (sliding window, 1-min length, 30-sec slide). Kafka Streams for per-driver earnings (session window, 24-hour gap). Batch pipeline (Airflow + Spark) for daily reconciliation and reports.

Mini Project: Real-Time Alert System

# realtime_alert.py
# Simulate a stream processing alert system
import time
import random
from datetime import datetime
from collections import deque

class AlertSystem:
    def __init__(self, error_threshold=3, window_seconds=30):
        self.error_log = deque()
        self.threshold = error_threshold
        self.window = window_seconds

    def process_log(self, log_entry):
        timestamp = datetime.now()
        self.error_log.append((timestamp, log_entry))
        self._clean_old(timestamp)

        if log_entry["severity"] == "ERROR":
            recent_errors = [e for t, e in self.error_log if e["severity"] == "ERROR"]
            if len(recent_errors) >= self.threshold:
                print(f"[ALERT] {self.threshold}+ errors in last {self.window}s! "
                      f"Current: {len(recent_errors)} errors")
                return True
        return False

    def _clean_old(self, now):
        while self.error_log and (now - self.error_log[0][0]).total_seconds() > self.window:
            self.error_log.popleft()

    def get_stats(self):
        total = len(self.error_log)
        errors = sum(1 for _, e in self.error_log if e["severity"] == "ERROR")
        return {"total_logs": total, "errors_in_window": errors}

system = AlertSystem(error_threshold=3, window_seconds=15)

print("=== Real-Time Error Alert System ===")
severities = ["INFO", "WARN", "ERROR"]
for i in range(20):
    entry = {
        "id": i,
        "service": random.choice(["api", "worker", "db"]),
        "severity": random.choices(severities, weights=[0.5, 0.3, 0.2])[0],
        "message": f"Log entry #{i}",
    }
    system.process_log(entry)
    time.sleep(0.2)

print(f"\nFinal stats: {system.get_stats()}")

Expected output:

=== Real-Time Error Alert System ===
[ALERT] 3+ errors in last 15s! Current: 3 errors
[ALERT] 3+ errors in last 15s! Current: 4 errors

Final stats: {'total_logs': 20, 'errors_in_window': 2}

Related Concepts

What’s Next

You now understand stream processing fundamentals! Next, learn building data pipelines end-to-end, combining batch and streaming with monitoring and data quality checks. Also explore Apache Kafka deeply for event streaming.

  • Practice daily — Run the Flink word count simulation with real data
  • Build a project — Create a real-time dashboard using Kafka + Flink + WebSocket
  • Explore related topics — Check out Kafka vs Pulsar comparisons

Remember: every expert was once a beginner. Keep coding!

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro