Learn Big: Apache Flink — Stream Processing, Event Time, Watermarks & Windowing

Apache Flink — Stream Processing, Event Time, Watermarks & Windowing

DodaTech Updated Jun 15, 2026 5 min read

Apache Flink is a distributed stream processing framework that processes data in real time — designed for event-driven applications, real-time analytics, and continuous data pipelines.

What You’ll Learn

In this tutorial, you’ll learn Flink’s stream processing model, event time vs processing time, watermarks, windowing (tumbling, sliding, session), state management, exactly-once semantics, and how Flink compares to Spark Streaming.

Why It Matters

Real-time data processing is essential for fraud detection, recommendation systems, monitoring, and IoT analytics. Flink’s true streaming architecture (not micro-batching) makes it the best choice for low-latency stream processing.

Real-World Use

Uber uses Flink for real-time fraud detection — analyzing ride transactions within milliseconds. Alibaba processes petabytes of data per day with Flink for real-time search and recommendations. Durga Antivirus Pro uses stream processing to detect threats as events arrive, not hours later.


graph LR
  subgraph "Flink Architecture"
    A[Data Source
Kafka / Kinesis] --> B[Flink Job]
    B --> C[Operator 1
Map]
    C --> D[Operator 2
Window]
    D --> E[Sink
Database / Dashboard]
  end
  subgraph "Event Time Processing"
    F[Events with Timestamps] --> G[Watermark]
    G --> H[Window Trigger]
    H --> I[Window Result]
  end

Stream Processing Model

Unlike Spark Streaming’s micro-batch model (processing data in small batches), Flink processes each event as it arrives — a true event-by-event streaming architecture.

Feature	Flink	Spark Streaming
Model	True streaming	Micro-batching
Latency	Milliseconds	Seconds
Exactly-once	Built-in	Transactional sinks
Event time	Native support	Requires watermarking
State management	Keyed state, timers	Checkpointing via DStream
SQL support	Flink SQL	Structured Streaming

Event Time vs Processing Time

Concept	Definition	Use Case
Event Time	When the event actually occurred	Sensor data, financial transactions
Ingestion Time	When Flink received the event	Log collection
Processing Time	The current system time	Monitoring, alerting

Watermarks

A watermark is Flink’s mechanism to handle late events. It declares: “I’ve processed all events up to time T.”

// Conceptual watermark: events with timestamp < watermark are "complete"
// Watermark = CurrentEventTime - MaxOutOfOrderness

If you set max out-of-orderness to 5 seconds, a watermark of 12:00:10 means all events up to 12:00:05 are expected to have arrived. Events with timestamps before the watermark are considered late.

Windowing

Flink supports several window types for aggregating over time or count.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, WindowFunction
from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.window import TumblingEventTimeWindows

# Flink word count (Python API)
env = StreamExecutionEnvironment.get_execution_environment()

# Create stream (simulated)
data = ["hello world", "hello flink", "world hello", "hello flink flink"]
stream = env.from_collection(data)

# Word count with tumbling window of 5 seconds
result = (
    stream
    .flat_map(lambda text: [(word, 1) for word in text.split()],
              output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
    .key_by(lambda x: x[0])
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .sum(1)
)

result.print()
env.execute("Flink Word Count")

Window Types

Window	Description	Example
Tumbling	Fixed-size, non-overlapping	Every 1 hour: “sales per hour”
Sliding	Fixed-size, overlapping	Every 10 minutes: “30-minute moving average”
Session	Gap-based, dynamic size	“User session — 5 min inactivity ends session”
Global	Single global window	Triggers when a condition is met

State Management

Flink’s state is keyed — each operator stores state per partition key.

State Type	Description	Example
ValueState	Single value per key	Last seen temperature
ListState	List of values per key	Recent transactions
MapState	Map (key-value) per key	User session attributes
AggregatingState	Incremental aggregation	Running sum

State is checkpointed to durable storage (HDFS, S3) for fault recovery.

Exactly-Once Semantics

Flink guarantees exactly-once through lightweight distributed snapshots (checkpoints). The algorithm (Chandy-Lamport) captures the state of all operators and sources simultaneously without pausing the stream.

1. Source marks all events with a checkpoint barrier
2. Barrier propagates through operators
3. Each operator snapshots its state
4. When all operators complete, the checkpoint is committed
5. On failure: restore from the last successful checkpoint

Common Mistakes

Using processing time when event time is needed: Processing time ignores event arrival delays. For accurate analytics (especially with mobile/offline sources), always use event time.
Setting watermarks too aggressively: Too-strict watermarks cause excessive data dropping. Too-loose watermarks increase latency. Measure your data’s tardiness.
Ignoring state size: Flink state is stored in memory or RocksDB. Unbounded state grows forever — use time-to-live (TTL) or session windows.
Not configuring checkpointing: Without checkpointing, a Flink job loses all state on failure. Enable checkpointing for production.
Choosing the wrong backend: For memory state (fast, limited), use HashMapStateBackend. For large state (RocksDB, slower), use EmbeddedRocksDBStateBackend.

Practice Questions

How does Flink differ from Spark Streaming? Flink processes events one-by-one (true streaming, sub-second latency). Spark Streaming processes micro-batches (second-level latency).
What is a watermark in Flink? A mechanism to track event time progress and handle late-arriving data. Events with timestamps before the watermark are considered late.
What are the different window types in Flink? Tumbling (fixed, non-overlapping), sliding (fixed, overlapping), session (dynamic, inactivity gap).
How does Flink achieve exactly-once processing? Through distributed snapshots (checkpoints) using the Chandy-Lamport algorithm. State is restored from the last checkpoint on failure.
What is keyed state vs operator state? Keyed state is partitioned by key (e.g., per user). Operator state is per parallel instance (e.g., per Kafka partition).

Challenge

Set up a Flink cluster (local mode). Create a streaming job that reads from a Kafka topic, applies a sliding window (10 min, slide 1 min) to calculate a moving average, and writes results to another Kafka topic.

Real-World Task

Using the Flink dashboard (localhost:8081 in local mode), examine the job graph. How many parallel instances does each operator have? What’s the current watermark progress?

Mini Project: Anomaly Detection Pipeline

Build a Flink job that reads sensor events (temperature, pressure) and detects anomalies: if a value exceeds 3 standard deviations from the 1-hour rolling average, emit an alert. Use event_time and sliding windows.

Security angle: Real-time anomaly detection powers cybersecurity tools. Flink’s low-latency stream processing enables Durga Antivirus Pro to detect and respond to threats in milliseconds.

What’s Next

Data Governance — Next Lesson

Review: Apache Kafka

Real-Time Analytics

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

What’s Next

Congratulations on completing this Apache Flink tutorial! Here’s where to go from here:

Practice daily — Consistency is more important than long study sessions
Build a project — Apply what you learned by building something real
Explore related topics — Check out other tutorials in the same category
Join the community — Discuss with other learners and share your progress

Remember: every expert was once a beginner. Keep coding!

Previous Power BI Guide — Data Modeling, DAX Formulas, Power Query & Dashboards Next Data Governance Explained — Catalogs, Lineage, Quality & GDPR Compliance

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Big Data & Analytics