Apache Flink — Stream Processing, Event Time, Watermarks & Windowing
Apache Flink is a distributed stream processing framework that processes data in real time — designed for event-driven applications, real-time analytics, and continuous data pipelines.
What You’ll Learn
In this tutorial, you’ll learn Flink’s stream processing model, event time vs processing time, watermarks, windowing (tumbling, sliding, session), state management, exactly-once semantics, and how Flink compares to Spark Streaming.
Why It Matters
Real-time data processing is essential for fraud detection, recommendation systems, monitoring, and IoT analytics. Flink’s true streaming architecture (not micro-batching) makes it the best choice for low-latency stream processing.
Real-World Use
Uber uses Flink for real-time fraud detection — analyzing ride transactions within milliseconds. Alibaba processes petabytes of data per day with Flink for real-time search and recommendations. Durga Antivirus Pro uses stream processing to detect threats as events arrive, not hours later.
graph LR
subgraph "Flink Architecture"
A[Data Source
Kafka / Kinesis] --> B[Flink Job]
B --> C[Operator 1
Map]
C --> D[Operator 2
Window]
D --> E[Sink
Database / Dashboard]
end
subgraph "Event Time Processing"
F[Events with Timestamps] --> G[Watermark]
G --> H[Window Trigger]
H --> I[Window Result]
end
Stream Processing Model
Unlike Spark Streaming’s micro-batch model (processing data in small batches), Flink processes each event as it arrives — a true event-by-event streaming architecture.
| Feature | Flink | Spark Streaming |
|---|---|---|
| Model | True streaming | Micro-batching |
| Latency | Milliseconds | Seconds |
| Exactly-once | Built-in | Transactional sinks |
| Event time | Native support | Requires watermarking |
| State management | Keyed state, timers | Checkpointing via DStream |
| SQL support | Flink SQL | Structured Streaming |
Event Time vs Processing Time
| Concept | Definition | Use Case |
|---|---|---|
| Event Time | When the event actually occurred | Sensor data, financial transactions |
| Ingestion Time | When Flink received the event | Log collection |
| Processing Time | The current system time | Monitoring, alerting |
Watermarks
A watermark is Flink’s mechanism to handle late events. It declares: “I’ve processed all events up to time T.”
// Conceptual watermark: events with timestamp < watermark are "complete"
// Watermark = CurrentEventTime - MaxOutOfOrdernessIf you set max out-of-orderness to 5 seconds, a watermark of 12:00:10 means all events up to 12:00:05 are expected to have arrived. Events with timestamps before the watermark are considered late.
Windowing
Flink supports several window types for aggregating over time or count.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, WindowFunction
from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.window import TumblingEventTimeWindows
# Flink word count (Python API)
env = StreamExecutionEnvironment.get_execution_environment()
# Create stream (simulated)
data = ["hello world", "hello flink", "world hello", "hello flink flink"]
stream = env.from_collection(data)
# Word count with tumbling window of 5 seconds
result = (
stream
.flat_map(lambda text: [(word, 1) for word in text.split()],
output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
.key_by(lambda x: x[0])
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum(1)
)
result.print()
env.execute("Flink Word Count")Window Types
| Window | Description | Example |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping | Every 1 hour: “sales per hour” |
| Sliding | Fixed-size, overlapping | Every 10 minutes: “30-minute moving average” |
| Session | Gap-based, dynamic size | “User session — 5 min inactivity ends session” |
| Global | Single global window | Triggers when a condition is met |
State Management
Flink’s state is keyed — each operator stores state per partition key.
| State Type | Description | Example |
|---|---|---|
| ValueState | Single value per key | Last seen temperature |
| ListState | List of values per key | Recent transactions |
| MapState | Map (key-value) per key | User session attributes |
| AggregatingState | Incremental aggregation | Running sum |
State is checkpointed to durable storage (HDFS, S3) for fault recovery.
Exactly-Once Semantics
Flink guarantees exactly-once through lightweight distributed snapshots (checkpoints). The algorithm (Chandy-Lamport) captures the state of all operators and sources simultaneously without pausing the stream.
1. Source marks all events with a checkpoint barrier
2. Barrier propagates through operators
3. Each operator snapshots its state
4. When all operators complete, the checkpoint is committed
5. On failure: restore from the last successful checkpointCommon Mistakes
- Using processing time when event time is needed: Processing time ignores event arrival delays. For accurate analytics (especially with mobile/offline sources), always use event time.
- Setting watermarks too aggressively: Too-strict watermarks cause excessive data dropping. Too-loose watermarks increase latency. Measure your data’s tardiness.
- Ignoring state size: Flink state is stored in memory or RocksDB. Unbounded state grows forever — use time-to-live (TTL) or session windows.
- Not configuring checkpointing: Without checkpointing, a Flink job loses all state on failure. Enable checkpointing for production.
- Choosing the wrong backend: For memory state (fast, limited), use HashMapStateBackend. For large state (RocksDB, slower), use EmbeddedRocksDBStateBackend.
Practice Questions
How does Flink differ from Spark Streaming? Flink processes events one-by-one (true streaming, sub-second latency). Spark Streaming processes micro-batches (second-level latency).
What is a watermark in Flink? A mechanism to track event time progress and handle late-arriving data. Events with timestamps before the watermark are considered late.
What are the different window types in Flink? Tumbling (fixed, non-overlapping), sliding (fixed, overlapping), session (dynamic, inactivity gap).
How does Flink achieve exactly-once processing? Through distributed snapshots (checkpoints) using the Chandy-Lamport algorithm. State is restored from the last checkpoint on failure.
What is keyed state vs operator state? Keyed state is partitioned by key (e.g., per user). Operator state is per parallel instance (e.g., per Kafka partition).
Challenge
Set up a Flink cluster (local mode). Create a streaming job that reads from a Kafka topic, applies a sliding window (10 min, slide 1 min) to calculate a moving average, and writes results to another Kafka topic.
Real-World Task
Using the Flink dashboard (localhost:8081 in local mode), examine the job graph. How many parallel instances does each operator have? What’s the current watermark progress?
Mini Project: Anomaly Detection Pipeline
Build a Flink job that reads sensor events (temperature, pressure) and detects anomalies: if a value exceeds 3 standard deviations from the 1-hour rolling average, emit an alert. Use event_time and sliding windows.
Security angle: Real-time anomaly detection powers cybersecurity tools. Flink’s low-latency stream processing enables Durga Antivirus Pro to detect and respond to threats in milliseconds.
What’s Next
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
What’s Next
Congratulations on completing this Apache Flink tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro