Cloud Monitoring and Observability Guide — Metrics, Logs, Traces, and Alerting
Cloud monitoring and observability is the practice of collecting, analyzing, and acting on metrics, logs, and traces from cloud infrastructure and applications to ensure performance, availability, and reliability.
What You’ll Learn
By the end of this tutorial, you’ll understand the three pillars of observability (metrics, logs, traces), how AWS CloudWatch, Azure Monitor, and GCP Operations Suite work, how to set up alerting and dashboards, and how to create a CloudWatch alarm with code examples.
Why Monitoring Matters
Without monitoring, you’re flying blind. You don’t know when your application slows down, when errors spike, or when a database is about to run out of storage. Good monitoring means you discover issues before customers do — and when issues happen, you have the data to fix them fast. DodaTech monitors over 200 metrics across its infrastructure for Doda Browser and Durga Antivirus Pro, with automated alerting runbooks.
Cloud Monitoring Learning Path
flowchart LR
A[Cloud Basics] --> B[Cloud Monitoring]
B --> C{You Are Here}
C --> D[Metrics]
C --> E[Logs]
C --> F[Traces]
D --> G[CPU/Memory]
D --> H[Custom Metrics]
E --> I[Centralized Logging]
F --> J[Distributed Tracing]
What Is Observability?
Think of observability like a car’s dashboard. Metrics are the speedometer and fuel gauge (numeric values over time). Logs are the event recorder — every time the door opened, engine started, or warning light flashed. Traces are like following a specific package through a factory — you can see exactly which machine processed it and how long each step took.
Without observability, you’re driving without a dashboard. You know something’s wrong (engine’s making a noise) but you have no data to diagnose it.
The Three Pillars
| Pillar | What | Example | Purpose |
|---|---|---|---|
| Metrics | Numeric values over time | CPU 75%, 200 req/s, 99.9% uptime | Trends, alerts, dashboards |
| Logs | Discrete events with timestamps | “ERROR: Connection timeout on db-01” | Debugging, forensics |
| Traces | Request flow across services | API call → auth → db → cache | Latency analysis, dependency maps |
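To make the table concrete, here is a minimal sketch of one simulated request emitting all three signal types. The step names follow the table's trace example (auth, db, cache); the durations, the `handle_request` function, and the `request_duration_ms` metric name are illustrative assumptions, not part of any cloud provider's API.

```python
import json
import time
import uuid
from dataclasses import dataclass


@dataclass
class Span:
    """One timed step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    duration_ms: float


def handle_request(path):
    """Simulate one request and return the telemetry it would emit."""
    trace_id = uuid.uuid4().hex
    # Hypothetical per-step durations in milliseconds
    steps = [("auth", 12.0), ("db", 48.5), ("cache", 3.5)]
    spans = [Span(trace_id, name, ms) for name, ms in steps]
    total_ms = sum(span.duration_ms for span in spans)

    # Metric: a numeric value with a timestamp, suited to trends and alerts
    metric = {"name": "request_duration_ms", "value": total_ms, "timestamp": time.time()}
    # Log: a discrete event, tagged with the trace ID so it can be correlated
    log = {"level": "INFO", "message": f"GET {path} completed",
           "trace_id": trace_id, "duration_ms": total_ms}
    return metric, log, spans


metric, log, spans = handle_request("/api/users")
print("metric:", metric["name"], metric["value"])
print("log:   ", json.dumps(log))
slowest = max(spans, key=lambda span: span.duration_ms)
print("trace:  slowest step is", slowest.name)
```

Sharing one `trace_id` between the log entry and the spans is what lets you move from an error log to the trace of the same request.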
CloudWatch Metrics and Alarms
# cloudwatch_demo.py
# Simulate CloudWatch metrics and alarm evaluation
import time
import random
from datetime import datetime, timedelta
from collections import deque


class CloudWatchMetric:
    """In-memory stand-in for a CloudWatch metric (keeps the last 100 datapoints)."""

    def __init__(self, namespace, metric_name, unit="Count"):
        self.namespace = namespace
        self.metric_name = metric_name
        self.unit = unit
        self.datapoints = deque(maxlen=100)

    def put_data(self, value, timestamp=None):
        if timestamp is None:
            timestamp = datetime.now()
        self.datapoints.append({"timestamp": timestamp, "value": value})

    def get_statistics(self, statistic, period_minutes=5):
        # Only datapoints from the last `period_minutes` of wall-clock time count
        cutoff = datetime.now() - timedelta(minutes=period_minutes)
        recent = [d for d in self.datapoints if d["timestamp"] >= cutoff]
        values = [d["value"] for d in recent]
        if not values:
            return 0
        if statistic == "Average":
            return sum(values) / len(values)
        elif statistic == "Sum":
            return sum(values)
        elif statistic == "Maximum":
            return max(values)
        elif statistic == "Minimum":
            return min(values)
        return values[-1]  # Unknown statistic: fall back to the latest value


class CloudWatchAlarm:
    """Compares the metric's average against a threshold and tracks state."""

    def __init__(self, name, metric, operator, threshold, period_minutes=5, alarm_actions=None):
        self.name = name
        self.metric = metric
        self.operator = operator
        self.threshold = threshold
        self.period = period_minutes
        self.alarm_actions = alarm_actions or []
        self.state = "OK"
        self.evaluation_count = 0

    def evaluate(self):
        self.evaluation_count += 1
        value = self.metric.get_statistics("Average", self.period)
        if self.operator == "GreaterThanThreshold" and value > self.threshold:
            new_state = "ALARM"
        elif self.operator == "LessThanThreshold" and value < self.threshold:
            new_state = "ALARM"
        else:
            new_state = "OK"
        # Actions fire only on the transition into ALARM, not on every evaluation
        if new_state != self.state:
            print(f" [STATE CHANGE] {self.name}: {self.state} → {new_state} (value: {value:.1f})")
            if new_state == "ALARM":
                for action in self.alarm_actions:
                    print(f" [ACTION] {action}")
        self.state = new_state
        return self.state


# Simulate monitoring
cpu_metric = CloudWatchMetric("AWS/EC2", "CPUUtilization", "Percent")
cpu_alarm = CloudWatchAlarm(
    name="High-CPU-Alarm",
    metric=cpu_metric,
    operator="GreaterThanThreshold",
    threshold=80,
    period_minutes=5,
    alarm_actions=["SNS: notify-ops@dodatech.com", "AutoScaling: add-instance"],
)
print("=== CloudWatch Monitoring Simulation ===\n")
# Each loop iteration stands for one "minute", but all 20 samples are recorded
# within about two seconds, so the 5-minute window averages every sample so far.
for i in range(20):
    cpu = random.uniform(30, 95)
    cpu_metric.put_data(cpu)
    print(f" Minute {i+1:>2}: CPU = {cpu:>5.1f}% Alarm state: {cpu_alarm.evaluate()}")
    time.sleep(0.1)
print(f"\nFinal alarm state: {cpu_alarm.state}")
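print(f"Total evaluations: {cpu_alarm.evaluation_count}")

Expected output (CPU values are random, so every run differs; this sample is illustrative. The alarm compares the average of all samples so far, and state changes print above the minute that caused them because evaluate() runs while that line is being formatted):
=== CloudWatch Monitoring Simulation ===

 [STATE CHANGE] High-CPU-Alarm: OK → ALARM (value: 86.3)
 [ACTION] SNS: notify-ops@dodatech.com
 [ACTION] AutoScaling: add-instance
 Minute  1: CPU =  86.3% Alarm state: ALARM
 Minute  2: CPU =  91.5% Alarm state: ALARM
 [STATE CHANGE] High-CPU-Alarm: ALARM → OK (value: 74.3)
 Minute  3: CPU =  45.2% Alarm state: OK
...

Azure Monitor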
# azure_monitor_demo.py
# Simulate Azure Monitor metrics and log queries
from datetime import datetime


class AzureMonitor:
    """In-memory stand-in for Azure Monitor metrics, logs, and alert rules."""

    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.metrics = {}
        self.logs = []

    def record_metric(self, name, value, unit="Count"):
        if name not in self.metrics:
            self.metrics[name] = []
        self.metrics[name].append({"timestamp": datetime.now().isoformat(), "value": value, "unit": unit})

    def log_event(self, level, message, properties=None):
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "resource": self.resource_id,
            "level": level,
            "message": message,
            "properties": properties or {},
        })

    def query(self, kql_query):
        """Simulate a Kusto Query Language (KQL) query by filtering logs on level.

        `kql_query` is a dict such as {"levels": ["ERROR"]}, not real KQL text.
        """
        results = []
        for log in self.logs:
            if log["level"] in kql_query.get("levels", ["ERROR", "WARN"]):
                results.append(log)
        return results

    def create_alert(self, metric_name, condition, threshold, action_group):
        """Simulate an alert rule by evaluating it once against the latest value."""
        current_value = self.metrics.get(metric_name, [{}])[-1].get("value", 0)
        triggered = False
        if condition == "greater_than" and current_value > threshold:
            triggered = True
        elif condition == "less_than" and current_value < threshold:
            triggered = True
        if triggered:
            print(f"[ALERT] {metric_name} triggered {condition} {threshold} (current: {current_value})")
            print(f"[ACTION] Notifying {action_group}")
        return triggered


monitor = AzureMonitor("/subscriptions/demo/resourceGroups/prod/providers/Microsoft.Compute/virtualMachines/web-01")
print("=== Azure Monitor Demo ===\n")
# Record metrics: CPU climbs from 50% to 95%, memory falls from 2048 MB to 1148 MB
for i in range(10):
    monitor.record_metric("Percentage CPU", 50 + i * 5, "Percent")
    monitor.record_metric("Available Memory MB", 2048 - i * 100, "Megabytes")
# Log events
monitor.log_event("INFO", "Application started", {"version": "1.0.0"})
monitor.log_event("WARN", "Memory usage above 75%", {"memory_mb": 1536})
monitor.log_event("ERROR", "Connection pool exhausted", {"pool_size": 10, "active": 10})
monitor.log_event("INFO", "Request completed", {"path": "/api/users", "duration_ms": 245})
# Query logs
errors = monitor.query({"levels": ["ERROR", "WARN"]})
print("Recent Errors/Warnings:")
for log in errors:
    print(f" [{log['level']}] {log['message']}")
# Create alerts
print("\nCreating alerts...")
monitor.create_alert("Percentage CPU", "greater_than", 80, "ops-team-email")
monitor.create_alert("Available Memory MB", "less_than", 500, "ops-team-sms")

Expected output:
=== Azure Monitor Demo ===

Recent Errors/Warnings:
 [WARN] Memory usage above 75%
 [ERROR] Connection pool exhausted

Creating alerts...
[ALERT] Percentage CPU triggered greater_than 80 (current: 95)
[ACTION] Notifying ops-team-email

The memory alert does not fire: the last recorded value is 1148, which is still above the threshold of 500.

GCP Operations Suite
GCP’s monitoring stack (formerly Stackdriver) includes:
- Cloud Monitoring — Metrics, dashboards, uptime checks
- Cloud Logging — Centralized log storage and querying
- Cloud Trace — Distributed tracing latency analysis
- Cloud Profiler — Continuous CPU and heap profiling
- Error Reporting — Automatic error grouping and alerting
Setting Up a CloudWatch Alarm (Real Example)
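# Create a CloudWatch alarm for high CPU on an EC2 instance
aws cloudwatch put-metric-alarm \
  --alarm-name "web-prod-high-cpu" \
  --alarm-description "Average CPU > 80% for two consecutive 5-minute periods" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

With --period 300 and --evaluation-periods 2, the alarm fires only after average CPU stays above 80% for two consecutive 5-minute periods (10 minutes). The same SNS topic is notified when the alarm returns to OK.

The following script simulates a simple monitoring dashboard:

# monitoring_dashboard.py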
# Build a simple monitoring dashboard simulation
from datetime import datetime
import random


class Dashboard:
    def __init__(self, name):
        self.name = name
        self.widgets = []

    def add_widget(self, title, metric_func, unit=""):
        # metric_func is a zero-argument callable that returns the current value
        self.widgets.append({"title": title, "metric": metric_func, "unit": unit})

    def render(self):
        print(f"\n{'='*60}")
        print(f" {self.name}")
        print(f" Last updated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"{'='*60}\n")
        for widget in self.widgets:
            value = widget["metric"]()  # Sample the metric at render time
            print(f" 📊 {widget['title']:<25} {value:>10.1f} {widget['unit']}")


dash = Dashboard("DodaTech Production Overview")
dash.add_widget("Web Servers (req/s)", lambda: random.uniform(150, 350), "req/s")
dash.add_widget("API Latency (p99)", lambda: random.uniform(45, 200), "ms")
dash.add_widget("Error Rate", lambda: random.uniform(0.1, 2.5), "%")
dash.add_widget("Database Connections", lambda: random.randint(10, 50), "conn")
dash.add_widget("Daily Active Users", lambda: random.randint(8000, 12000), "users")
dash.add_widget("S3 Storage (TB)", lambda: 12.4 + random.uniform(-0.1, 0.1), "TB")
dash.add_widget("Lambda Invocations/min", lambda: random.randint(200, 800), "inv")
dash.add_widget("Estimated Monthly Cost", lambda: random.uniform(4500, 5200), "USD")
dash.render()

Expected output (values are random, so yours will differ):

============================================================
 DodaTech Production Overview
 Last updated: 2026-06-15 10:00:00
============================================================

 📊 Web Servers (req/s)            245.3 req/s
 📊 API Latency (p99)               87.2 ms
 📊 Error Rate                       1.1 %
 📊 Database Connections            34.0 conn
 📊 Daily Active Users           10543.0 users
 📊 S3 Storage (TB)                 12.4 TB
 📊 Lambda Invocations/min         567.0 inv
 📊 Estimated Monthly Cost        4987.3 USD

Common Monitoring Mistakes
1. Alert Fatigue
Too many alerts that never get acted on. Teams ignore alerts, then miss critical ones. Only alert on actionable conditions, not every metric variance.
2. Not Setting Up Any Alerts
Your application goes down at 3 AM. You find out at 9 AM when users complain. Always set up at least basic health checks and error rate alerts.
3. Measuring Everything, Understanding Nothing
Collecting 1000 metrics but only looking at the dashboard once a week. Define SLOs (Service Level Objectives) and monitor only what matters for them.
4. No Log Aggregation
Logs are spread across 50 servers. Debugging requires SSH-ing into each one. Use centralized logging (CloudWatch Logs, Azure Log Analytics, GCP Logging).
5. Not Using Structured Logging
print(f"User {id} logged in") is hard to parse. Use structured JSON logs: {"event": "login", "user_id": 123, "timestamp": "..."} for automated analysis.
6. Ignoring Distributed Tracing
In a microservice architecture, knowing that a request failed is not enough. You need to know which service caused the failure. Implement tracing with OpenTelemetry.
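Mistake 5 above can be illustrated with Python's standard `logging` module. This is a minimal sketch, not a production setup: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `fields` key are assumptions made for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged in
        # (a convention assumed for this sketch, not a logging built-in)
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger("structured-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # Avoid duplicate handlers on repeated calls
    logger.propagate = False  # Keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Capture output in memory so the resulting line can be inspected
buffer = io.StringIO()
logger = build_logger(buffer)
logger.info("login", extra={"fields": {"user_id": 123}})

line = buffer.getvalue().strip()
print(line)

# Because every line is valid JSON, a log pipeline can filter on exact fields
parsed = json.loads(line)
print(parsed["event"], parsed["user_id"])
```

A centralized logging service can then index `event` and `user_id` as fields instead of relying on regular expressions over free text.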
Practice Questions
1. What are the three pillars of observability?
Metrics (numeric time-series data), Logs (timestamped events), and Traces (request flow across services). Together they provide complete visibility into system behavior.
2. What is the difference between monitoring and observability?
Monitoring is collecting and alerting on known metrics. Observability is the ability to understand unknown system states by exploring metrics, logs, and traces. Monitoring tells you what’s wrong; observability tells you why.
3. How does CloudWatch work?
CloudWatch collects metrics from AWS services (EC2 CPU, RDS connections, Lambda invocations) and custom application metrics. You create alarms that trigger actions (SNS, Auto Scaling) based on threshold conditions.
4. What is a good alert design pattern?
Alert on rate of errors (not individual occurrences), use multiple evaluation periods to avoid flapping, define runbooks for every alert, and test alerts regularly with chaos engineering.
5. Challenge: Design a monitoring strategy for a microservice application with 20 services, 10,000 requests/second, deployed on Kubernetes across 3 cloud regions.
Collect RED metrics (Rate, Errors, Duration) per service. Use Prometheus for metrics, OpenTelemetry for traces, and Loki for logs. Set SLOs (99.9% uptime, <200ms p99 latency). Create tiered alerts: page on-call for SLO violations, send email for warnings, and show error-budget burn rate on dashboards.
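The answers to questions 4 and 5 mention alerting on error rates, using multiple evaluation periods, and tracking burn rate against an SLO. The sketch below shows one way those ideas fit together. The function names, the sample data, and the paging and email thresholds (10 and 2) are illustrative assumptions, not recommended values.

```python
def burn_rate(errors, requests, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO permits; higher values exhaust it sooner.
    """
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate


def alert_tier(rate, page_at=10.0, email_at=2.0):
    """Map a burn rate to a notification tier (thresholds are illustrative)."""
    if rate >= page_at:
        return "page"
    if rate >= email_at:
        return "email"
    return "none"


def sustained(rates, threshold, periods):
    """True only if the last `periods` values all reach the threshold.

    Requiring several consecutive breaches keeps one noisy period
    from flapping the alert.
    """
    recent = rates[-periods:]
    return len(recent) == periods and all(r >= threshold for r in recent)


SLO_TARGET = 0.999  # 99.9% of requests succeed

# (errors, requests) per evaluation period; made-up sample data
samples = [(5, 10_000), (150, 10_000), (160, 10_000)]
rates = [burn_rate(errors, requests, SLO_TARGET) for errors, requests in samples]

for (errors, requests), rate in zip(samples, rates):
    print(f"errors={errors:>4} requests={requests} burn_rate={rate:5.1f} tier={alert_tier(rate)}")

# Page only when the two most recent periods both breach the paging threshold
should_page = sustained(rates, threshold=10.0, periods=2)
print("page on-call:", should_page)
```

Alerting on the ratio of errors to requests, rather than on individual errors, keeps the signal meaningful as traffic grows from hundreds to thousands of requests per second.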
Mini Project: Observability Dashboard Builder
# observability_dashboard.py
# Build a multi-service observability dashboard
import random
from datetime import datetime


class Service:
    def __init__(self, name):
        self.name = name
        self.metrics = {"request_rate": 0, "error_rate": 0, "latency_p99": 0, "health": "healthy"}

    def collect(self):
        """Generate random sample metrics and derive a health status from them."""
        self.metrics["request_rate"] = random.randint(50, 500)
        self.metrics["error_rate"] = random.uniform(0, 3)
        self.metrics["latency_p99"] = random.uniform(20, 500)
        # Degraded when errors exceed 2% or p99 latency exceeds 400 ms
        degraded = self.metrics["error_rate"] > 2 or self.metrics["latency_p99"] > 400
        self.metrics["health"] = "degraded" if degraded else "healthy"


class ObservabilityDashboard:
    def __init__(self):
        self.services = []

    def add_service(self, service):
        self.services.append(service)

    def refresh(self):
        print(f"\n=== Observability Dashboard [{datetime.now().strftime('%H:%M:%S')}] ===")
        print(f"{'Service':<24} {'Req/s':<10} {'Errors':<10} {'p99 (ms)':<12} {'Status'}")
        print("-" * 68)
        for svc in self.services:
            svc.collect()
            m = svc.metrics
            icon = "✓" if m["health"] == "healthy" else "⚠"
            # Format the percentage first so the % sign stays attached to the number
            errors = f"{m['error_rate']:.1f}%"
            print(f"{icon} {svc.name:<22} {m['request_rate']:<10} {errors:<10} {m['latency_p99']:<12.0f} {m['health']}")


dash = ObservabilityDashboard()
for name in ["api-gateway", "user-service", "order-service", "payment-service", "notification-service", "analytics-service", "ml-inference"]:
    dash.add_service(Service(name))
dash.refresh()
dash.refresh()

Expected output (values are random, so yours will differ):

=== Observability Dashboard [10:00:00] ===
Service                  Req/s      Errors     p99 (ms)     Status
--------------------------------------------------------------------
✓ api-gateway            342        1.2%       145          healthy
✓ user-service           156        0.5%       89           healthy
⚠ order-service          278        2.8%       456          degraded
✓ payment-service        89         0.1%       234          healthy
✓ notification-service   167        1.8%       67           healthy
...
What’s Next
You now understand cloud monitoring and observability! Apply these concepts with Prometheus and Grafana for open-source monitoring, and explore cloud security for security monitoring.
- Practice daily — Set up a dashboard for a personal project with UptimeRobot or Grafana Cloud
- Build a project — Configure CloudWatch alarms for your AWS resources
- Explore related topics — Check out OpenTelemetry for standardized observability instrumentation
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro