Cloud Monitoring and Observability Guide — Metrics, Logs, Traces, and Alerting

DodaTech Updated Jun 15, 2026 9 min read

Cloud monitoring and observability is the practice of collecting, analyzing, and acting on metrics, logs, and traces from cloud infrastructure and applications to ensure performance, availability, and reliability.

What You’ll Learn

By the end of this tutorial, you’ll understand the three pillars of observability (metrics, logs, traces), how AWS CloudWatch, Azure Monitor, and GCP Operations Suite work, how to set up alerting and dashboards, and how to create a CloudWatch alarm with code examples.

Why Monitoring Matters

Without monitoring, you’re flying blind. You don’t know when your application slows down, when errors spike, or when a database is about to run out of storage. Good monitoring means you discover issues before customers do — and when issues happen, you have the data to fix them fast. DodaTech monitors over 200 metrics across its infrastructure for Doda Browser and Durga Antivirus Pro, with automated alerting runbooks.

Cloud Monitoring Learning Path


flowchart LR
  A[Cloud Basics] --> B[Cloud Monitoring]
  B --> C{You Are Here}
  C --> D[Metrics]
  C --> E[Logs]
  C --> F[Traces]
  D --> G[CPU/Memory]
  D --> H[Custom Metrics]
  E --> I[Centralized Logging]
  F --> J[Distributed Tracing]

Prerequisites: Understanding of cloud computing fundamentals. Familiarity with AWS, Azure, or GCP concepts.

What Is Observability?

Think of observability like a car’s dashboard. Metrics are the speedometer and fuel gauge (numeric values over time). Logs are the event recorder — every time the door opened, engine started, or warning light flashed. Traces are like following a specific package through a factory — you can see exactly which machine processed it and how long each step took.

Without observability, you’re driving without a dashboard. You know something’s wrong (engine’s making a noise) but you have no data to diagnose it.

The Three Pillars

| Pillar | What | Example | Purpose |
|---|---|---|---|
| Metrics | Numeric values over time | CPU 75%, 200 req/s, 99.9% uptime | Trends, alerts, dashboards |
| Logs | Discrete events with timestamps | “ERROR: Connection timeout on db-01” | Debugging, forensics |
| Traces | Request flow across services | API call → auth → db → cache | Latency analysis, dependency maps |
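
To make the table concrete, here is a minimal sketch of how one slow request might show up in each pillar. The class and field names (`MetricPoint`, `LogRecord`, `Span`) and the sample values are illustrative assumptions, not the schema of any particular vendor.

```python
from dataclasses import dataclass

@dataclass
class MetricPoint:
    name: str
    value: float
    timestamp: float

@dataclass
class LogRecord:
    level: str
    message: str
    timestamp: float

@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float

# One slow request, seen through each pillar (all values are made up)
metric = MetricPoint("http_request_duration_ms", 412.0, 1750000000.0)
log = LogRecord("ERROR", "Connection timeout on db-01", 1750000000.0)
trace = [
    Span("req-42", "api", 412.0),   # the whole request
    Span("req-42", "auth", 12.0),   # child call
    Span("req-42", "db", 380.0),    # child call
]

# The metric says "slow", the log says "timeout", the trace says "where"
slowest_child = max(trace[1:], key=lambda s: s.duration_ms)
print(slowest_child.service)  # db
```

The three records share nothing but a timestamp or trace ID, which is why correlating them in one tool is so useful in practice.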

CloudWatch Metrics and Alarms

# cloudwatch_demo.py
# Simulate CloudWatch metrics and alarm evaluation
import time
import random
from datetime import datetime, timedelta
from collections import deque

class CloudWatchMetric:
    def __init__(self, namespace, metric_name, unit="Count"):
        self.namespace = namespace
        self.metric_name = metric_name
        self.unit = unit
        self.datapoints = deque(maxlen=100)

    def put_data(self, value, timestamp=None):
        if timestamp is None:
            timestamp = datetime.now()
        self.datapoints.append({"timestamp": timestamp, "value": value})

    def get_statistics(self, statistic, period_minutes=5):
        cutoff = datetime.now() - timedelta(minutes=period_minutes)
        recent = [d for d in self.datapoints if d["timestamp"] >= cutoff]
        values = [d["value"] for d in recent]
        if not values:
            return 0
        if statistic == "Average":
            return sum(values) / len(values)
        elif statistic == "Sum":
            return sum(values)
        elif statistic == "Maximum":
            return max(values)
        elif statistic == "Minimum":
            return min(values)
        return values[-1]

class CloudWatchAlarm:
    def __init__(self, name, metric, operator, threshold, period_minutes=5, alarm_actions=None):
        self.name = name
        self.metric = metric
        self.operator = operator
        self.threshold = threshold
        self.period = period_minutes
        self.alarm_actions = alarm_actions or []
        self.state = "OK"
        self.evaluation_count = 0

    def evaluate(self):
        self.evaluation_count += 1
        value = self.metric.get_statistics("Average", self.period)

        if self.operator == "GreaterThanThreshold" and value > self.threshold:
            new_state = "ALARM"
        elif self.operator == "LessThanThreshold" and value < self.threshold:
            new_state = "ALARM"
        else:
            new_state = "OK"

        if new_state != self.state:
            print(f"  [STATE CHANGE] {self.name}: {self.state} → {new_state} (value: {value:.1f})")
            if new_state == "ALARM":
                for action in self.alarm_actions:
                    print(f"  [ACTION] {action}")
            self.state = new_state

        return self.state

# Simulate monitoring
cpu_metric = CloudWatchMetric("AWS/EC2", "CPUUtilization", "Percent")
cpu_alarm = CloudWatchAlarm(
    name="High-CPU-Alarm",
    metric=cpu_metric,
    operator="GreaterThanThreshold",
    threshold=80,
    period_minutes=5,
    alarm_actions=["SNS: notify-ops@dodatech.com", "AutoScaling: add-instance"],
)

print("=== CloudWatch Monitoring Simulation ===\n")
for i in range(20):
    cpu = random.uniform(30, 95)
    cpu_metric.put_data(cpu)
    print(f"  Minute {i+1:>2}: CPU = {cpu:>5.1f}%  Alarm state: {cpu_alarm.evaluate()}")
    time.sleep(0.1)

print(f"\nFinal alarm state: {cpu_alarm.state}")
print(f"Total evaluations: {cpu_alarm.evaluation_count}")

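Sample output (CPU values are random, so your numbers will differ). The alarm compares the average of the recent datapoints, not the latest sample, and state-change lines print before the minute summary because evaluate() runs first:

=== CloudWatch Monitoring Simulation ===

  [STATE CHANGE] High-CPU-Alarm: OK → ALARM (value: 88.4)
  [ACTION] SNS: notify-ops@dodatech.com
  [ACTION] AutoScaling: add-instance
  Minute  1: CPU =  88.4%  Alarm state: ALARM
  [STATE CHANGE] High-CPU-Alarm: ALARM → OK (value: 74.7)
  Minute  2: CPU =  61.0%  Alarm state: OK
  Minute  3: CPU =  45.2%  Alarm state: OK
  ...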

Azure Monitor

# azure_monitor_demo.py
# Simulate Azure Monitor metrics and log queries
from datetime import datetime

class AzureMonitor:
    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.metrics = {}
        self.logs = []

    def record_metric(self, name, value, unit="Count"):
        if name not in self.metrics:
            self.metrics[name] = []
        self.metrics[name].append({"timestamp": datetime.now().isoformat(), "value": value, "unit": unit})

    def log_event(self, level, message, properties=None):
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "resource": self.resource_id,
            "level": level,
            "message": message,
            "properties": properties or {},
        })

    def query(self, kql_query):
        """Simulate Kusto Query Language (KQL) on logs."""
        results = []
        for log in self.logs:
            if log["level"] in kql_query.get("levels", ["ERROR", "WARN"]):
                results.append(log)
        return results

    def create_alert(self, metric_name, condition, threshold, action_group):
        """Simulate creating an alert rule."""
        current_value = self.metrics.get(metric_name, [{}])[-1].get("value", 0)
        triggered = False
        if condition == "greater_than" and current_value > threshold:
            triggered = True
        elif condition == "less_than" and current_value < threshold:
            triggered = True

        if triggered:
            print(f"[ALERT] {metric_name} triggered {condition} {threshold} (current: {current_value})")
            print(f"[ACTION] Notifying {action_group}")
        return triggered

monitor = AzureMonitor("/subscriptions/demo/resourceGroups/prod/providers/Microsoft.Compute/virtualMachines/web-01")

print("=== Azure Monitor Demo ===\n")

# Record metrics
for i in range(10):
    monitor.record_metric("Percentage CPU", 50 + i * 5, "Percent")
    monitor.record_metric("Available Memory MB", 2048 - i * 100, "Megabytes")

# Log events
monitor.log_event("INFO", "Application started", {"version": "1.0.0"})
monitor.log_event("WARN", "Memory usage above 75%", {"memory_mb": 1536})
monitor.log_event("ERROR", "Connection pool exhausted", {"pool_size": 10, "active": 10})
monitor.log_event("INFO", "Request completed", {"path": "/api/users", "duration_ms": 245})

# Query logs
errors = monitor.query({"levels": ["ERROR", "WARN"]})
print("Recent Errors/Warnings:")
for log in errors:
    print(f"  [{log['level']}] {log['message']}")

# Create alert
print("\nCreating alerts...")
monitor.create_alert("Percentage CPU", "greater_than", 80, "ops-team-email")
monitor.create_alert("Available Memory MB", "less_than", 500, "ops-team-sms")

Expected output:

=== Azure Monitor Demo ===

Recent Errors/Warnings:
  [WARN] Memory usage above 75%
  [ERROR] Connection pool exhausted

Creating alerts...
[ALERT] Percentage CPU triggered greater_than 80 (current: 95)
[ACTION] Notifying ops-team-email

The memory alert does not fire: the last recorded value is 1148 MB, which is well above the 500 MB threshold.

GCP Operations Suite

GCP’s monitoring stack (formerly Stackdriver) includes:

  • Cloud Monitoring — Metrics, dashboards, uptime checks
  • Cloud Logging — Centralized log storage and querying
  • Cloud Trace — Distributed tracing latency analysis
  • Cloud Profiler — Continuous CPU and heap profiling
  • Error Reporting — Automatic error grouping and alerting

Setting Up a CloudWatch Alarm (Real Example)

# Create a CloudWatch alarm for high CPU on an EC2 instance
aws cloudwatch put-metric-alarm \
  --alarm-name "web-prod-high-cpu" \
  --alarm-description "Average CPU > 80% for two consecutive 5-minute periods" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
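
With `--period 300` and `--evaluation-periods 2`, the alarm enters ALARM only once two consecutive 5-minute averages breach the threshold, which filters out short spikes. The sketch below illustrates that rule in plain Python. `should_alarm` is a hypothetical helper, not part of the AWS SDK, and it ignores CloudWatch's handling of missing data.

```python
def should_alarm(period_averages, threshold, evaluation_periods):
    """Return True if the most recent `evaluation_periods` averages all exceed the threshold."""
    if evaluation_periods < 1 or len(period_averages) < evaluation_periods:
        return False  # not enough data to decide yet
    recent = period_averages[-evaluation_periods:]
    return all(avg > threshold for avg in recent)

# A single spike (91%) surrounded by normal periods does not trigger
print(should_alarm([55.0, 91.0, 62.0], threshold=80, evaluation_periods=2))  # False

# Two breaching periods in a row do
print(should_alarm([55.0, 85.0, 91.0], threshold=80, evaluation_periods=2))  # True
```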
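
A dashboard complements alarms by showing several key metrics side by side. The simulation below renders a simple text dashboard:

# monitoring_dashboard.py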
# Build a simple monitoring dashboard simulation
from datetime import datetime
import random

class Dashboard:
    def __init__(self, name):
        self.name = name
        self.widgets = []

    def add_widget(self, title, metric_func, unit=""):
        self.widgets.append({"title": title, "metric": metric_func, "unit": unit})

    def render(self):
        print(f"\n{'='*60}")
        print(f"  {self.name}")
        print(f"  Last updated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"{'='*60}\n")
        for widget in self.widgets:
            value = widget["metric"]()
            print(f"  📊 {widget['title']:<25} {value:>10.1f} {widget['unit']}")

dash = Dashboard("DodaTech Production Overview")
dash.add_widget("Web Servers (req/s)", lambda: random.uniform(150, 350), "req/s")
dash.add_widget("API Latency (p99)", lambda: random.uniform(45, 200), "ms")
dash.add_widget("Error Rate", lambda: random.uniform(0.1, 2.5), "%")
dash.add_widget("Database Connections", lambda: random.randint(10, 50), "conn")
dash.add_widget("Daily Active Users", lambda: random.randint(8000, 12000), "users")
dash.add_widget("S3 Storage (TB)", lambda: 12.4 + random.uniform(-0.1, 0.1), "TB")
dash.add_widget("Lambda Invocations/min", lambda: random.randint(200, 800), "inv")
dash.add_widget("Estimated Monthly Cost", lambda: random.uniform(4500, 5200), "USD")
dash.render()

Sample output (values are random, so your numbers will differ):

============================================================
  DodaTech Production Overview
  Last updated: 2026-06-15 10:00:00
============================================================

  📊 Web Servers (req/s)           245.3 req/s
  📊 API Latency (p99)              87.2 ms
  📊 Error Rate                      1.1 %
  📊 Database Connections           34.0 conn
  📊 Daily Active Users          10543.0 users
  📊 S3 Storage (TB)                12.4 TB
  📊 Lambda Invocations/min        567.0 inv
  📊 Estimated Monthly Cost       4987.3 USD

Common Monitoring Mistakes

1. Alert Fatigue

Too many alerts that never get acted on. Teams ignore alerts, then miss critical ones. Only alert on actionable conditions, not every metric variance.

2. Not Setting Up Any Alerts

Your application goes down at 3 AM. You find out at 9 AM when users complain. Always set up at least basic health checks and error rate alerts.

3. Measuring Everything, Understanding Nothing

Collecting 1000 metrics but only looking at the dashboard once a week. Define SLOs (Service Level Objectives) and monitor only what matters for them.

4. No Log Aggregation

Logs are spread across 50 servers. Debugging requires SSH-ing into each one. Use centralized logging (CloudWatch Logs, Azure Log Analytics, GCP Logging).

5. Not Using Structured Logging

print(f"User {id} logged in") is hard to parse. Use structured JSON logs: {"event": "login", "user_id": 123, "timestamp": "..."} for automated analysis.
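
One way to emit structured logs with Python's standard `logging` module is a custom formatter, sketched below. `JsonFormatter` and the `fields` key are names chosen for this illustration rather than a standard API; production setups often use a dedicated library instead.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line containing "event": "login" and "user_id": 123
logger.info("login", extra={"fields": {"user_id": 123}})
```

Because every line is valid JSON, a log pipeline can filter on `user_id` or `event` directly instead of relying on regular expressions.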

6. Ignoring Distributed Tracing

In a microservice architecture, knowing that a request failed is not enough. You need to know which service caused the failure. Implement distributed tracing with OpenTelemetry.
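
The idea behind tracing can be sketched with the standard library alone. This toy example is not the OpenTelemetry API: the `span` helper, the `SPANS` list, and the service names are assumptions made for illustration. Each span records its service, duration, and status under a shared trace ID, so the failing hop can be found afterwards.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # collected spans; a real tracer would export these to a backend

@contextmanager
def span(trace_id, service):
    """Record how long the wrapped block took and whether it raised."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "service": service,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "status": status,
        })

def handle_request():
    trace_id = uuid.uuid4().hex  # one ID shared by every hop of this request
    try:
        with span(trace_id, "api-gateway"):
            with span(trace_id, "auth"):
                pass  # succeeds
            with span(trace_id, "db"):
                raise TimeoutError("simulated database timeout")
    except TimeoutError:
        pass  # the caller only sees "request failed"
    return trace_id

trace_id = handle_request()
failed = [s["service"] for s in SPANS
          if s["trace_id"] == trace_id and s["status"] == "error"]
print(failed)  # ['db', 'api-gateway']
```

Inner spans finish first, so the first failed span in the list (`db`) is where the error originated, and the later one (`api-gateway`) merely propagated it.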

Practice Questions

1. What are the three pillars of observability?

Metrics (numeric time-series data), Logs (timestamped events), and Traces (request flow across services). Together they provide complete visibility into system behavior.

2. What is the difference between monitoring and observability?

Monitoring is collecting and alerting on known metrics. Observability is the ability to understand unknown system states by exploring metrics, logs, and traces. Monitoring tells you what’s wrong; observability tells you why.

3. How does CloudWatch work?

CloudWatch collects metrics from AWS services (EC2 CPU, RDS connections, Lambda invocations) and custom application metrics. You create alarms that trigger actions (SNS, Auto Scaling) based on threshold conditions.

4. What is a good alert design pattern?

Alert on rate of errors (not individual occurrences), use multiple evaluation periods to avoid flapping, define runbooks for every alert, and test alerts regularly with chaos engineering.
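
Alerting on error rate rather than on individual errors can be sketched with a sliding window over recent requests. `ErrorRateWindow` and the 5% threshold are hypothetical choices for this illustration.

```python
from collections import deque

class ErrorRateWindow:
    """Track the error ratio over the most recent `size` requests."""

    def __init__(self, size):
        self.outcomes = deque(maxlen=size)  # True = success, False = error

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

THRESHOLD = 0.05  # hypothetical: alert above 5% errors

window = ErrorRateWindow(size=100)
for _ in range(99):
    window.record(True)
window.record(False)           # one isolated failure: 1% error rate
single = window.error_rate()

for _ in range(9):
    window.record(False)       # a burst of failures: 10% error rate
burst = window.error_rate()

print(single > THRESHOLD, burst > THRESHOLD)  # False True
```

The isolated failure stays below the threshold, while the burst crosses it, so only the actionable condition pages someone.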

5. Challenge: Design a monitoring strategy for a microservice application with 20 services, 10,000 requests/second, deployed on Kubernetes across 3 cloud regions.

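Collect RED metrics (Rate, Errors, Duration) per service. Use Prometheus for metrics, OpenTelemetry for traces, and Loki for logs. Set SLOs (for example, 99.9% availability and p99 latency under 200 ms). Create tiered alerts: page on SLO violations, email for warnings, and track error-budget burn rate on dashboards. Break every view down by region as well as globally, so an outage in one of the three regions is not hidden by the other two.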
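
The arithmetic behind an SLO and its burn rate is short enough to sketch. The 30-day window and the 0.5% observed error ratio below are assumed values for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime an availability SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio, slo):
    """How many times faster than allowed the error budget is being spent."""
    return observed_error_ratio / (1 - slo)

budget = error_budget_minutes(0.999)   # 99.9% availability over 30 days
rate = burn_rate(0.005, 0.999)         # assumed: 0.5% of requests failing

print(f"{budget:.1f} minutes of error budget per 30 days")          # 43.2
print(f"burn rate {rate:.1f}x, budget gone in {30 / rate:.1f} days")  # 5.0x, 6.0 days
```

A burn rate of 1 means the budget lasts exactly the full window; anything well above 1 is a reasonable candidate for a page.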

Mini Project: Observability Dashboard Builder

# observability_dashboard.py
# Build a multi-service observability dashboard
import random
from datetime import datetime

class Service:
    def __init__(self, name):
        self.name = name
        self.metrics = {"request_rate": 0, "error_rate": 0, "latency_p99": 0, "health": "healthy"}

    def collect(self):
        self.metrics["request_rate"] = random.randint(50, 500)
        self.metrics["error_rate"] = random.uniform(0, 3)
        self.metrics["latency_p99"] = random.uniform(20, 500)
        self.metrics["health"] = "degraded" if self.metrics["error_rate"] > 2 or self.metrics["latency_p99"] > 400 else "healthy"

class ObservabilityDashboard:
    def __init__(self):
        self.services = []

    def add_service(self, service):
        self.services.append(service)

    def refresh(self):
        print(f"\n=== Observability Dashboard [{datetime.now().strftime('%H:%M:%S')}] ===")
        print(f"{'Service':<20} {'Req/s':<10} {'Errors':<10} {'p99 (ms)':<12} {'Status'}")
        print("-" * 65)
        for svc in self.services:
            svc.collect()
            m = svc.metrics
            icon = "✓" if m["health"] == "healthy" else "⚠"
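            # Attach units before padding so "%" and "ms" sit next to the numbers
            err = f"{m['error_rate']:.1f}%"
            p99 = f"{m['latency_p99']:.0f}ms"
            print(f"{icon} {svc.name:<18} {m['request_rate']:<10} {err:<10} {p99:<12} {m['health']}")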

dash = ObservabilityDashboard()
for name in ["api-gateway", "user-service", "order-service", "payment-service", "notification-service", "analytics-service", "ml-inference"]:
    dash.add_service(Service(name))

dash.refresh()
dash.refresh()

Sample output (values are random, so your numbers will differ):

=== Observability Dashboard [10:00:00] ===
Service              Req/s      Errors     p99 (ms)     Status
-----------------------------------------------------------------
✓ api-gateway        342        1.2%       145ms        healthy
✓ user-service       156        0.5%       89ms         healthy
⚠ order-service      278        2.8%       456ms        degraded
✓ payment-service    89         0.1%       234ms        healthy
✓ notification-service 167        1.8%       67ms         healthy
...

What’s Next

You now understand cloud monitoring and observability! Apply these concepts with Prometheus and Grafana for open-source monitoring, and explore cloud security for security monitoring.

  • Practice daily — Set up a dashboard for a personal project with UptimeRobot or Grafana Cloud
  • Build a project — Configure CloudWatch alarms for your AWS resources
  • Explore related topics — Check out OpenTelemetry for standardized observability instrumentation

Remember: every expert was once a beginner. Keep coding!
