Observability: Metrics, Tracing & Logging at Scale

DodaTech Updated Jun 20, 2026 7 min read

Observability is the ability to understand a system’s internal state from its external outputs — metrics, traces, and logs — enabling operators to debug issues without deploying new code.

Why Observability Matters

Modern systems are too complex for traditional monitoring. A microservice deployment might have 200+ services, each running multiple instances, spread across Kubernetes clusters. When a user reports “the app is slow,” you need to find which service, which instance, and which request is responsible. Netflix runs thousands of microservices processing 2+ billion API edge requests daily — without observability, finding the root cause of a performance regression would take days instead of minutes. At DodaTech, observability patterns power real-time health monitoring in Durga Antivirus Pro and Doda Browser.

Plain-Language Explanation

Imagine you’re a doctor diagnosing a patient. You check vital signs (metrics — heart rate, temperature), you look at the patient’s history (logs — past symptoms, medications), and you trace how a specific symptom spreads (tracing — where does the pain start and radiate?). Each signal alone is useful; together they tell the full story. Observability is having all three for your software — and the tools to correlate them when something goes wrong.


graph TB
    subgraph "Observability Signals"
        M["Metrics<br/>CPU, latency, error rate"]
        T["Traces<br/>Request paths"]
        L["Logs<br/>Events, errors"]
    end
    M --> A["Alerting<br/>Prometheus + Alertmanager"]
    T --> J[Jaeger / Zipkin]
    L --> E[ELK / Loki]
    A --> D[Grafana Dashboard]
    J --> D
    E --> D
    D --> S[SLO Dashboard]
    
    style M fill:#3498db,color:#fff
    style T fill:#e67e22,color:#fff
    style L fill:#27ae60,color:#fff
    style D fill:#9b59b6,color:#fff

Metrics: RED and USE Methods

Metrics are numeric measurements collected over time. Two methodologies help decide what to measure:

RED Method (For Services)

Metric	What It Measures	Example
Rate	Requests per second	1500 req/s
Errors	Failed requests per second	15 req/s (1%)
Duration	Latency distribution	p50: 50ms, p99: 500ms

# Prometheus metrics for a web service (RED: rate, errors, duration)
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"])
ERRORS = Counter("http_errors_total", "Total HTTP errors", ["status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["method"])

def handle_request(method: str, endpoint: str) -> None:
    start = time.perf_counter()
    REQUESTS.labels(method=method, endpoint=endpoint).inc()
    try:
        time.sleep(random.uniform(0.01, 0.3))  # simulated work
        if random.random() < 0.05:  # 5% error rate
            raise RuntimeError("Internal error")
    except RuntimeError:
        ERRORS.labels(status_code="500").inc()
    finally:
        # Record latency for failed requests too, so they stay visible in the histogram
        LATENCY.labels(method=method).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves the metrics endpoint on port 8000
    print("Metrics exposed on :8000")
    while True:  # generate simulated traffic until interrupted
        handle_request("GET", "/api/users")

Expected output:

Metrics exposed on :8000

Sample lines from curl localhost:8000/metrics (counts grow while the loop runs, and the error count varies from run to run):

http_requests_total{endpoint="/api/users",method="GET"} 87.0
http_errors_total{status_code="500"} 4.0
http_request_duration_seconds_count{method="GET"} 87.0
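
The Duration row of the RED table is usually read as a percentile computed from histogram buckets. The sketch below shows one way a quantile can be estimated from cumulative buckets by linear interpolation, which is the general idea behind PromQL's histogram_quantile. The function name estimate_quantile and the bucket values are illustrative assumptions, not Prometheus source code.

```python
import math

def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label of a Prometheus histogram.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf: fall back to the last finite bound
                return lower_bound
            in_bucket = cumulative - count_below
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return lower_bound

# Hypothetical buckets: 50 requests <= 0.1s, 90 <= 0.5s, 99 <= 1s, 100 in total
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
```

With these example buckets the p95 estimate lands at roughly 0.78 s. The estimate is only as precise as the bucket boundaries, which is why bucket layout should bracket the latency targets you care about.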

USE Method (For Resources)

Metric	What It Measures	Example Value
Utilization	% of time resource is busy	CPU: 72%
Saturation	Queued work the resource cannot service yet	Disk I/O wait: 12%
Errors	Count of resource errors	Network drops: 0.01%
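
As a rough illustration of the USE table, the sketch below derives the three numbers for a CPU from two counter samples. The CpuSample fields and the sample values are hypothetical; a real collector reads equivalent counters from the operating system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CpuSample:
    busy_seconds: float    # cumulative CPU time spent doing work
    total_seconds: float   # cumulative CPU time, busy plus idle
    run_queue_length: int  # tasks waiting for a core at sample time
    error_count: int       # cumulative error counter (hypothetical)

def use_report(prev: CpuSample, curr: CpuSample, cores: int) -> dict[str, float]:
    """Compute utilization, saturation and errors between two samples."""
    elapsed = curr.total_seconds - prev.total_seconds
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return {
        # Utilization: share of the interval the resource was busy
        "utilization": (curr.busy_seconds - prev.busy_seconds) / elapsed,
        # Saturation: waiting tasks per core, above 1.0 means work is queuing
        "saturation": curr.run_queue_length / cores,
        # Errors: new errors since the previous sample
        "errors": curr.error_count - prev.error_count,
    }

report = use_report(
    CpuSample(100.0, 400.0, 0, 2),
    CpuSample(172.0, 500.0, 8, 2),
    cores=4,
)
```

Utilization and errors are computed as deltas between samples because the underlying counters only ever increase, while saturation is a point-in-time reading.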

Distributed Tracing with OpenTelemetry

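Distributed tracing follows a single request across service boundaries. Each request gets a trace ID that is propagated between services in HTTP headers (the W3C Trace Context traceparent header is the common standard). Each service creates spans: timed units of work that share the trace ID and record which span called them.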

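# OpenTelemetry tracing example (exports over OTLP, which Jaeger accepts natively)
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The service name is what the Jaeger UI lists in its service dropdown
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
# Assumes Jaeger (or an OpenTelemetry Collector) is listening for OTLP/gRPC on localhost:4317
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", "ORD-12345")
    # Child spans are nested automatically under the current span
    with tracer.start_as_current_span("validate_payment") as child:
        child.set_attribute("amount", 29.99)
        child.add_event("payment_processed")
    with tracer.start_as_current_span("update_inventory"):
        pass  # Simulated work

provider.shutdown()  # flush buffered spans before the process exits
print("Trace sent to Jaeger. View at http://localhost:16686")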
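
The header-based propagation described above can be sketched without any tracing library. The example below parses a W3C traceparent header (version 00 layout: version, trace ID, parent span ID, flags) and builds the header a service would send on its own outgoing call. The names SpanContext, parse_traceparent and child_traceparent are illustrative; in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # This sketch only handles version 00; all-zero IDs are invalid per the spec
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))

def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"

incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming) if incoming else None
```

The trace ID stays constant across every hop, which is what lets a backend stitch spans from different services into one trace; only the span ID changes per call.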

Jaeger vs Zipkin

Feature	Jaeger	Zipkin
Storage	Elasticsearch, Cassandra	Cassandra, Elasticsearch
UI Features	Service graph, trace comparison	Timeline view, dependency graph
Sampling	Adaptive, probabilistic	Rate-limiting, probabilistic
OpenTelemetry	Native support	Via collector

Structured Logging with ELK and Loki

Structured logs are JSON-formatted records that machines can parse and search efficiently. Each log entry includes a timestamp, severity level, service name, trace ID, and structured context.

import json
from datetime import datetime, timezone

class StructuredLogger:
    """Emits one JSON object per line on stdout."""

    def __init__(self, service: str):
        self.service = service

    def log(self, level: str, message: str, **context):
        entry = {
            # ISO 8601 in UTC so entries from different hosts sort consistently
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": level,
            "service": self.service,
            "message": message,
            **context,  # structured context such as trace_id or order_id
        }
        print(json.dumps(entry))

logger = StructuredLogger("order-service")
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # example trace ID, normally taken from the active span
logger.log("INFO", "Order created", trace_id=TRACE_ID, order_id="ORD-123", user_id=42, amount=29.99)
logger.log("ERROR", "Payment failed", trace_id=TRACE_ID, order_id="ORD-123", error_code="INSUFFICIENT_FUNDS")

Sample output (timestamps will differ):

{"timestamp": "2026-06-20T10:30:00.000+00:00", "level": "INFO", "service": "order-service", "message": "Order created", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "order_id": "ORD-123", "user_id": 42, "amount": 29.99}
{"timestamp": "2026-06-20T10:30:00.500+00:00", "level": "ERROR", "service": "order-service", "message": "Payment failed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "order_id": "ORD-123", "error_code": "INSUFFICIENT_FUNDS"}
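
The logger above prints directly, which bypasses log levels and handlers. One alternative, sketched here under the assumption that you want to keep Python's standard logging module, is a custom Formatter that renders each record as JSON. The class name JsonFormatter, the helper build_logger, and the convention of passing fields under an extra key named context are all choices made for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute on the record
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep the root logger from printing a second copy
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger

buffer = io.StringIO()
log = build_logger("order-service", buffer)
log.error("Payment failed", extra={"context": {"order_id": "ORD-123"}})
```

Routing through the logging module means existing library logs pick up the same JSON shape once they share the handler, at the cost of a slightly more verbose call site.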

ELK vs Loki

Feature	ELK Stack	Loki
Storage	Elasticsearch (full-text index)	Object store (label-indexed)
Ingestion	Logstash	Promtail
Query	Kibana Query Language	LogQL (PromQL-like)
Cost	Higher (indexes everything)	Lower (indexes labels only)

Alerting with Prometheus and Alertmanager

Alerting turns metrics into notifications. Prometheus evaluates alert rules, and Alertmanager handles deduplication, grouping, and routing.

# alerting-rules.yml
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
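        # Aggregate before dividing: the two counters carry different labels,
        # so an unaggregated ratio would match no series.
        expr: sum by (job) (rate(http_errors_total[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for service {{ $labels.job }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0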
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s"
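
To make the deduplication and grouping role of Alertmanager more concrete, here is a toy model of the idea: alerts with an identical label set are collapsed, and the rest are bucketed by a chosen set of labels so one notification can cover several related alerts. The function group_alerts and the sample labels are hypothetical and do not reflect Alertmanager's internals.

```python
from collections import defaultdict

Alert = dict[str, str]  # label name -> label value

def group_alerts(
    alerts: list[Alert], group_by: tuple[str, ...]
) -> dict[tuple[str, ...], list[Alert]]:
    """Drop alerts with identical labels, then bucket the rest by `group_by`."""
    groups: dict[tuple[str, ...], list[Alert]] = defaultdict(list)
    seen: set[tuple[tuple[str, str], ...]] = set()
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # same label set is already firing: deduplicate
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

firing = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "a"},
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "a"},
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "b"},
    {"alertname": "HighLatency", "severity": "warning", "instance": "a"},
]
grouped = group_alerts(firing, group_by=("alertname",))
```

Here four incoming alerts become two notifications, one per alert name, which is the effect that keeps a single outage from paging the on-call engineer dozens of times.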

SLOs, SLIs, and SLAs

These define the reliability contract between the service and its users:

Term	Definition	Example
SLI (Service Level Indicator)	Actual measured value	Request latency p99
SLO (Service Level Objective)	Target value for SLI	p99 latency < 500ms 99.9% of the time
SLA (Service Level Agreement)	Contract with consequences	99.95% uptime guarantee with credits

# SLO compliance tracker
import random

class SLO:
    def __init__(self, name: str, target: float, window_seconds: int = 604800):
        self.name = name
        self.target = target
        self.window_seconds = window_seconds
        self.good_events = 0
        self.total_events = 0

    def record(self, good: bool):
        self.total_events += 1
        if good:
            self.good_events += 1

    @property
    def compliance(self) -> float:
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    @property
    def burn_rate(self) -> float:
        error_budget = 1 - self.target
        actual_errors = 1 - self.compliance
        return actual_errors / error_budget if error_budget > 0 else 0

slo = SLO("p99-latency", target=0.999)
for _ in range(10000):
    # Simulate 99.9% good requests
    slo.record(good=random.random() < 0.999)
print(f"SLO compliance: {slo.compliance:.4%}")
print(f"Burn rate: {slo.burn_rate:.2f}")
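
A burn rate is easier to act on when translated into time. The sketch below, using hypothetical helper names, converts a burn rate into hours until the budget is gone, and derives the burn-rate threshold that corresponds to spending a given fraction of the budget within a short alerting window. It assumes the error rate stays constant.

```python
def hours_until_exhausted(
    burn_rate: float, window_hours: float, budget_remaining: float = 1.0
) -> float:
    """Hours until the error budget hits zero at a constant burn rate.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    `budget_remaining` is the unspent fraction of the budget (1.0 = untouched).
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return window_hours * budget_remaining / burn_rate

def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_hours: float
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours

# 7-day window (the default of the SLO class above is 604800 s = 168 h)
at_double_speed = hours_until_exhausted(2.0, window_hours=168.0)
# Spending 2% of a 30-day budget within one hour
fast_burn = burn_rate_threshold(0.02, alert_window_hours=1.0, slo_window_hours=720.0)
```

At a burn rate of 2 the 7-day budget lasts 84 hours, and spending 2% of a 30-day budget in one hour corresponds to a burn rate of about 14.4, a value often used as a paging threshold.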

Correlation Between Signals

The real power of observability comes from correlating the three signals:

Metric spikes → find the trace type → look at specific trace → check logs
Error in logs → extract trace ID → find the full trace → check metric trends
High p99 latency → drill into slow traces → identify service → check its CPU/RED metrics
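
The flows above all hinge on a shared trace ID. As a minimal sketch, assuming traces are held in a dict keyed by trace ID and logs are JSON lines carrying a trace_id field (both shapes are assumptions for this example), the join looks like this:

```python
import json
from collections import defaultdict

def index_logs_by_trace(log_lines: list[str]) -> dict[str, list[dict]]:
    """Group structured log entries by their trace_id field."""
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines cannot be correlated
        if isinstance(entry, dict) and "trace_id" in entry:
            by_trace[entry["trace_id"]].append(entry)
    return dict(by_trace)

def slow_traces_with_logs(
    traces: dict[str, dict], log_lines: list[str], threshold_ms: float
) -> list[dict]:
    """Return traces slower than the threshold, slowest first, with their logs attached."""
    logs = index_logs_by_trace(log_lines)
    slow = [(tid, t) for tid, t in traces.items() if t["duration_ms"] > threshold_ms]
    slow.sort(key=lambda item: item[1]["duration_ms"], reverse=True)
    return [{"trace_id": tid, **t, "logs": logs.get(tid, [])} for tid, t in slow]

sample_traces = {
    "t1": {"service": "order-svc", "duration_ms": 1200.0, "success": False},
    "t2": {"service": "order-svc", "duration_ms": 40.0, "success": True},
}
sample_logs = [
    json.dumps({"trace_id": "t1", "level": "ERROR", "message": "Payment failed"}),
    "plain text line without structure",
    json.dumps({"trace_id": "t2", "level": "INFO", "message": "Order created"}),
]
suspects = slow_traces_with_logs(sample_traces, sample_logs, threshold_ms=500)
```

The plain-text line is silently dropped from the index, which illustrates why logs without a trace ID fall out of any correlation workflow.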

Common Errors

No trace context propagation: Services don’t forward trace IDs via HTTP headers (traceparent, tracestate), making it impossible to correlate requests across services. Always configure your HTTP client to propagate tracing headers.
Logging without structure: Free-text logs like “User 42 logged in” are hard to query reliably in ELK/Loki because every search depends on fragile text matching. Use JSON with consistent field names (service, trace_id, duration_ms).
Too many alerts: Alerting on every metric spike creates alert fatigue. Alert on symptoms that directly affect users, such as SLO burn rate, rather than on causes like CPU spikes.
Sampling kills debugging: Head-based probabilistic sampling decides before the request finishes, so at a rate below 1% rare errors are almost never captured. Keep head-based sampling for routine traffic and add tail-based sampling so that traces containing errors are always retained.
No cardinality control: Adding high-cardinality labels (user_id, request_id) to Prometheus metrics creates one time series per unique label combination and blows up the time-series database. As a rule of thumb, keep each metric below roughly 10K series.
Dashboard soup: Building hundreds of dashboards that nobody looks at. Create 3 tiers: executive (SLOs), service owner (RED), and deep-dive (troubleshooting).
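
The sampling advice in the list above can be illustrated with a small decision function. It keeps every trace that contains an error or is unusually slow, and a deterministic fraction of the rest. The function name keep_trace, its parameters, and the default thresholds are assumptions for this sketch; in production this decision is typically made in a collector once all spans of a trace have arrived.

```python
import hashlib

def keep_trace(
    trace_id: str,
    has_error: bool,
    duration_ms: float,
    baseline_rate: float = 0.01,
    slow_ms: float = 1000.0,
) -> bool:
    """Tail-style sampling decision made after the trace has completed."""
    if has_error or duration_ms >= slow_ms:
        return True  # always keep the traces you will want when debugging
    # Hash the trace ID into [0, 1) so every service reaches the same verdict
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate

kept = sum(keep_trace(f"trace-{i}", False, 50.0, baseline_rate=0.1) for i in range(10_000))
```

Hashing instead of calling a random number generator makes the decision reproducible: the same trace ID is always kept or always dropped, so a trace is never stored half complete.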

Practice Questions

1. What’s the difference between monitoring and observability?

Monitoring tells you something is wrong (e.g., CPU at 95%). Observability tells you why it’s wrong (e.g., a specific query type is causing high CPU because of a missing index). Monitoring answers questions you knew to ask in advance (known unknowns); observability lets you investigate failures you did not anticipate (unknown unknowns).

2. How does distributed tracing work without code changes?

eBPF-based tools like Pixie and Cilium can auto-instrument network calls. Service meshes like Istio can generate spans for all HTTP/gRPC traffic. But for application-level context (transaction IDs), code-level instrumentation is still needed.

3. What is an error budget?

The acceptable amount of unreliability within an SLO window. If the SLO is 99.9%, the error budget is 0.1% of total requests. A burn rate above 1 means the budget will run out before the window ends; once the budget is exhausted, the team typically pauses feature releases and focuses on reliability.
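
The arithmetic behind an error budget is simple enough to sketch. The helper below is hypothetical and uses made-up traffic numbers; it converts an SLO target into an allowed number of failed requests and reports how much of that allowance has been used.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Translate an SLO target into allowed failures and budget consumption."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Rounded to hide floating-point noise such as 1000.0000000000009
    allowed = round((1.0 - slo_target) * total_requests, 6)
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": failed_requests / allowed if allowed > 0 else float("inf"),
    }

# 99.9% SLO over 1,000,000 requests with 250 failures so far
budget = error_budget(0.999, total_requests=1_000_000, failed_requests=250)
```

In this example 1,000 failures are allowed, 250 have occurred, so a quarter of the budget is spent and 750 failures remain before the SLO is breached.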

4. Challenge: Build a correlation dashboard.

Create a Grafana dashboard with 4 panels: service health (RED), trace explorer (Jaeger data source), log browser (Loki), and a correlation table that links trace IDs to log entries.

Mini Project

Build a three-signal demo with simulated metrics, traces, and logs:

import json
import random
import uuid
from collections import defaultdict

class ObservabilityDemo:
    def __init__(self):
        self.metrics = defaultdict(int)
        self.traces = {}  # insertion-ordered: the last key is the most recent trace
        self.logs = []

    def simulate_request(self, service: str, endpoint: str):
        trace_id = uuid.uuid4().hex  # unique per request, so traces never overwrite each other
        self.metrics[f"requests:{service}:{endpoint}"] += 1
        success = random.random() > 0.05  # 5% simulated failure rate
        # Simulated latency: no real work happens, so a measured duration would be ~0 ms
        duration_ms = round(random.uniform(10, 300), 2)
        if not success:
            self.metrics[f"errors:{service}:{endpoint}"] += 1
        self.traces[trace_id] = {
            "service": service,
            "duration_ms": duration_ms,
            "success": success
        }
        # The shared trace_id is what links this log line to the trace above
        self.logs.append(json.dumps({
            "trace_id": trace_id,
            "service": service,
            "level": "INFO" if success else "ERROR",
            "message": f"Request to {endpoint} {'succeeded' if success else 'failed'}",
            "duration_ms": duration_ms
        }))

    def report(self):
        print("=== Metrics ===")
        for k, v in sorted(self.metrics.items()):
            print(f"{k}: {v}")
        print("\n=== Recent Trace ===")
        tid = next(reversed(self.traces))  # most recently recorded trace
        print(f"trace_id: {tid}")
        print(json.dumps(self.traces[tid], indent=2))
        print("\n=== Recent Log ===")
        print(self.logs[-1])

demo = ObservabilityDemo()
for _ in range(100):
    demo.simulate_request("order-svc", "/api/orders")
demo.report()

Cross-References

Previous: Distributed Consensus: Paxos, Raft & Leader Election Explained
