Observability: Metrics, Tracing & Logging at Scale

DodaTech · Updated Jun 20, 2026 · 7 min read

Observability is the ability to understand a system’s internal state from its external outputs — metrics, traces, and logs — enabling operators to debug issues without deploying new code.

Why Observability Matters

Modern systems are too complex for traditional monitoring. A microservice deployment might have 200+ services, each running multiple instances, spread across Kubernetes clusters. When a user reports “the app is slow,” you need to find which service, which instance, and which request is responsible. Netflix runs thousands of microservices processing 2+ billion API edge requests daily — without observability, finding the root cause of a performance regression would take days instead of minutes. At DodaTech, observability patterns power real-time health monitoring in Durga Antivirus Pro and Doda Browser.

Plain-Language Explanation

Imagine you’re a doctor diagnosing a patient. You check vital signs (metrics — heart rate, temperature), you look at the patient’s history (logs — past symptoms, medications), and you trace how a specific symptom spreads (tracing — where does the pain start and radiate?). Each signal alone is useful; together they tell the full story. Observability is having all three for your software — and the tools to correlate them when something goes wrong.


graph TB
    subgraph "Observability Signals"
        M[Metrics<br/>CPU, latency, error rate]
        T[Traces<br/>Request paths]
        L[Logs<br/>Events, errors]
    end
    M --> A[Alerting<br/>Prometheus + Alertmanager]
    T --> J[Jaeger / Zipkin]
    L --> E[ELK / Loki]
    A --> D[Grafana Dashboard]
    J --> D
    E --> D
    D --> S[SLO Dashboard]
    style M fill:#3498db,color:#fff
    style T fill:#e67e22,color:#fff
    style L fill:#27ae60,color:#fff
    style D fill:#9b59b6,color:#fff

Metrics: RED and USE Methods

Metrics are numeric measurements collected over time. Two methodologies help decide what to measure:

RED Method (For Services)

| Metric | What It Measures | Example |
| --- | --- | --- |
| Rate | Requests per second | 1500 req/s |
| Errors | Failed requests per second | 15 req/s (1%) |
| Duration | Latency distribution | p50: 50ms, p99: 500ms |

# Prometheus metrics for a web service (RED: rate, errors, duration)
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"])
ERRORS = Counter("http_errors_total", "Total HTTP errors", ["status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["method"])

def handle_request(method: str, endpoint: str) -> None:
    start = time.time()
    REQUESTS.labels(method=method, endpoint=endpoint).inc()
    try:
        time.sleep(random.uniform(0.01, 0.3))  # simulated work
        if random.random() < 0.05:  # 5% error rate
            raise RuntimeError("Internal error")
    except RuntimeError:
        ERRORS.labels(status_code="500").inc()
    finally:
        # Observe latency for failed requests too, so the histogram covers every request
        LATENCY.labels(method=method).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves the metrics endpoint on port 8000
    print("Metrics exposed on :8000 (scrape with: curl localhost:8000/metrics)")
    while True:  # generate traffic until interrupted with Ctrl+C
        handle_request("GET", "/api/users")

Example lines from a scrape after 87 requests (the error count varies between runs):

http_requests_total{endpoint="/api/users",method="GET"} 87.0
http_errors_total{status_code="500"} 4.0
http_request_duration_seconds_count{method="GET"} 87.0

USE Method (For Resources)

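| Metric | What It Measures | Example Value |
| --- | --- | --- |
| Utilization | % of time the resource is busy | CPU: 72% |
| Saturation | Queued work the resource cannot service yet (for example, queue length) | Disk I/O wait: 12% |
| Errors | Count of resource errors | Network drops: 0.01% |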
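
As a rough illustration of the USE table, the sketch below derives the three numbers for one resource from raw counters. The `ResourceSample` fields and the `use_report` helper are hypothetical names chosen for this example; they are not part of any monitoring library.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    """Hypothetical counters for one resource over one sampling interval."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queued_items: int        # work waiting because the resource was busy
    operations: int          # operations attempted in the interval
    failed_operations: int   # operations that ended in an error

def use_report(sample: ResourceSample) -> dict:
    """Return utilization, saturation, and error ratio for one sample."""
    utilization = sample.busy_seconds / sample.interval_seconds
    error_ratio = (
        sample.failed_operations / sample.operations if sample.operations else 0.0
    )
    return {
        "utilization": round(utilization, 4),
        "saturation": sample.queued_items,  # queue length as the saturation signal
        "errors": round(error_ratio, 6),
    }

cpu = ResourceSample(busy_seconds=43.2, interval_seconds=60.0,
                     queued_items=3, operations=10_000, failed_operations=1)
print(use_report(cpu))
```

Here utilization comes out at 0.72 (72%), saturation is the three queued items, and the error ratio is 0.0001 (0.01%).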

Distributed Tracing with OpenTelemetry

Distributed tracing follows a single request across service boundaries. Each request gets a trace ID propagated via HTTP headers. Each service creates spans representing work units.

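# OpenTelemetry tracing example
# Requires the opentelemetry-sdk and opentelemetry-exporter-jaeger-thrift packages.
# The Jaeger exporters are deprecated upstream in favor of OTLP, which newer Jaeger versions accept directly.
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The service name is what Jaeger lists in its service dropdown
provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"}))
jaeger = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
provider.add_span_processor(BatchSpanProcessor(jaeger))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Parent span with two child spans, one per unit of work
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", "ORD-12345")
    with tracer.start_as_current_span("validate_payment") as child:
        child.set_attribute("amount", 29.99)
        child.add_event("payment_processed")
    with tracer.start_as_current_span("update_inventory"):
        pass  # Simulated work

provider.shutdown()  # flush batched spans before the process exits
print("Trace sent to Jaeger. View at http://localhost:16686")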
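
To make the header propagation described above concrete, here is a minimal standard-library sketch of the W3C Trace Context `traceparent` header (version `00`). In practice the OpenTelemetry propagators do this for you; the helper names `new_traceparent` and `child_traceparent` are hypothetical and exist only for illustration.

```python
import re
import secrets

# version "00" - 32 hex chars trace ID - 16 hex chars parent span ID - 2 hex chars flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and flags, mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

The outgoing header carries the same trace ID as the incoming one, which is what lets the tracing backend stitch spans from different services into a single trace.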

Jaeger vs Zipkin

| Feature | Jaeger | Zipkin |
| --- | --- | --- |
| Storage | Elasticsearch, Cassandra | Cassandra, Elasticsearch |
| UI Features | Service graph, trace comparison | Timeline view, dependency graph |
| Sampling | Adaptive, probabilistic | Rate-limiting, probabilistic |
| OpenTelemetry | Native support | Via collector |
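
The sampling strategies named in the comparison can be sketched in a few lines. This is an illustrative approximation, not Jaeger's or Zipkin's implementation: `head_sample` decides from the trace ID alone (similar in spirit to ratio-based samplers), and `tail_sample` decides after the trace has finished. Both function names and the span dictionary shape are assumptions made for this example.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the start of a trace, using only the trace ID.

    The decision is deterministic, so every service that sees the same
    trace ID reaches the same verdict without coordination.
    """
    bucket = int(trace_id[-16:], 16)  # lower 64 bits of the hex trace ID
    return bucket < rate * 2**64

def tail_sample(spans: list, slow_ms: float = 1000.0) -> bool:
    """Decide after the trace has finished, when errors and durations are known."""
    return any(s["error"] or s["duration_ms"] > slow_ms for s in spans)

fast_ok = [{"error": False, "duration_ms": 40.0}]
failed = [{"error": False, "duration_ms": 35.0}, {"error": True, "duration_ms": 12.0}]

print(head_sample("0" * 32, rate=0.01))  # lowest possible bucket is always kept
print(head_sample("f" * 32, rate=0.01))  # highest possible bucket is dropped at 1%
print(tail_sample(fast_ok), tail_sample(failed))
```

Head-based sampling is cheap because the decision is made up front, but it cannot know which traces will fail. Tail-based sampling keeps the interesting traces at the cost of buffering every span until the trace completes.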

Structured Logging with ELK and Loki

Structured logs are JSON-formatted records that machines can parse and search efficiently. Each log entry includes a timestamp, severity level, service name, trace ID, and structured context.

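import json
import logging
import sys
from datetime import datetime

class StructuredLogger:
    """Emit one JSON object per log line through the standard logging module."""

    def __init__(self, service: str):
        self.service = service
        self.logger = logging.getLogger(service)
        self.logger.setLevel(logging.INFO)
        if not self.logger.handlers:  # avoid duplicate handlers if constructed twice
            self.logger.addHandler(logging.StreamHandler(sys.stdout))

    def log(self, level: str, message: str, **kwargs):
        now = datetime.now()
        entry = {
            # Same "YYYY-MM-DD HH:MM:SS,mmm" layout the logging module uses by default
            "timestamp": f"{now:%Y-%m-%d %H:%M:%S},{now.microsecond // 1000:03d}",
            "level": level,
            "service": self.service,
            "message": message,
            **kwargs,  # structured context such as order_id or trace_id
        }
        # Map the level name to its numeric value; unknown names fall back to INFO
        self.logger.log(getattr(logging, level.upper(), logging.INFO), json.dumps(entry))

logger = StructuredLogger("order-service")
logger.log("INFO", "Order created", order_id="ORD-123", user_id=42, amount=29.99)
logger.log("ERROR", "Payment failed", order_id="ORD-123", error_code="INSUFFICIENT_FUNDS")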

Output:

{"timestamp": "2026-06-20 10:30:00,000", "level": "INFO", "service": "order-service", "message": "Order created", "order_id": "ORD-123", "user_id": 42, "amount": 29.99}
{"timestamp": "2026-06-20 10:30:00,500", "level": "ERROR", "service": "order-service", "message": "Payment failed", "order_id": "ORD-123", "error_code": "INSUFFICIENT_FUNDS"}
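
To show why the structure matters, the following sketch filters log lines by field value instead of by text search. The `search` helper is a hypothetical stand-in for what a log backend does at query time, and the free-text third line is included only to show that it cannot be matched on fields.

```python
import json

raw_lines = [
    '{"timestamp": "2026-06-20 10:30:00,000", "level": "INFO", "service": "order-service", '
    '"message": "Order created", "order_id": "ORD-123", "user_id": 42, "amount": 29.99}',
    '{"timestamp": "2026-06-20 10:30:00,500", "level": "ERROR", "service": "order-service", '
    '"message": "Payment failed", "order_id": "ORD-123", "error_code": "INSUFFICIENT_FUNDS"}',
    "User 42 logged in",  # free-text line: not valid JSON, so it is skipped
]

def search(lines, **filters):
    """Yield parsed entries whose fields equal every given filter value."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines cannot be queried by field
        if all(entry.get(key) == value for key, value in filters.items()):
            yield entry

for entry in search(raw_lines, level="ERROR", order_id="ORD-123"):
    print(entry["message"], entry["error_code"])
```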

ELK vs Loki

| Feature | ELK Stack | Loki |
| --- | --- | --- |
| Storage | Elasticsearch (full-text index) | Object store (label-indexed) |
| Ingestion | Logstash | Promtail |
| Query | Kibana Query Language | LogQL (PromQL-like) |
| Cost | Higher (indexes everything) | Lower (indexes labels only) |

Alerting with Prometheus and Alertmanager

Alerting turns metrics into notifications. Prometheus evaluates alert rules, and Alertmanager handles deduplication, grouping, and routing.

# alerting-rules.yml
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
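        # Aggregate both sides first: the two counters carry different labels,
        # so dividing the raw series would match nothing
        expr: sum by (job) (rate(http_errors_total[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for job {{ $labels.job }}"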

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s"
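
The `for:` clause in these rules means the condition must hold continuously before the alert fires. The sketch below imitates that pending-then-firing behavior for the error-rate rule; it is a simplified model, not Prometheus code, and the `HoldAlert` class and `error_ratio` helper are hypothetical names.

```python
from typing import Optional

def error_ratio(errors_delta: float, requests_delta: float) -> float:
    """Errors divided by requests over the same window; 0 when there is no traffic."""
    return errors_delta / requests_delta if requests_delta else 0.0

class HoldAlert:
    """One alert rule with a threshold and a `for:` hold duration in seconds."""

    def __init__(self, threshold: float, hold_seconds: float):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.pending_since: Optional[float] = None

    def evaluate(self, now: float, value: float) -> str:
        if value <= self.threshold:
            self.pending_since = None  # condition cleared: reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now   # condition just became true
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"

alert = HoldAlert(threshold=0.05, hold_seconds=180)  # mirrors "> 0.05" and "for: 3m"
# (evaluation time in seconds, errors in window, requests in window)
samples = [(0, 8, 100), (60, 9, 100), (120, 7, 100), (180, 6, 100), (240, 1, 100)]
states = [alert.evaluate(t, error_ratio(e, r)) for t, e, r in samples]
print(states)
```

The alert stays pending for the first three evaluations, fires once the condition has held for 180 seconds, and returns to inactive as soon as the ratio drops back below the threshold. A short spike that clears before the hold time never pages anyone.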

SLOs, SLIs, and SLAs

These define the reliability contract between the service and its users:

| Term | Definition | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | Actual measured value | Request latency p99 |
| SLO (Service Level Objective) | Target value for SLI | p99 latency < 500ms 99.9% of the time |
| SLA (Service Level Agreement) | Contract with consequences | 99.95% uptime guarantee with credits |

# SLO compliance tracker
import random  # used by the simulation at the bottom

class SLO:
    def __init__(self, name: str, target: float, window_seconds: int = 604800):
        self.name = name
        self.target = target
        self.window_seconds = window_seconds
        self.good_events = 0
        self.total_events = 0

    def record(self, good: bool):
        self.total_events += 1
        if good:
            self.good_events += 1

    @property
    def compliance(self) -> float:
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    @property
    def burn_rate(self) -> float:
        error_budget = 1 - self.target
        actual_errors = 1 - self.compliance
        return actual_errors / error_budget if error_budget > 0 else 0

slo = SLO("p99-latency", target=0.999)
for _ in range(10000):
    # Simulate 99.9% good requests
    slo.record(good=random.random() < 0.999)
print(f"SLO compliance: {slo.compliance:.4%}")
print(f"Burn rate: {slo.burn_rate:.2f}")
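
The error budget and burn rate can also be expressed in time, which is often easier to reason about. The helpers below are a small worked calculation under the assumption of a time-based SLO; the function names are hypothetical.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of allowed unreliability in the window for a time-based SLO."""
    return (1 - target) * window_days * 24 * 60

def hours_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 / burn_rate

print(f"{error_budget_minutes(0.999, 30):.1f} minutes of budget in 30 days")
print(f"{hours_to_exhaustion(1.0, 30):.1f} hours at burn rate 1")
print(f"{hours_to_exhaustion(14.4, 30):.1f} hours at burn rate 14.4")
```

A 99.9% target over 30 days leaves about 43.2 minutes of budget. At a burn rate of 1 the budget lasts exactly the 720-hour window; at 14.4 it is gone in about 50 hours, which is why high burn rates justify paging.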

Correlation Between Signals

The real power of observability comes from correlating the three signals:

  • Metric spikes → find the trace type → look at specific trace → check logs
  • Error in logs → extract trace ID → find the full trace → check metric trends
  • High p99 latency → drill into slow traces → identify service → check its CPU/RED metrics
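
The third workflow above (slow traces, then their logs) comes down to a join on the trace ID. The following sketch shows that join on in-memory data; the record shapes and the `attach_logs_to_slow_traces` helper are assumptions for illustration, and a real setup would run the equivalent queries against the tracing and logging backends.

```python
from collections import defaultdict

def attach_logs_to_slow_traces(traces, logs, slow_ms=500.0):
    """Return slow traces, slowest first, each with the log entries sharing its trace_id."""
    logs_by_trace = defaultdict(list)
    for entry in logs:
        logs_by_trace[entry["trace_id"]].append(entry)
    slow = [t for t in traces if t["duration_ms"] > slow_ms]
    slow.sort(key=lambda t: t["duration_ms"], reverse=True)
    return [{**t, "logs": logs_by_trace.get(t["trace_id"], [])} for t in slow]

traces = [
    {"trace_id": "a1", "service": "order-svc", "duration_ms": 120.0},
    {"trace_id": "b2", "service": "payment-svc", "duration_ms": 1840.0},
    {"trace_id": "c3", "service": "order-svc", "duration_ms": 730.0},
]
logs = [
    {"trace_id": "b2", "level": "ERROR", "message": "Payment gateway timeout"},
    {"trace_id": "a1", "level": "INFO", "message": "Order created"},
    {"trace_id": "b2", "level": "WARN", "message": "Retrying payment"},
]

for hit in attach_logs_to_slow_traces(traces, logs):
    print(hit["trace_id"], hit["service"], [e["message"] for e in hit["logs"]])
```

The join only works if every log line carries the trace ID of the request that produced it, which is why the trace ID belongs in the structured log fields.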

Common Errors

  1. No trace context propagation: Services don’t forward trace IDs via HTTP headers (traceparent, tracestate), making it impossible to correlate requests across services. Always configure your HTTP client to propagate tracing headers.

  2. Logging without structure: Free-text logs like “User 42 logged in” can’t be parsed by ELK/Loki. Always use JSON with consistent field names (service, trace_id, duration_ms).

  3. Too many alerts: Alerting on every metric spike creates alert fatigue. Only alert on SLO burn rate — symptoms that directly affect users, not causes like CPU spikes.

  4. Sampling kills debugging: Probabilistic sampling with <1% rate means rare errors are almost never captured. Use head-based sampling for common cases and tail-based sampling for errors.

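  5. No cardinality control: Adding high-cardinality labels (user_id, request_id) to Prometheus metrics blows up the time-series database, because every unique combination of label values becomes its own time series. As a rule of thumb, keep each metric to roughly 10K unique label combinations at most.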

  6. Dashboard soup: Building hundreds of dashboards that nobody looks at. Create 3 tiers: executive (SLOs), service owner (RED), and deep-dive (troubleshooting).
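
For the cardinality point in item 5, a quick worst-case estimate is the product of the number of distinct values per label. The sketch below uses made-up label counts, and `series_upper_bound` is a hypothetical helper, not a Prometheus API.

```python
from math import prod

def series_upper_bound(label_value_counts: dict) -> int:
    """Worst-case number of time series: one per combination of label values."""
    return prod(label_value_counts.values())

# Hypothetical counts of distinct values per label
safe = {"method": 5, "endpoint": 40, "status_code": 8}
risky = {**safe, "user_id": 100_000}

print(series_upper_bound(safe))   # 1600
print(series_upper_bound(risky))  # 160000000
```

Real series counts are usually lower than this bound because not every combination occurs, but the estimate shows how a single unbounded label multiplies everything else. Histograms multiply the figure again by the number of buckets.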

Practice Questions

1. What’s the difference between monitoring and observability?
Monitoring tells you something is wrong (e.g., CPU at 95%). Observability tells you why it’s wrong (e.g., a specific query type is causing high CPU because of a missing index). Monitoring is known-unknowns; observability is unknown-unknowns.
2. How does distributed tracing work without code changes?
eBPF-based tools like Pixie and Cilium can auto-instrument network calls. Service meshes like Istio can generate spans for all HTTP/gRPC traffic. But for application-level context (transaction IDs), code-level instrumentation is still needed.
3. What is an error budget?
The acceptable amount of unreliability within an SLO window. If the SLO is 99.9%, the error budget is 0.1% of total requests. A burn rate above 1 means the budget will run out before the window ends; once the budget is exhausted, the team stops shipping new features and focuses on reliability.
4. Challenge: Build a correlation dashboard.
Create a Grafana dashboard with 4 panels: service health (RED), trace explorer (Jaeger data source), log browser (Loki), and a correlation table that links trace IDs to log entries.

Mini Project

Build a three-signal demo with simulated metrics, traces, and logs:

import json
import random
import uuid
from collections import defaultdict

class ObservabilityDemo:
    def __init__(self):
        self.metrics = defaultdict(int)  # counter name -> value
        self.traces = {}                 # trace_id -> summary, in insertion order
        self.logs = []                   # JSON-encoded log lines

    def simulate_request(self, service: str, endpoint: str):
        trace_id = uuid.uuid4().hex  # 32 hex chars; a 4-digit random ID would collide
        # Simulated latency: timing an empty code path would always report ~0 ms
        duration_ms = round(random.uniform(10, 300), 2)
        success = random.random() > 0.05  # roughly 5% of requests fail
        self.metrics[f"requests:{service}:{endpoint}"] += 1
        if not success:
            self.metrics[f"errors:{service}:{endpoint}"] += 1
        self.traces[trace_id] = {
            "service": service,
            "duration_ms": duration_ms,
            "success": success
        }
        self.logs.append(json.dumps({
            "trace_id": trace_id,
            "service": service,
            "level": "INFO" if success else "ERROR",
            "message": f"Request to {endpoint} {'succeeded' if success else 'failed'}",
            "duration_ms": duration_ms
        }))

    def report(self):
        print("=== Metrics ===")
        for k, v in sorted(self.metrics.items()):
            print(f"{k}: {v}")
        print("\n=== Recent Trace ===")
        tid = next(reversed(self.traces))  # last inserted trace, not the largest ID
        print(json.dumps({"trace_id": tid, **self.traces[tid]}, indent=2))
        print("\n=== Recent Log ===")
        print(self.logs[-1])

demo = ObservabilityDemo()
for _ in range(100):
    demo.simulate_request("order-svc", "/api/orders")
demo.report()
