Observability: Metrics, Tracing & Logging at Scale
Observability is the ability to understand a system’s internal state from its external outputs — metrics, traces, and logs — enabling operators to debug issues without deploying new code.
Why Observability Matters
Modern systems are too complex for traditional monitoring. A microservice deployment might have 200+ services, each running multiple instances, spread across Kubernetes clusters. When a user reports “the app is slow,” you need to find which service, which instance, and which request is responsible. Netflix runs thousands of microservices processing 2+ billion API edge requests daily — without observability, finding the root cause of a performance regression would take days instead of minutes. At DodaTech, observability patterns power real-time health monitoring in Durga Antivirus Pro and Doda Browser.
Plain-Language Explanation
Imagine you’re a doctor diagnosing a patient. You check vital signs (metrics — heart rate, temperature), you look at the patient’s history (logs — past symptoms, medications), and you trace how a specific symptom spreads (tracing — where does the pain start and radiate?). Each signal alone is useful; together they tell the full story. Observability is having all three for your software — and the tools to correlate them when something goes wrong.
```mermaid
graph TB
    subgraph "Observability Signals"
        M[Metrics<br/>CPU, latency, error rate]
        T[Traces<br/>Request paths]
        L[Logs<br/>Events, errors]
    end
    M --> A[Alerting<br/>Prometheus + Alertmanager]
    T --> J[Jaeger / Zipkin]
    L --> E[ELK / Loki]
    A --> D[Grafana Dashboard]
    J --> D
    E --> D
    D --> S[SLO Dashboard]
    style M fill:#3498db,color:#fff
    style T fill:#e67e22,color:#fff
    style L fill:#27ae60,color:#fff
    style D fill:#9b59b6,color:#fff
```
Metrics: RED and USE Methods
Metrics are numeric measurements collected over time. Two methodologies help decide what to measure:
RED Method (For Services)
| Metric | What It Measures | Example |
|---|---|---|
| Rate | Requests per second | 1500 req/s |
| Errors | Failed requests per second | 15 req/s (1%) |
| Duration | Latency distribution | p50: 50ms, p99: 500ms |
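To make the three RED columns concrete, the sketch below derives them from one window of raw request data. It is a minimal illustration, not how Prometheus computes quantiles: the helper names `percentile` and `red_summary` and the sample window are assumptions made for this example, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def red_summary(durations_ms, failures, window_seconds):
    """Summarise one time window of requests as Rate, Errors, and Duration."""
    ordered = sorted(durations_ms)
    return {
        "rate_per_s": len(ordered) / window_seconds,
        "errors_per_s": failures / window_seconds,
        "error_ratio": failures / len(ordered),
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
    }

# Hypothetical window: 100 requests in 10 seconds, 2 of them failed.
durations = list(range(1, 101))  # latencies of 1 ms .. 100 ms
summary = red_summary(durations, failures=2, window_seconds=10)
print(summary)
```

A mean would hide the slow tail here, which is why the Duration row is usually reported as percentiles rather than an average.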
```python
# Prometheus metrics for a web service
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"])
ERRORS = Counter("http_errors_total", "Total HTTP errors", ["status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["method"])

def handle_request(method: str, endpoint: str):
    start = time.time()
    REQUESTS.labels(method=method, endpoint=endpoint).inc()
    try:
        time.sleep(random.uniform(0.01, 0.3))  # simulated work
        if random.random() < 0.05:  # 5% error rate
            raise Exception("Internal error")
    except Exception:
        ERRORS.labels(status_code="500").inc()
    finally:
        # Record latency for failed requests too, so they stay in the histogram
        LATENCY.labels(method=method).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # serves the metrics endpoint on port 8000
    print("Metrics exposed on :8000")
    while True:  # simulate traffic until interrupted with Ctrl+C
        handle_request("GET", "/api/users")
```
Example output from curl localhost:8000/metrics (abridged; exact counts vary between runs):
```text
http_requests_total{endpoint="/api/users",method="GET"} 87.0
http_errors_total{status_code="500"} 4.0
http_request_duration_seconds_count{method="GET"} 87.0
```
USE Method (For Resources)
| Metric | What It Measures | Example Value |
|---|---|---|
| Utilization | % of time resource is busy | CPU: 72% |
| Saturation | Queued work the resource cannot service yet | Disk I/O wait: 12% |
| Errors | Count of resource errors | Network drops: 0.01% |
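The USE rows above can be sketched for a CPU as follows. This is a simplified model under stated assumptions: the function name `use_snapshot`, its parameters, and the sample numbers are hypothetical, and a real collector would read these values from the operating system rather than take them as arguments.

```python
def use_snapshot(busy_seconds, window_seconds, runnable_tasks, cpu_count, error_count):
    """Return Utilization, Saturation, and Errors for a CPU over one window."""
    # Utilization: share of available CPU time that was spent doing work.
    utilization = busy_seconds / (window_seconds * cpu_count)
    # Saturation: tasks that are ready to run but have no free CPU.
    saturation = max(0, runnable_tasks - cpu_count)
    return {"utilization": utilization, "saturation": saturation, "errors": error_count}

# Hypothetical 60-second window on a 4-core machine.
snap = use_snapshot(busy_seconds=172.8, window_seconds=60,
                    runnable_tasks=6, cpu_count=4, error_count=0)
print(f"CPU utilization: {snap['utilization']:.0%}, queued tasks: {snap['saturation']}")
```

Utilization and saturation answer different questions: a resource can sit at moderate utilization on average and still build a queue during bursts, so both are worth tracking.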
Distributed Tracing with OpenTelemetry
Distributed tracing follows a single request across service boundaries. Each request gets a trace ID propagated via HTTP headers. Each service creates spans representing work units.
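As a rough illustration of how a trace ID travels in HTTP headers, the sketch below builds and forwards a W3C Trace Context `traceparent` value, which has the form version, 32-hex-digit trace ID, 16-hex-digit parent span ID, and 2-hex-digit flags. The helper names `new_traceparent` and `child_traceparent` are assumptions for this example; in real services an instrumentation library normally handles propagation.

```python
import re
import secrets

# version "00", trace ID (32 hex), parent span ID (16 hex), flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and fresh span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_traceparent(incoming):
    """Header for an outgoing call: keep the trace ID, replace the span ID."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()  # missing or malformed header: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print("incoming:", incoming)
print("outgoing:", outgoing)
```

The trace ID stays constant across every hop, which is what lets a backend stitch the spans from different services into one trace.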
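```python
# OpenTelemetry tracing example
# Requires: opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The dedicated Jaeger exporter package is deprecated; recent Jaeger versions
# accept OTLP directly (gRPC on port 4317), so this example exports via OTLP.
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", "ORD-12345")
    # Child spans are nested under process_order automatically
    with tracer.start_as_current_span("validate_payment") as child:
        child.set_attribute("amount", 29.99)
        child.add_event("payment_processed")
    with tracer.start_as_current_span("update_inventory"):
        pass  # Simulated work

provider.shutdown()  # flush buffered spans before the script exits
print("Trace sent to Jaeger. View at http://localhost:16686")
```
Jaeger vs Zipkin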
| Feature | Jaeger | Zipkin |
|---|---|---|
| Storage | Elasticsearch, Cassandra | Cassandra, Elasticsearch |
| UI Features | Service graph, trace comparison | Timeline view, dependency graph |
| Sampling | Adaptive, probabilistic | Rate-limiting, probabilistic |
| OpenTelemetry | Native support | Via collector |
Structured Logging with ELK and Loki
Structured logs are JSON-formatted records that machines can parse and search efficiently. Each log entry includes a timestamp, severity level, service name, trace ID, and structured context.
```python
import json
from datetime import datetime, timezone

class StructuredLogger:
    def __init__(self, service: str):
        self.service = service

    def log(self, level: str, message: str, **kwargs):
        entry = {
            # ISO 8601 timestamp in UTC with millisecond precision
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": level,
            "service": self.service,
            "message": message,
            **kwargs,  # structured context such as order_id or trace_id
        }
        print(json.dumps(entry))

logger = StructuredLogger("order-service")
logger.log("INFO", "Order created", order_id="ORD-123", user_id=42, amount=29.99)
logger.log("ERROR", "Payment failed", order_id="ORD-123", error_code="INSUFFICIENT_FUNDS")
```
Example output (timestamps reflect when the code runs):
```text
{"timestamp": "2026-06-20T10:30:00.000+00:00", "level": "INFO", "service": "order-service", "message": "Order created", "order_id": "ORD-123", "user_id": 42, "amount": 29.99}
{"timestamp": "2026-06-20T10:30:00.500+00:00", "level": "ERROR", "service": "order-service", "message": "Payment failed", "order_id": "ORD-123", "error_code": "INSUFFICIENT_FUNDS"}
```
ELK vs Loki
| Feature | ELK Stack | Loki |
|---|---|---|
| Storage | Elasticsearch (full-text index) | Object store (label-indexed) |
| Ingestion | Logstash | Promtail |
| Query | Kibana Query Language | LogQL (PromQL-like) |
| Cost | Higher (indexes everything) | Lower (indexes labels only) |
Alerting with Prometheus and Alertmanager
Alerting turns metrics into notifications. Prometheus evaluates alert rules, and Alertmanager handles deduplication, grouping, and routing.
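The deduplication and grouping step can be pictured with the simplified model below. It is only a sketch of the idea, not Alertmanager's actual implementation: the function `group_alerts`, the label names, and the sample alerts are assumptions, with `group_by` loosely mirroring the Alertmanager routing option of the same name.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "service")):
    """Drop exact duplicates, then bucket the remaining alerts by the chosen labels."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))  # identical label sets are duplicates
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)

# Hypothetical firing alerts, including one exact duplicate.
alerts = [
    {"alertname": "HighErrorRate", "service": "orders", "instance": "a"},
    {"alertname": "HighErrorRate", "service": "orders", "instance": "a"},
    {"alertname": "HighErrorRate", "service": "orders", "instance": "b"},
    {"alertname": "HighLatency", "service": "payments", "instance": "c"},
]
grouped = group_alerts(alerts)
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Grouping means an outage that trips the same rule on many instances produces one notification per group instead of one page per instance.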
```yaml
# alerting-rules.yml
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        # Aggregate both sides first: the two counters carry different labels,
        # so dividing the raw series would match nothing.
        expr: |
          sum by (job) (rate(http_errors_total[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for service {{ $labels.job }}"
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s"
```
SLOs, SLIs, and SLAs
These define the reliability contract between the service and its users:
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | Actual measured value | Request latency p99 |
| SLO (Service Level Objective) | Target value for SLI | p99 latency < 500ms 99.9% of the time |
| SLA (Service Level Agreement) | Contract with consequences | 99.95% uptime guarantee with credits |
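An availability target translates directly into an error budget, the amount of unreliability the objective tolerates. The short sketch below works through that arithmetic; the function name `error_budget_minutes` and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for a given target over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

for target in (0.999, 0.9995):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days -> {budget:.1f} minutes of budget")
```

Moving from 99.9% to 99.95% halves the budget, from roughly 43 minutes to roughly 22 minutes per 30 days, which is why each extra "nine" is progressively more expensive to promise.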
```python
# SLO compliance tracker
import random

class SLO:
    def __init__(self, name: str, target: float, window_seconds: int = 604800):
        self.name = name
        self.target = target
        self.window_seconds = window_seconds  # 604800 s = 7 days
        self.good_events = 0
        self.total_events = 0

    def record(self, good: bool):
        self.total_events += 1
        if good:
            self.good_events += 1

    @property
    def compliance(self) -> float:
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events

    @property
    def burn_rate(self) -> float:
        # 1.0 means the error budget is being consumed exactly as fast as allowed
        error_budget = 1 - self.target
        actual_errors = 1 - self.compliance
        return actual_errors / error_budget if error_budget > 0 else 0.0

slo = SLO("p99-latency", target=0.999)
for _ in range(10000):
    # Simulate 99.9% good requests
    slo.record(good=random.random() < 0.999)
print(f"SLO compliance: {slo.compliance:.4%}")
print(f"Burn rate: {slo.burn_rate:.2f}")
```
Correlation Between Signals
The real power of observability comes from correlating the three signals:
- Metric spikes → find the trace type → look at specific trace → check logs
- Error in logs → extract trace ID → find the full trace → check metric trends
- High p99 latency → drill into slow traces → identify service → check its CPU/RED metrics
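The second workflow above (error log, then trace, then metrics) can be sketched with in-memory data. Everything here is hypothetical and stands in for real backends: the `investigate` function, the record shapes, and the sample values are assumptions, and a production setup would query a log store, a tracing backend, and a metrics database instead.

```python
# Hypothetical data standing in for a log store, a trace backend, and a metrics store.
logs = [
    {"level": "INFO", "trace_id": "t-1", "service": "order-svc", "message": "Order created"},
    {"level": "ERROR", "trace_id": "t-2", "service": "payment-svc", "message": "Payment failed"},
]
traces = {
    "t-1": [{"service": "order-svc", "duration_ms": 40}],
    "t-2": [
        {"service": "order-svc", "duration_ms": 950},
        {"service": "payment-svc", "duration_ms": 900},
    ],
}
metrics = {
    "order-svc": {"error_ratio": 0.001, "p99_ms": 120},
    "payment-svc": {"error_ratio": 0.07, "p99_ms": 910},
}

def investigate(logs, traces, metrics):
    """Walk from each error log to its full trace and to the logging service's metrics."""
    findings = []
    for entry in logs:
        if entry["level"] != "ERROR":
            continue
        spans = traces.get(entry["trace_id"], [])
        findings.append({
            "trace_id": entry["trace_id"],
            "message": entry["message"],
            "services_in_trace": [span["service"] for span in spans],
            "service_metrics": metrics.get(entry["service"], {}),
        })
    return findings

findings = investigate(logs, traces, metrics)
for finding in findings:
    print(finding)
```

The join only works because every log line carries the trace ID and every signal uses the same service name, which is the practical reason for enforcing consistent field names.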
Common Errors
No trace context propagation: Services don’t forward trace IDs via HTTP headers (traceparent, tracestate), making it impossible to correlate requests across services. Always configure your HTTP client to propagate tracing headers.
Logging without structure: Free-text logs like “User 42 logged in” are hard to parse and query reliably in ELK/Loki. Always use JSON with consistent field names (service, trace_id, duration_ms).
Too many alerts: Alerting on every metric spike creates alert fatigue. Only alert on SLO burn rate — symptoms that directly affect users, not causes like CPU spikes.
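Sampling kills debugging: Probabilistic head-based sampling at a rate below 1% means rare errors are almost never captured, because the keep-or-drop decision is made before the outcome of the request is known. Use head-based sampling for routine traffic and tail-based sampling, which decides after the trace completes, to retain errors.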
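No cardinality control: Adding high-cardinality labels (user_id, request_id) to Prometheus metrics blows up the time-series database, because every unique combination of label values becomes its own time series. As a rule of thumb, keep each metric to at most about 10K unique label values.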
Dashboard soup: Building hundreds of dashboards that nobody looks at. Create 3 tiers: executive (SLOs), service owner (RED), and deep-dive (troubleshooting).
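The sampling advice above can be illustrated with two small decision functions. This is a sketch under stated assumptions, not any particular tracer's algorithm: `head_sample` and `tail_sample` are hypothetical names, the span dictionaries are invented for the example, and in practice the tail-based decision usually lives in a collector rather than in application code.

```python
import hashlib

def head_sample(trace_id, rate):
    """Decide when the trace starts, using only the trace ID (deterministic per trace)."""
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return bucket < rate

def tail_sample(spans, rate):
    """Decide after the trace has finished: always keep traces that contain an error."""
    if any(span.get("error") for span in spans):
        return True
    # Healthy traces fall back to the probabilistic decision.
    return head_sample(spans[0]["trace_id"], rate)

failed_trace = [{"trace_id": "abc", "name": "charge_card", "error": True}]
healthy_trace = [{"trace_id": "abc", "name": "charge_card"}]

print("failed trace kept at 0% rate:", tail_sample(failed_trace, 0.0))
print("healthy trace kept at 0% rate:", tail_sample(healthy_trace, 0.0))
```

Hashing the trace ID instead of calling a random number generator makes every service reach the same decision for the same trace, so sampled traces are not left with missing spans.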
Mini Project
Build a three-signal demo with simulated metrics, traces, and logs:
```python
import json
import random
import uuid
from collections import defaultdict

class ObservabilityDemo:
    def __init__(self):
        self.metrics = defaultdict(int)
        self.traces = {}
        self.logs = []
        self.last_trace_id = None

    def simulate_request(self, service: str, endpoint: str):
        trace_id = uuid.uuid4().hex  # unique per request, so traces never overwrite each other
        self.metrics[f"requests:{service}:{endpoint}"] += 1
        success = random.random() > 0.05  # roughly 5% of requests fail
        duration_ms = round(random.uniform(10, 300), 2)  # simulated latency, no real work is done
        if not success:
            self.metrics[f"errors:{service}:{endpoint}"] += 1
        self.traces[trace_id] = {
            "service": service,
            "duration_ms": duration_ms,
            "success": success,
        }
        self.last_trace_id = trace_id
        # The log line carries the trace ID so it can be joined with the trace
        self.logs.append(json.dumps({
            "trace_id": trace_id,
            "service": service,
            "level": "INFO" if success else "ERROR",
            "message": f"Request to {endpoint} {'succeeded' if success else 'failed'}",
            "duration_ms": duration_ms,
        }))

    def report(self):
        print("=== Metrics ===")
        for k, v in sorted(self.metrics.items()):
            print(f"{k}: {v}")
        print("\n=== Recent Trace ===")
        print(json.dumps(self.traces[self.last_trace_id], indent=2))
        print("\n=== Recent Log ===")
        print(self.logs[-1])

demo = ObservabilityDemo()
for _ in range(100):
    demo.simulate_request("order-svc", "/api/orders")
demo.report()
```