API Monitoring and Analytics — Complete Guide
In this tutorial, you'll learn about API Monitoring and Analytics. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
API monitoring is the practice of tracking API performance, availability, error rates, and usage patterns through metrics, logs, traces, and alerts to ensure reliable operation and data-driven improvements.
What You'll Learn
You will learn the four pillars of API observability (metrics, logs, traces, and alerts), along with tool setup using Prometheus, Grafana, and the ELK Stack, and structured logging for production API monitoring.
Why API Monitoring Matters
Without monitoring, you discover API problems when users report them. Proactive monitoring detects issues before they affect customers. APIs are business-critical systems. A 500ms increase in API latency correlates with a 7 percent reduction in conversion rates. Monitoring provides the data needed to maintain SLAs, optimize performance, and plan capacity.
Real-World Use
DodaTech monitors every API endpoint across its products. The Doda Browser sync API has Grafana dashboards showing real-time sync latency, the DodaZIP update service uses Prometheus metrics for distribution tracking, and the Durga Antivirus Pro threat intelligence API has automated alerts that page engineers when error rates exceed thresholds.
API Monitoring Learning Path
flowchart LR
A[REST API Design] --> B[Monitoring Pillars]
B --> C[Metrics]
B --> D[Logging]
B --> E[Tracing]
B --> F[Alerting]
C --> G[Dashboards]
D --> G
E --> G
F --> G
B:::current
classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px
Prerequisites
You should understand RESTful API Design Best Practices and API Development Concepts. Familiarity with Docker Basics is helpful for running monitoring tools locally.
The Four Pillars of API Observability
1. Metrics
Metrics are numerical measurements collected over time that reveal trends and patterns.
RED Metrics: Rate, Errors, Duration
| Metric | Description | Example |
|---|---|---|
| Request Rate | Requests per second | 250 req/s |
| Error Rate | Percentage of failed requests | 0.5 percent |
| Latency (p50) | Median response time | 45ms |
| Latency (p95) | 95th percentile response time | 120ms |
| Latency (p99) | 99th percentile response time | 350ms |
| Active Users | Concurrent active users | 1,240 |
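To make the table concrete, here is a minimal Python sketch that derives RED-style numbers from a batch of raw request records. The `RequestSample` type, the `red_summary` function, the sample data, and the nearest-rank percentile method are all assumptions made for illustration; Prometheus itself estimates percentiles from histogram buckets rather than from individual samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int         # HTTP status code of the response
    duration_ms: float  # time taken to serve the request


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for integer p in 1..100."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(samples, window_seconds):
    """Compute Rate, Errors, and Duration for one observation window."""
    durations = sorted(s.duration_ms for s in samples)
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "rate_per_s": len(samples) / window_seconds,
        "error_pct": 100 * errors / len(samples),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Hypothetical window: 100 requests in 10 seconds, two of them failing slowly.
samples = [RequestSample(200, float(d)) for d in range(1, 99)]
samples += [RequestSample(500, 400.0), RequestSample(503, 900.0)]
summary = red_summary(samples, window_seconds=10)
print(summary)
```

In this made-up window the two failing requests are also the slowest ones, which is why p99 sits far above p50 and p95.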
Prometheus Metrics from Express
const express = require("express");
const Prometheus = require("prom-client");

const app = express();

// Create a registry
const Register = new Prometheus.Registry();
// The option key must be `register` for default metrics to land in this registry
Prometheus.collectDefaultMetrics({ register: Register });

// Define custom metrics
const httpRequestDuration = new Prometheus.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new Prometheus.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"]
});
const activeRequests = new Prometheus.Gauge({
  name: "http_requests_active",
  help: "Number of active HTTP requests"
});
Register.registerMetric(httpRequestDuration);
Register.registerMetric(httpRequestTotal);
Register.registerMetric(activeRequests);

// Middleware to record metrics
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  activeRequests.inc();
  res.on("finish", () => {
    const labels = {
      method: req.method,
      // Prefer the route pattern so path parameters do not create new label values
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };
    httpRequestTotal.inc(labels);
    end(labels);
    activeRequests.dec();
  });
  next();
});

// Metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", Register.contentType);
  res.end(await Register.metrics());
});
Prometheus Metrics from FastAPI
import time

from fastapi import FastAPI, Request
from fastapi.responses import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    REGISTRY,
    Counter,
    Histogram,
    generate_latest,
)

app = FastAPI()

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # perf_counter is monotonic, so clock adjustments cannot skew durations
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    return response


@app.get("/metrics")
async def metrics():
    # Serve the registry in the Prometheus text exposition format
    return Response(
        content=generate_latest(REGISTRY),
        media_type=CONTENT_TYPE_LATEST
    )
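The `buckets` lists in both histograms matter because Prometheus never sees individual request durations, only cumulative counts per bucket. The following is a simplified sketch, under that assumption, of how a quantile can be estimated from such counts by linear interpolation inside the matching bucket. The function mirrors the documented behaviour of the PromQL `histogram_quantile` function but is not its actual implementation, and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count), like the `le`
    series that a Prometheus histogram exposes.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the bucket bounds used above.
buckets = [
    (0.01, 10), (0.05, 60), (0.1, 90), (0.5, 98),
    (1, 100), (2, 100), (5, 100), (math.inf, 100),
]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50={p50:.3f}s p95={p95:.3f}s p99={p99:.3f}s")
```

Because the estimate can only be as precise as the bucket it falls into, it helps to place bucket boundaries close to the latency targets you alert on.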
Grafana Dashboard Setup
# docker-compose.yml for monitoring stack
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    environment:
      # Local demo password only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'dodatech-api'
    static_configs:
      # Points at the API on the Docker host. Grafana above also publishes
      # host port 3000, so only one of them can use it; change one and keep
      # this target in sync.
      - targets: ['host.docker.internal:3000']
    metrics_path: '/metrics'
2. Logging
Structured logging produces machine-readable log entries that can be searched and analyzed.
// Structured logging with pino
const pino = require("pino");

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric default
    level(label) {
      return { level: label };
    }
  },
  timestamp: pino.stdTimeFunctions.isoTime
});

// Request logging middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    logger.info({
      // req.id is not set by Express itself; this assumes a request-ID middleware
      requestId: req.id,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      duration: Date.now() - start, // milliseconds
      userAgent: req.headers["user-agent"],
      ip: req.ip
    }, "request completed");
  });
  next();
});
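The same idea carries over to a Python service such as the FastAPI example earlier. Below is a minimal sketch using only the standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are hypothetical names chosen for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "time": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "msg": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream, level=logging.INFO):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("api")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
logger = build_logger(stream)
logger.info(
    "request completed",
    extra={"fields": {"method": "GET", "path": "/users", "statusCode": 200, "duration": 12}},
)
logger.debug("not emitted at info level")
lines = stream.getvalue().splitlines()
print(lines[0])
```

Each line is a complete JSON document, so a log shipper can parse it without custom patterns, and the debug message is dropped because the configured level is `info`.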
Log Aggregation with ELK Stack
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      # Security is disabled for local experiments only
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
3. Distributed Tracing
Tracing follows a request across multiple services.
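const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { SimpleSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const { ExpressInstrumentation } = require("@opentelemetry/instrumentation-express");

// Load this file before requiring express so its modules can be patched.
// addSpanProcessor is the SDK 1.x API; SDK 2.x takes spanProcessors in the constructor.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(new JaegerExporter())
);
provider.register();

// Instrument Express (the Express instrumentation builds on the HTTP one)
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()]
});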
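What links the spans of different services together is a trace ID carried in a request header. The sketch below illustrates that propagation step with the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`), which OpenTelemetry uses by default. The helper names and the three-service call chain are assumptions for illustration; a real service would let the OpenTelemetry SDK handle this.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming):
    """Continue a trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # A missing or malformed header starts a fresh trace.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    """Extract the trace ID from a valid traceparent header."""
    return TRACEPARENT_RE.match(header).group(1)


# Hypothetical three-hop call chain: gateway -> orders -> payments.
gateway = new_traceparent()
orders = child_traceparent(gateway)
payments = child_traceparent(orders)
for name, header in [("gateway", gateway), ("orders", orders), ("payments", payments)]:
    print(f"{name:9s} {header}")
```

All three headers share one trace ID while each hop has its own span ID, which is what lets a tracing backend such as Jaeger reassemble the full request path.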
4. Alerting
Alerting notifies you when metrics cross defined thresholds.
# prometheus-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # sum() aggregates across label sets so 5xx traffic is compared
        # with all traffic rather than with itself
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1 percent"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        # Aggregate buckets by le to get one p95 for the whole API
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p95 latency above 500ms"
      - alert: LowAvailability
        expr: |
          up{job="dodatech-api"} < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API instance is down"
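The `for:` clause in each rule means the condition must hold continuously for that long before the alert fires, which filters out brief spikes. Here is a simplified Python sketch of that state machine; the `AlertRule` class and the sample values are assumptions for illustration, and Prometheus's real evaluation has more detail than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float    # condition is "value > threshold"
    for_seconds: float  # how long the condition must hold before firing
    pending_since: Optional[float] = None

    def evaluate(self, value, now):
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            self.pending_since = None  # condition cleared, reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical error-rate readings taken every 60 seconds, with a 1 percent
# threshold and a 2 minute hold, like the HighErrorRate rule above.
rule = AlertRule(threshold=0.01, for_seconds=120)
readings = [0.02, 0.03, 0.005, 0.02, 0.02, 0.02]
states = [rule.evaluate(value, now=i * 60) for i, value in enumerate(readings)]
print(states)
```

The dip in the third reading resets the timer, so the alert only fires once the error rate has stayed above the threshold for a full two minutes.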
Key Metrics Dashboard
graph TD
A[API Monitoring Dashboard] --> B[Traffic]
A --> C[Performance]
A --> D[Errors]
A --> E[Resources]
B --> B1[Requests/sec]
B --> B2[Active Users]
B --> B3[Bandwidth]
C --> C1[p50 Latency]
C --> C2[p95 Latency]
C --> C3[p99 Latency]
D --> D1[4xx Rate]
D --> D2[5xx Rate]
D --> D3[Error Breakdown]
E --> E1[CPU Usage]
E --> E2[Memory Usage]
E --> E3[DB Connections]
API Health Check Endpoint
const axios = require("axios");

// Assumes `app` is the Express app and `db` is an initialized database client
app.get("/health", async (req, res) => {
  const health = {
    status: "ok",
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {}
  };

  // Check database
  try {
    await db.query("SELECT 1");
    health.checks.database = { status: "ok" };
  } catch (error) {
    health.checks.database = { status: "error", message: error.message };
    health.status = "degraded";
  }

  // Check external dependencies (with a timeout so the check cannot hang)
  try {
    await axios.get("https://auth.dodatech.com/health", { timeout: 2000 });
    health.checks.auth = { status: "ok" };
  } catch (error) {
    health.checks.auth = { status: "error", message: error.message };
    health.status = "degraded";
  }

  // 503 tells load balancers to stop routing traffic to this instance
  const statusCode = health.status === "ok" ? 200 : 503;
  res.status(statusCode).json(health);
});
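The aggregation logic in this handler is independent of the web framework. As a rough sketch for a Python service, the function below runs a set of named checks and folds them into one report; `run_health_checks` and the two stand-in check functions are hypothetical and would be replaced by real database and dependency probes.

```python
def run_health_checks(checks):
    """Run named check callables and fold the results into one report.

    Each check signals failure by raising an exception. Returns a tuple of
    (http_status, report), where any failed check degrades the whole report.
    """
    report = {"status": "ok", "checks": {}}
    for name, check in checks.items():
        try:
            check()
            report["checks"][name] = {"status": "ok"}
        except Exception as error:
            report["checks"][name] = {"status": "error", "message": str(error)}
            report["status"] = "degraded"
    http_status = 200 if report["status"] == "ok" else 503
    return http_status, report


def check_database():
    """Stand-in for a real probe such as running SELECT 1."""
    return None


def check_auth():
    """Stand-in for a failing dependency probe."""
    raise ConnectionError("auth service timed out")


http_status, report = run_health_checks({"database": check_database, "auth": check_auth})
print(http_status, report)
```

Keeping each check behind its own `try` block means one failing dependency is reported alongside the healthy ones instead of hiding them.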
Common Errors
No health check endpoint: Deploying APIs without a /health endpoint. Load balancers and orchestrators need health checks to route traffic. Always implement health checks that verify critical dependencies.
Logging too much or too little: Logging every debug statement in production or logging nothing. Use log levels (debug, info, warn, error) and configure the appropriate level per environment.
Not setting up alerts: Collecting metrics but never alerting on anomalies. Define alert rules for error rate, latency, and availability. Test alerts regularly to ensure they fire correctly.
Ignoring p99 latency: Monitoring only average latency. Averages hide slow requests. Track p50, p95, and p99 to understand the full latency distribution.
No distributed tracing: Running multiple microservices without tracing. When a request fails, you cannot determine which service caused it. Implement distributed tracing to correlate requests across services.
Dashboard overload: Creating dashboards with 50+ metrics that no one can understand. Start with RED metrics (Rate, Errors, Duration) for each service. Add more metrics gradually based on actual debugging needs.
Not monitoring client-side: Monitoring only server-side metrics. Slow internet connections and client-side errors are invisible from the server. Implement client-side monitoring with Real User Monitoring (RUM).
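The point about averages hiding slow requests is easy to check with a little arithmetic. The following sketch uses invented latency values and a nearest-rank p99, both assumptions for illustration.

```python
import math
import statistics

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [40] * 98 + [3000] * 2

ordered = sorted(latencies_ms)
mean_ms = statistics.mean(ordered)
p50_ms = statistics.median(ordered)
# Nearest-rank p99: the smallest value that at least 99 percent of requests do not exceed.
p99_ms = ordered[math.ceil(99 * len(ordered) / 100) - 1]

print(f"mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms")
```

The mean lands near 99 ms, which matches no request that was actually served, while p50 shows the typical 40 ms experience and p99 exposes the 3-second outliers.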
Practice Questions
- What are the four pillars of API Observability?
- What are RED metrics and why are they important?
- How do you implement a health check endpoint?
- What is the difference between p50, p95, and p99 latency?
- How do you set up Prometheus alerting rules?
Challenge
Set up a complete monitoring stack for a REST API. Implement: Prometheus metrics for request rate, error rate, and latency (p50, p95, p99), structured logging with pino in JSON format, a Grafana dashboard showing all metrics with RED-focused panels, a health check endpoint that verifies database and external dependencies, Prometheus alerting rules for error rate above 1 percent and p95 latency above 500ms, and distributed tracing with OpenTelemetry and Jaeger for a three-service architecture.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro