API Monitoring and Analytics — Complete Guide

DodaTech Updated 2026-06-23 8 min read

In this tutorial, you'll learn about API Monitoring and Analytics. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

API monitoring is the practice of tracking API performance, availability, error rates, and usage patterns through metrics, logs, traces, and alerts to ensure reliable operation and data-driven improvements.

What You'll Learn

You will learn the four pillars of API observability (metrics, logs, traces, and alerts), along with tool setup using Prometheus, Grafana, and the ELK Stack, and structured logging for production API monitoring.

Why API Monitoring Matters

Without monitoring, you discover API problems when users report them. Proactive monitoring detects issues before they affect customers. APIs are business-critical systems. A 500ms increase in API latency correlates with a 7 percent reduction in conversion rates. Monitoring provides the data needed to maintain SLAs, optimize performance, and plan capacity.

Real-World Use

DodaTech monitors every API endpoint across its products. The Doda Browser sync API has Grafana dashboards showing real-time sync latency, the DodaZIP update service uses Prometheus metrics for distribution tracking, and the Durga Antivirus Pro threat intelligence API has automated alerts that page engineers when error rates exceed thresholds.

API Monitoring Learning Path

flowchart LR
  A[REST API Design] --> B[Monitoring Pillars]
  B --> C[Metrics]
  B --> D[Logging]
  B --> E[Tracing]
  B --> F[Alerting]
  C --> G[Dashboards]
  D --> G
  E --> G
  F --> G
  B:::current

  classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px

Prerequisites

You should be familiar with RESTful API Design Best Practices and API Development Concepts. Familiarity with Docker Basics helps when running the monitoring tools locally.

The Four Pillars of API Observability

1. Metrics

Metrics are numerical measurements collected over time that reveal trends and patterns.

RED Metrics: Rate, Errors, Duration

Metric	Description	Example
Request Rate	Requests per second	250 req/s
Error Rate	Percentage of failed requests	0.5 percent
Latency (p50)	Median response time	45ms
Latency (p95)	95th percentile response time	120ms
Latency (p99)	99th percentile response time	350ms
Active Users	Concurrent active users	1,240
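To make the table concrete, the sketch below derives the rate, error, and latency figures from a list of completed requests. It is a minimal illustration, not how Prometheus computes them: the `RequestRecord` shape, the `red_summary` helper, and the synthetic data are assumptions made for this example, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed request (hypothetical shape, for illustration only)."""
    status_code: int
    duration_ms: float


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(records, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window."""
    durations = sorted(r.duration_ms for r in records)
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "rate_per_s": len(records) / window_seconds,
        "error_pct": 100 * errors / len(records),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 synthetic requests in a 10-second window: 98 fast successes and 2 slow failures.
records = [RequestRecord(200, 40.0 + i) for i in range(98)]
records += [RequestRecord(500, 900.0), RequestRecord(503, 1200.0)]
summary = red_summary(records, window_seconds=10)
print(summary)
```

In this synthetic data the two slow failures leave p50 and p95 untouched and only surface at p99, which is why the table tracks several percentiles rather than a single latency number.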

Prometheus Metrics from Express

const express = require("express");
const client = require("prom-client");

const app = express();

// Create a dedicated registry and collect the default Node.js process metrics into it
const register = new client.Registry();
client.collectDefaultMetrics({ register });

// Define custom metrics; `registers` attaches each metric to the registry above
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const httpRequestTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register]
});

const activeRequests = new client.Gauge({
  name: "http_requests_active",
  help: "Number of active HTTP requests",
  registers: [register]
});

// Middleware to record metrics (register it before the routes it should measure)
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  activeRequests.inc();

  res.on("finish", () => {
    const labels = {
      method: req.method,
      // Prefer the route pattern (e.g. /users/:id); raw paths can create unbounded label values
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };

    httpRequestTotal.inc(labels);
    end(labels);
    activeRequests.dec();
  });

  next();
});

// Metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
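The bucket boundaries chosen for the histogram limit how precisely percentiles can be estimated later, because Prometheus stores only a cumulative count per bucket and the PromQL `histogram_quantile` function interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that interpolation in simplified form. The function is an illustration rather than the server's implementation, and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Assumes observations are spread evenly inside each bucket, as the
    linear interpolation for classic Prometheus histograms does.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must be +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == math.inf:
                # Beyond the largest finite bucket: the best available answer is that bound.
                return lower_bound
            in_bucket = cumulative - count_below
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return lower_bound


# Invented cumulative counts for the bucket layout used in the Express example.
BUCKETS = [(0.01, 10), (0.05, 50), (0.1, 80), (0.5, 95),
           (1, 99), (2, 100), (5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(f"p95 is about {p95:.3f}s, p99 is about {p99:.3f}s")
```

Every request between 0.1s and 0.5s lands in the same bucket, so the estimate cannot tell a 150ms request from a 450ms one. If you plan to alert on a specific latency threshold, it helps to make that threshold a bucket boundary.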

Prometheus Metrics from FastAPI

from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, Counter, Histogram, generate_latest
from fastapi import FastAPI, Request
from fastapi.responses import Response
import time

app = FastAPI()

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # perf_counter is monotonic, so system clock adjustments cannot skew the duration
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start

    # Note: raw paths such as /users/42 create one time series per distinct URL
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    # CONTENT_TYPE_LATEST is the content type Prometheus expects for the text exposition format
    return Response(
        content=generate_latest(REGISTRY),
        media_type=CONTENT_TYPE_LATEST
    )

Grafana Dashboard Setup

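# docker-compose.yml for monitoring stack
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    extra_hosts:
      # Lets the container reach the host on Linux; Docker Desktop resolves this name already
      - "host.docker.internal:host-gateway"

  grafana:
    image: grafana/grafana
    environment:
      # Default password for local use only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      # Host port 3001 avoids a clash with the API, which the scrape target below expects on host port 3000
      - "3001:3000"
    depends_on:
      - prometheus

# prometheus.yml (a separate file, mounted into the Prometheus container above)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'dodatech-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['host.docker.internal:3000']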

2. Logging

Structured logging produces machine-readable log entries that can be searched and analyzed.

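// Structured logging with pino
const { randomUUID } = require("crypto");
const pino = require("pino");

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    // Emit the level as a label ("info") instead of pino's default numeric value
    level(label) {
      return { level: label };
    }
  },
  timestamp: pino.stdTimeFunctions.isoTime
});

// Request logging middleware
app.use((req, res, next) => {
  const start = Date.now();
  // Express does not set req.id itself: reuse an upstream X-Request-Id header or generate one
  req.id = req.id || req.headers["x-request-id"] || randomUUID();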

  res.on("finish", () => {
    logger.info({
      requestId: req.id,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      duration: Date.now() - start,
      userAgent: req.headers["user-agent"],
      ip: req.ip
    }, "request completed");
  });

  next();
});
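For a Python service such as the FastAPI example earlier, the same one-JSON-object-per-line idea can be sketched with only the standard library. The `JsonFormatter` class, the `log_request` helper, and the field names are assumptions chosen to mirror the pino output; they are not part of any library.

```python
import io
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "msg": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("api")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_request(logger, method, path, status_code, started_at, request_id=None):
    """Log one completed request; started_at comes from time.monotonic()."""
    logger.info("request completed", extra={"fields": {
        "requestId": request_id or str(uuid.uuid4()),
        "method": method,
        "path": path,
        "statusCode": status_code,
        "durationMs": round((time.monotonic() - started_at) * 1000, 1),
    }})


# Write to an in-memory buffer here; a real service would pass sys.stdout.
buffer = io.StringIO()
logger = build_logger(buffer)
log_request(logger, "GET", "/users/42", 200, time.monotonic(), request_id="req-1")
line = buffer.getvalue().strip()
parsed = json.loads(line)
print(line)
```

Because every line parses as JSON, a log pipeline can filter on fields such as `statusCode` or `requestId` without regular expressions.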

Log Aggregation with ELK Stack

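version: '3'
services:
  elasticsearch:
    # Elastic images are published with full version tags (for example 8.12.2), not minor-only tags
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    environment:
      - discovery.type=single-node
      # Security is disabled for local experiments only; keep it enabled in production
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.2
    volumes:
      # logstash.conf must define an input listening on port 5000 and an Elasticsearch output
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.2
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch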

3. Distributed Tracing

Tracing follows a request across multiple services.

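// tracing.js: require this file before express so the instrumentations can patch it
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { SimpleSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const { ExpressInstrumentation } = require("@opentelemetry/instrumentation-express");

// Recent Jaeger releases accept OTLP directly, and the dedicated Jaeger exporter package is deprecated.
// The URL assumes a local Jaeger instance on the default OTLP/HTTP port.
const exporter = new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" });

// SimpleSpanProcessor exports each span immediately; prefer BatchSpanProcessor in production.
// Recent SDK releases take span processors through the constructor.
const provider = new NodeTracerProvider({
  spanProcessors: [new SimpleSpanProcessor(exporter)]
});
provider.register();

// Instrument Express; it relies on the HTTP instrumentation for incoming request spans
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()]
});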
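Instrumentation libraries link spans from different services by passing trace context in request headers. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, which carries a version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags. The sketch below shows that mechanism by hand, purely for illustration. The helper names are invented, and in practice the SDK does this for you.

```python
import re
import secrets

# traceparent format (version 00): 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming):
    """Continue an incoming trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming or "")
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        # Missing or invalid header (all-zero IDs are invalid): start a fresh trace.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Simulate a request passing through three hypothetical services.
gateway_header = new_traceparent()
orders_header = child_traceparent(gateway_header)
payments_header = child_traceparent(orders_header)

for service, header in [("gateway", gateway_header),
                        ("orders", orders_header),
                        ("payments", payments_header)]:
    print(f"{service:9s} {header}")
```

All three headers share one trace ID while each hop gets its own span ID, which is what lets a tracing backend reassemble the request path across services.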

4. Alerting

Alerting notifies you when metrics cross defined thresholds.

# prometheus-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
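        # Sum both sides: dividing unaggregated series matches each 5xx series with itself,
        # so the ratio is always 1 whenever any 5xx response occurs
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01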
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1 percent"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
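        # Aggregate buckets by le so the quantile covers the whole API instead of each label combination
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5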
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API p95 latency above 500ms"

      - alert: LowAvailability
        expr: |
          up{job="dodatech-api"} < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API instance is down"
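The `for` clause in these rules means an alert does not fire the moment its expression becomes true. It first enters a pending state and fires only after the condition has held for the whole duration, and any evaluation where the condition is false resets the timer. The simulation below sketches that behavior for the error-rate rule. The `AlertRule` class, the 30-second evaluation interval, and the sample values are assumptions for illustration, not Prometheus code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # time at which the condition first became true

    def evaluate(self, now, value):
        """Return "inactive", "pending", or "firing" for one evaluation cycle."""
        if value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Error ratio sampled every 30 seconds; the rule mirrors "> 0.01" with "for: 2m".
rule = AlertRule(threshold=0.01, for_seconds=120)
error_ratios = [0.002, 0.02, 0.03, 0.005, 0.02, 0.02, 0.02, 0.02, 0.02]
states = [rule.evaluate(now=i * 30, value=ratio) for i, ratio in enumerate(error_ratios)]

for i, (ratio, state) in enumerate(zip(error_ratios, states)):
    print(f"t={i * 30:3d}s ratio={ratio:.3f} -> {state}")
```

The brief recovery at the fourth sample resets the timer, so the short spike never pages anyone, while the sustained breach that follows does. Choosing the `for` duration is a trade-off between alert noise and detection delay.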

Key Metrics Dashboard

graph TD
  A[API Monitoring Dashboard] --> B[Traffic]
  A --> C[Performance]
  A --> D[Errors]
  A --> E[Resources]
  B --> B1[Requests/sec]
  B --> B2[Active Users]
  B --> B3[Bandwidth]
  C --> C1[p50 Latency]
  C --> C2[p95 Latency]
  C --> C3[p99 Latency]
  D --> D1[4xx Rate]
  D --> D2[5xx Rate]
  D --> D3[Error Breakdown]
  E --> E1[CPU Usage]
  E --> E2[Memory Usage]
  E --> E3[DB Connections]

API Health Check Endpoint

app.get("/health", async (req, res) => {
  const health = {
    status: "ok",
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {}
  };

  // Check database
  try {
    await db.query("SELECT 1");
    health.checks.database = { status: "ok" };
  } catch (error) {
    health.checks.database = { status: "error", message: error.message };
    health.status = "degraded";
  }

  // Check external dependencies
  try {
    // A timeout keeps a hung dependency from stalling the health check itself
    await axios.get("https://auth.dodatech.com/health", { timeout: 2000 });
    health.checks.auth = { status: "ok" };
  } catch (error) {
    health.checks.auth = { status: "error", message: error.message };
    health.status = "degraded";
  }

  const statusCode = health.status === "ok" ? 200 : 503;
  res.status(statusCode).json(health);
});
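The aggregation logic in this handler (run each check, record its result, and downgrade the overall status on any failure) can be separated from the web framework so that it is easy to unit test. The sketch below shows one way to do that in Python. The `run_health_checks` function and the two stand-in checks are hypothetical and exist only for this example.

```python
import time
from datetime import datetime, timezone

START_TIME = time.monotonic()


def run_health_checks(checks):
    """Run each named check; an exception marks that check failed and the service degraded."""
    health = {
        "status": "ok",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "uptime": time.monotonic() - START_TIME,
        "checks": {},
    }
    for name, check in checks.items():
        try:
            check()
            health["checks"][name] = {"status": "ok"}
        except Exception as error:
            health["checks"][name] = {"status": "error", "message": str(error)}
            health["status"] = "degraded"
    status_code = 200 if health["status"] == "ok" else 503
    return status_code, health


def database_ok():
    """Stand-in for a real check such as running SELECT 1."""


def auth_down():
    """Stand-in for a failing call to an external dependency."""
    raise ConnectionError("auth service unreachable")


code, body = run_health_checks({"database": database_ok, "auth": auth_down})
print(code, body["status"], body["checks"])
```

Returning the status code and payload as plain values keeps the function independent of any framework, so a FastAPI or Flask route can simply pass them through.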

Common Errors

No health check endpoint: Deploying APIs without a /health endpoint. Load balancers and orchestrators need health checks to route traffic. Always implement health checks that verify critical dependencies.
Logging too much or too little: Logging every debug statement in production, or logging nothing. Use log levels (debug, info, warn, error) and configure the appropriate level per environment.
Not setting up alerts: Collecting metrics but never alerting on anomalies. Define alert rules for error rate, latency, and availability. Test alerts regularly to ensure they fire correctly.
Ignoring p99 latency: Monitoring only average latency. Averages hide slow requests. Track p50, p95, and p99 to understand the full latency distribution.
No distributed tracing: Running multiple microservices without tracing. When a request fails, you cannot determine which service caused it. Implement distributed tracing to correlate requests across services.
Dashboard overload: Creating dashboards with 50+ metrics that no one can understand. Start with RED metrics (Rate, Errors, Duration) for each service. Add more metrics gradually based on actual debugging needs.
Not monitoring client-side: Monitoring only server-side metrics. Slow internet connections and client-side errors are invisible from the server. Implement client-side monitoring with Real User Monitoring (RUM).

Practice Questions

What are the four pillars of API observability?
What are RED metrics and why are they important?
How do you implement a health check endpoint?
What is the difference between p50, p95, and p99 latency?
How do you set up Prometheus alerting rules?

Challenge

Set up a complete monitoring stack for a REST API. Implement: Prometheus metrics for request rate, error rate, and latency (p50, p95, p99); structured JSON logging with pino; a Grafana dashboard with RED-focused panels for all metrics; a health check endpoint that verifies the database and external dependencies; Prometheus alerting rules for an error rate above 1 percent and p95 latency above 500ms; and distributed tracing with OpenTelemetry and Jaeger across a three-service architecture.

FAQ

What is the difference between monitoring and observability? Monitoring is collecting and alerting on known failure modes. Observability is the ability to understand system behavior from the data it produces. Monitoring tells you something is wrong. Observability tells you why.

How often should I check API health? Health checks should run every 5-15 seconds from load balancers. Synthetic monitoring (external checks) should run every 1-5 minutes from multiple geographic locations.

What metrics should I alert on? Alert on error rate (5xx responses), latency (p95 above threshold), availability (down instances), rate limits hit (429 responses), and slow database queries. Start with 5-10 alert rules and add more as you identify failure patterns.

How long should I retain API logs? Keep logs for 30-90 days in hot storage (Elasticsearch) and 1-2 years in cold storage (S3 Glacier). Compliance requirements may mandate longer retention. Use log levels to reduce storage costs.

Do I need APM tools for API monitoring? Application Performance Monitoring (APM) tools like Datadog, New Relic, or Dynatrace provide comprehensive monitoring out of the box. For small teams, Prometheus and Grafana are cost-effective alternatives.
