Monitoring Tools: Prometheus, Grafana, Datadog & More

DodaTech Updated Jun 20, 2026 6 min read

Monitoring tools collect metrics, logs, and traces from your infrastructure to detect issues, track performance, and provide observability into every layer of your stack.

What You’ll Learn

  • Core monitoring concepts: metrics, logs, and traces
  • Setting up Prometheus for time-series metrics collection
  • Building Grafana dashboards for visualization
  • Configuring Datadog for SaaS-based monitoring
  • Writing alerting rules and defining SLI/SLO targets

Why Monitoring Tools Matter

Without monitoring, you’re flying blind. A server could be running at 100% CPU for hours before anyone notices — and by then, users have already left. Monitoring tools give you real-time visibility, historical trends, and automated alerts so you know about problems before your customers do. DodaTech uses Prometheus and Grafana to monitor Durga Antivirus Pro’s update servers — tracking request rates, error rates, and resource usage across a global fleet of distribution nodes.

    flowchart LR
    A[DevOps Basics] --> B[Monitoring Tools]
    B --> C[Metrics - Prometheus]
    B --> D[Logs - Elastic/Loki]
    B --> E[Traces - Jaeger]
    B --> F[Dashboards - Grafana]
    B --> G[Alerting]
    C --> H[Collect & Store]
    D --> I[Aggregate & Search]
    E --> J[Distributed Tracing]
    style B fill:#e6522c,color:#fff
  
Prerequisites: Basic Linux and Bash skills. Familiarity with Prometheus and Grafana basics helps.

Core Monitoring Concepts

Monitoring is built on three pillars: metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).

Metrics with Prometheus

Prometheus scrapes metrics from targets at configured intervals and stores them in a time-series database.

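# Install and run Prometheus with node_exporter
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
# Use the exact directory name: a prometheus-* glob would also match the tarball
tar -xzf prometheus-2.47.0.linux-amd64.tar.gz && cd prometheus-2.47.0.linux-amd64

# Start Prometheus in the background with 30 days of retention
# Note: the bundled prometheus.yml only scrapes Prometheus itself;
# add localhost:9100 under scrape_configs so it also scrapes node_exporter
./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d &

# Install node_exporter alongside the Prometheus directory
cd ..
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz && cd node_exporter-1.7.0.linux-amd64
./node_exporter &

# Verify metrics endpoint
curl -s http://localhost:9100/metrics | grep node_cpu_seconds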
# Output:
# node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
# node_cpu_seconds_total{cpu="0",mode="system"} 23456.78
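
To make the scrape format concrete, here is a minimal sketch of how the lines returned by a /metrics endpoint break down into a metric name, a label set, and a value. It uses only the Python standard library; the function names parse_sample and parse_exposition are illustrative, not part of any Prometheus library, and the parser ignores edge cases such as escaped characters in label values.

```python
import re

# Matches: metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def parse_exposition(text):
    """Parse a scrape body, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        samples.append(parse_sample(line))
    return samples


# Sample body modeled on the node_exporter output shown above.
body = '''# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
node_cpu_seconds_total{cpu="0",mode="system"} 23456.78
'''

for name, labels, value in parse_exposition(body):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is stored as its own time series, which is why label choices matter so much for Prometheus resource usage.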

PromQL Queries

PromQL is Prometheus’s query language for slicing metric data:

# CPU usage percentage per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

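# Memory used percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Request error rate per instance (5xx responses as a share of all requests)
# Aggregate both sides first; otherwise each 5xx series is divided only by itself
sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100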

# Output: time-series data points at each scrape interval
# {instance="web-1:9100"} 23.5
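
The rate() function used in these queries can be sketched in a few lines of Python. This is a simplified model, assuming evenly spaced samples and skipping the window-boundary extrapolation that Prometheus applies; the function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, treating any drop as a counter reset."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += curr - prev if curr >= prev else curr
    return increase


def simple_rate(samples):
    """Per-second rate over the window covered by the samples (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical idle-CPU counter scraped every 60 s over a 5-minute window.
idle = [(0, 100.0), (60, 157.0), (120, 214.0), (180, 271.0), (240, 328.0), (300, 385.0)]

idle_fraction = simple_rate(idle)  # idle seconds accumulated per wall-clock second
cpu_usage_percent = 100 - idle_fraction * 100
print(f"{cpu_usage_percent:.1f}% busy")
```

Because rate() works on the increase between samples, it only makes sense for counters, and the reset handling is what keeps a process restart from showing up as a large negative spike.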

Grafana Dashboards

Grafana visualizes Prometheus metrics in real-time dashboards:

{
  "title": "Production Overview",
  "panels": [
    {
      "title": "CPU by Instance",
      "type": "timeseries",
      "targets": [{
        "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }]
    },
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "rate(http_requests_total[5m])",
        "legendFormat": "requests/s"
      }]
    }
  ]
}

Output: Grafana renders panels as time-series graphs. You can add alert thresholds, annotations, and template variables for dynamic dashboards.

Alerting Rules

groups:
  - name: production_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU over 80% on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.instance }}"
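
The effect of the for: clause in these rules can be modeled as a small state machine: an alert is pending while its expression has been true for less than the configured duration, and it fires only once the condition has held continuously for that long. The sketch below is an illustrative model, not Prometheus code; alert_states and the CPU readings are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Yield (timestamp, state) for each (timestamp, condition_true) rule evaluation."""
    active_since = None  # time the condition first became true in the current streak
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the streak.
            active_since = None
            yield timestamp, "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_seconds else "pending"
        yield timestamp, state


# Hypothetical CPU readings (percent), one rule evaluation every 60 s.
cpu = [70, 85, 90, 88, 92, 75, 91]
evaluations = [(i * 60, value > 80) for i, value in enumerate(cpu)]

for timestamp, state in alert_states(evaluations, for_seconds=120):
    print(f"t={timestamp:>3}s  {state}")
```

In this run the alert stays pending for two evaluations before firing, and the single dip to 75% resets it, so a brief spike never pages anyone.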

Datadog Integration

Datadog is a SaaS monitoring platform with built-in integrations for hundreds of services:

# Install Datadog agent on a server
DD_API_KEY=your_api_key DD_SITE="datadoghq.com" bash -c \
  "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

# Configure integrations under /etc/datadog-agent/conf.d/
# Example: /etc/datadog-agent/conf.d/postgres.d/conf.yaml
init_config:
instances:
  - host: localhost
    port: 5432
    username: datadog
    password: your_password

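Output: Once the Agent is running, it reports host metrics (CPU, memory, disk, network) without further setup. Integrations such as PostgreSQL start reporting after you add their config file and restart the Agent. The web UI shows pre-built dashboards for PostgreSQL, NGINX, AWS, and 600+ integrations.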
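
Beyond the built-in integrations, the Agent also accepts custom metrics over DogStatsD, a UDP protocol it listens for on port 8125 by default. The following is a minimal sketch using only the standard library, assuming an Agent with DogStatsD enabled on the same host; the metric name, tags, and helper functions are hypothetical.

```python
import socket


def format_metric(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: name:value|type|#tag1:v1,tag2:v2"""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(f"{key}:{val}" for key, val in tags.items())
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; delivery is fire-and-forget, so nothing is returned."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# "g" marks a gauge; "c" would mark a counter.
payload = format_metric("app.queue.depth", 42, "g", {"env": "staging", "service": "updates"})
print(payload)
# send_metric(payload)  # uncomment on a host where the Agent is running
```

In practice the official Datadog client libraries wrap this protocol and add buffering, so a hand-rolled sender like this is mainly useful for understanding what goes over the wire.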

Choosing the Right Tool

| Tool | Type | Best For | Cost |
|---|---|---|---|
| Prometheus | Open-source metrics | Kubernetes, custom metrics, on-prem | Free |
| Grafana | Visualization | Dashboards, multi-source, alerting | Free / Cloud paid |
| Datadog | SaaS full-stack | All-in-one, teams, 600+ integrations | Per-host pricing |
| New Relic | SaaS APM | Application performance monitoring | Per-GB ingestion |
| Nagios | Legacy monitoring | Simple check-based monitoring | Free |

Common Mistakes

  1. Alert fatigue from poorly tuned thresholds: Setting alerts that fire too often trains teams to ignore them. Use for: clauses to require sustained violations, and set distinct severity levels.

  2. Not using recording rules for expensive queries: Queries like histogram_quantile(0.99, ...) are slow on large datasets. Pre-compute them with recording rules, which Prometheus evaluates at every rule evaluation interval and stores as new series.

  3. Scraping too frequently: A 1-second scrape interval on thousands of targets generates enormous data volume. 15-30 seconds is sufficient for most infrastructure metrics.

  4. Ignoring label cardinality: Labels with many unique values (user IDs, email addresses) cause Prometheus to consume excessive memory. Keep cardinality under 100,000 per metric.

  5. Not setting retention limits: By default Prometheus keeps 15 days of data and applies no size cap, so disk usage grows with the number of series. Set --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=50GB to match your disk budget.
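
The scrape-interval and cardinality mistakes above come down to simple arithmetic, sketched here with hypothetical numbers. The worst case for one metric is the product of the distinct values of each label, and daily ingest scales inversely with the scrape interval; both helper functions are illustrative.

```python
from math import prod


def series_count(distinct_values_per_label):
    """Worst-case series for one metric: the product of distinct values per label."""
    return prod(distinct_values_per_label.values())


def samples_per_day(targets, series_per_target, scrape_interval_s):
    """Samples ingested per day across all targets at a given scrape interval."""
    return targets * series_per_target * (86_400 // scrape_interval_s)


# Hypothetical label sets for an HTTP request counter.
bounded = {"endpoint": 20, "status": 5, "method": 4}
with_user_id = {**bounded, "user_id": 10_000}

print(series_count(bounded))       # bounded labels stay small
print(series_count(with_user_id))  # one unbounded label multiplies everything

# Hypothetical fleet: 1,000 targets exposing 500 series each.
print(samples_per_day(1000, 500, 15))
print(samples_per_day(1000, 500, 1))
```

Adding a single user_id label turns 400 possible series into 4,000,000, and dropping the interval from 15 seconds to 1 second multiplies ingest by 15, which is why both settings deserve a deliberate choice.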

Practice Questions

  1. What are the three pillars of observability? Answer: Metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).

  2. How does Prometheus collect metrics? Answer: Prometheus uses a pull model — it scrapes HTTP endpoints (targets) at configured intervals. Exporters translate third-party system metrics into Prometheus format.

  3. What is the difference between a counter and a gauge in Prometheus? Answer: A counter only increases (request count, errors). A gauge goes up and down (CPU usage, memory). Use rate() on counters, not gauges.

  4. When should you choose Datadog over Prometheus + Grafana? Answer: When you want an all-in-one SaaS solution with 600+ integrations, built-in APM, and less operational overhead managing your own monitoring stack.

Challenge

Set up a complete monitoring stack: install Prometheus and node_exporter on three servers, configure Grafana with a dashboard showing CPU, memory, and disk across all servers, create alerting rules for high CPU (>80% for 10m) and low disk (<10% free), and route alerts to Slack via Alertmanager.

FAQ

What is the difference between monitoring and observability?
: Monitoring tells you when something is wrong (known unknowns). Observability lets you ask why it’s wrong (unknown unknowns) — it requires metrics, logs, and traces together.
Can Prometheus monitor Docker containers?
: Yes. Use cAdvisor for container-level metrics, or Prometheus’s built-in Docker service discovery (docker_sd_configs) to find running containers to scrape.
How does Datadog pricing work?
: Datadog charges per host per month for infrastructure monitoring, plus additional costs for APM, logs, and advanced features. Pricing scales with volume.
What is the difference between Prometheus and InfluxDB?
: Prometheus is pull-based with built-in alerting rules. InfluxDB is push-based and is queried with InfluxQL (SQL-like) or Flux (a functional scripting language). Prometheus is dominant for Kubernetes monitoring.
How do I handle high availability for Prometheus?
: Prometheus doesn’t natively cluster. For HA, run two identical Prometheus servers and use Thanos for global querying, long-term storage, and downsampling.

Mini Project: Monitor a Web Application

# custom_exporter.py
# Expose custom application metrics for Prometheus
from prometheus_client import start_http_server, Counter, Histogram
import time
import random

REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['endpoint', 'status'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request duration', ['endpoint'])

def process_request(endpoint):
    with REQUEST_DURATION.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.5))
        status = random.choice(['200', '200', '200', '500', '404'])
        REQUEST_COUNT.labels(endpoint=endpoint, status=status).inc()
        return status

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request('/api/users')
        process_request('/api/items')
        time.sleep(1)

Expected output: Prometheus scrapes http://localhost:8000/metrics for app_requests_total and app_request_duration_seconds. Grafana visualizes request rate (via rate(app_requests_total[5m])) and latency percentiles (via histogram_quantile(0.99, ...)).
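
Latency percentiles from a histogram are estimates, not exact values: each bucket only stores a cumulative count of observations at or below its upper bound, and the quantile is found by linear interpolation inside the bucket where the target rank falls. The sketch below approximates that logic in plain Python under simplifying assumptions (classic buckets, a lower bound of 0 for the first bucket); the bucket counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets sorted by bound.

    The last bucket must be the +Inf bucket, as in Prometheus histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for app_request_duration_seconds.
buckets = [(0.1, 50), (0.25, 80), (0.5, 100), (math.inf, 100)]

print(f"p50 ~ {histogram_quantile(0.5, buckets):.4f}s")
print(f"p99 ~ {histogram_quantile(0.99, buckets):.4f}s")
```

The accuracy of the result depends on bucket boundaries: if most requests land in one wide bucket, the interpolated percentile can be far from the true value, so it helps to choose buckets close to the latencies you care about.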

What’s Next

| Topic | Description |
|---|---|
| Centralized Logging | Log aggregation with ELK and Loki |
| Prometheus & Grafana | Deeper dive into the Prometheus stack |

Related topics: Prometheus, Grafana, Datadog, SLI

Congratulations on completing this Monitoring Tools tutorial! Here’s where to go from here:

  • Practice daily — Set up Prometheus monitoring on a personal project
  • Build a project — Create a Grafana dashboard for your application
  • Explore related topics — Check out centralized logging and SRE fundamentals

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
