Learn DevOps: Monitoring Tools: Prometheus, Grafana, Datadog & More

DodaTech Updated Jun 20, 2026 6 min read

Monitoring tools collect metrics, logs, and traces from your infrastructure to detect issues, track performance, and provide observability into every layer of your stack.

What You’ll Learn

Core monitoring concepts: metrics, logs, and traces
Setting up Prometheus for time-series metrics collection
Building Grafana dashboards for visualization
Configuring Datadog for SaaS-based monitoring
Writing alerting rules and defining SLI/SLO targets

Why Monitoring Tools Matter

Without monitoring, you’re flying blind. A server could be running at 100% CPU for hours before anyone notices — and by then, users have already left. Monitoring tools give you real-time visibility, historical trends, and automated alerts so you know about problems before your customers do. DodaTech uses Prometheus and Grafana to monitor Durga Antivirus Pro’s update servers — tracking request rates, error rates, and resource usage across a global fleet of distribution nodes.

    flowchart LR
    A[DevOps Basics] --> B[Monitoring Tools]
    B --> C[Metrics - Prometheus]
    B --> D[Logs - Elastic/Loki]
    B --> E[Traces - Jaeger]
    B --> F[Dashboards - Grafana]
    B --> G[Alerting]
    C --> H[Collect & Store]
    D --> I[Aggregate & Search]
    E --> J[Distributed Tracing]
    style B fill:#e6522c,color:#fff

Prerequisites: Basic Linux and Bash skills. Familiarity with Prometheus and Grafana basics helps.

Core Monitoring Concepts

Monitoring is built on three pillars: metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).

Metrics with Prometheus

Prometheus scrapes metrics from targets at configured intervals and stores them in a time-series database.

# Download and unpack Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar -xzf prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64

# Start Prometheus in the background, keeping 30 days of data
# Note: the bundled prometheus.yml only scrapes Prometheus itself;
# add localhost:9100 as a target to collect node_exporter metrics
./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d &
cd ..

# Download, unpack, and start node_exporter (listens on port 9100)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
./node_exporter &

# Verify the metrics endpoint
curl -s http://localhost:9100/metrics | grep node_cpu_seconds
# Example output (values will differ):
# node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
# node_cpu_seconds_total{cpu="0",mode="system"} 23456.78
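The lines returned by the metrics endpoint follow a simple text format: a metric name, optional labels in braces, and a numeric value. To make that structure concrete, here is a minimal sketch of a parser for a single sample line. The function name parse_sample is an illustrative assumption, and the sketch skips details that a real client handles, such as unescaping label values and reading optional timestamps.

```python
import re

# Matches one sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one name="value" pair inside the braces.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format sample line into (name, labels, value).

    Returns None for blank lines and for comment lines such as # HELP / # TYPE.
    Simplification: escaped characters in label values are left as-is.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)
```

Calling parse_sample on the first output line above yields the metric name, a dictionary with the cpu and mode labels, and the value as a float.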

PromQL Queries

PromQL is Prometheus’s query language for slicing metric data:

# CPU usage percentage per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Request error rate as a percentage (5xx responses over all responses, per instance)
# Both sides are aggregated so their label sets match for the division
sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100

# Example result: one sample per series
# {instance="web-1:9100"} 23.5
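To see what the CPU query computes, it can help to work through rate() by hand. The sketch below is a simplified model, not Prometheus's implementation: it divides the counter increase by the elapsed time and treats any drop in value as a counter reset, while the real rate() also extrapolates to the edges of the window. The names simple_rate and cpu_usage_percent are assumptions for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (the process restarted),
    so the post-reset value is counted as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_samples):
    """Mirror of 100 - rate(idle) * 100 for a single CPU's idle counter."""
    return 100 - simple_rate(idle_samples) * 100
```

For example, an idle counter that grows by 24 seconds over a 30-second window means the CPU was idle 80% of the time, so cpu_usage_percent reports roughly 20.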

Grafana Dashboards

Grafana visualizes Prometheus metrics in real-time dashboards:

{
  "title": "Production Overview",
  "panels": [
    {
      "title": "CPU by Instance",
      "type": "timeseries",
      "targets": [{
        "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }]
    },
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "rate(http_requests_total[5m])",
        "legendFormat": "requests/s"
      }]
    }
  ]
}

Output: Grafana renders panels as time-series graphs. You can add alert thresholds, annotations, and template variables for dynamic dashboards.
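Behind each panel, the data comes from Prometheus's HTTP API. As a rough sketch of that exchange, the code below builds a /api/v1/query_range URL and converts a response body into plottable points. The helper names are assumptions, and the base URL assumes Prometheus on its default port 9090. No network call is made here, so the functions can be tried against a saved response.

```python
import json
from urllib.parse import urlencode


def build_range_query_url(base_url, expr, start, end, step):
    """Build a /api/v1/query_range URL for a PromQL expression."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def series_from_response(body):
    """Turn a query_range JSON body into {label-string: [(ts, value), ...]}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    series = {}
    for item in payload["data"]["result"]:
        key = ",".join(f"{k}={v}" for k, v in sorted(item["metric"].items()))
        # Sample values arrive as strings, so convert them to floats.
        series[key] = [(ts, float(value)) for ts, value in item["values"]]
    return series
```

Each key in the returned dictionary corresponds to one line on the graph, which is why a legend template such as {{ instance }} produces one legend entry per instance.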

Alerting Rules

groups:
  - name: production_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU over 80% on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.instance }}"
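The for: field is what keeps a brief spike from paging anyone: an alert stays pending until its expression has matched continuously for the whole duration. The following sketch simulates that state machine under simplifying assumptions. It assumes a fixed evaluation interval of 60 seconds (Prometheus's default) and ignores options such as keep_firing_for. The function name alert_states is hypothetical.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=60):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, one per evaluation, True when expr matched.
    The alert is 'pending' from the first match and turns 'firing' once the
    condition has held continuously for at least for_seconds.
    """
    states = []
    active_for = None  # seconds the condition has held, None when inactive
    for matched in condition_by_eval:
        if not matched:
            # Any evaluation that does not match resets the timer.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + eval_interval
        states.append("firing" if active_for >= for_seconds else "pending")
    return states
```

With for_seconds=120, three consecutive matching evaluations are needed before the alert fires, and a single non-matching evaluation sends it back to inactive.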

Datadog Integration

Datadog is a SaaS monitoring platform with built-in integrations for hundreds of services:

# Install Datadog agent on a server
DD_API_KEY=your_api_key DD_SITE="datadoghq.com" bash -c \
  "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

# Configure integrations in /etc/datadog-agent/conf.d/
# Example: /etc/datadog-agent/conf.d/postgres.d/conf.yaml (a YAML file, not a shell command)
init_config:
instances:
  - host: localhost
    port: 5432
    username: datadog
    password: your_password

Output: Datadog automatically discovers running services and begins collecting metrics. The web UI shows pre-built dashboards for PostgreSQL, NGINX, AWS, and 600+ integrations.
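Beyond the built-in integrations, applications can send their own metrics to the Agent through DogStatsD, which by default listens for plain-text datagrams on UDP port 8125. Here is a minimal sketch of that wire format using only the standard library. The helper names and the metric name app.queue.depth are assumptions for illustration, and in practice Datadog's official client library would normally be used instead.

```python
import socket


def dogstatsd_datagram(name, value, metric_type="g", tags=None):
    """Format one DogStatsD datagram: name:value|type|#tag1:v1,tag2:v2."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the local Agent's DogStatsD listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))
```

A gauge would be sent with send_metric(dogstatsd_datagram("app.queue.depth", 42, "g", {"env": "prod"})). Because UDP is connectionless, the call does not fail if no Agent is listening.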

Choosing the Right Tool

Tool	Type	Best For	Cost
Prometheus	Open-source metrics	Kubernetes, custom metrics, on-prem	Free
Grafana	Visualization	Dashboards, multi-source, alerting	Free / Cloud paid
Datadog	SaaS full-stack	All-in-one, teams, 600+ integrations	Per-host pricing
New Relic	SaaS APM	Application performance monitoring	Per-GB ingestion
Nagios	Legacy monitoring	Simple check-based monitoring	Free

Common Mistakes

Alert fatigue from poorly tuned thresholds: Setting alerts that fire too often trains teams to ignore them. Use for: clauses to require sustained violations, and set distinct severity levels.
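Not using recording rules for expensive queries: Queries like histogram_quantile(0.99, ...) over many series are slow to run on every dashboard refresh. Pre-compute them with recording rules, which run at each rule evaluation interval and store the result as a new series.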
Scraping too frequently: A 1-second scrape interval on thousands of targets generates enormous data volume. 15-30 seconds is sufficient for most infrastructure metrics.
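Ignoring label cardinality: Every unique combination of label values creates a separate time series, so labels with many unique values (user IDs, email addresses) cause Prometheus to consume excessive memory. As a rough guide, keep cardinality under 100,000 series per metric.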
Not setting retention limits: By default Prometheus keeps 15 days of data and has no cap on disk size. Set retention explicitly for your capacity, for example --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=50GB.
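The scrape-interval and cardinality points above are easier to judge with numbers. This back-of-the-envelope sketch estimates the worst-case series count for one metric and the daily sample volume for a fleet. The function names and the example figures are assumptions chosen for illustration, not measurements.

```python
from math import prod


def series_count(label_values):
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def samples_per_day(targets, series_per_target, scrape_interval_s):
    """Samples ingested per day for a fleet scraped at a fixed interval."""
    scrapes_per_day = 86_400 // scrape_interval_s
    return targets * series_per_target * scrapes_per_day
```

With 20 endpoints, 5 status codes, and 10 instances, a metric has at most 1,000 series. Adding a user_id label with 50,000 values raises that ceiling to 50,000,000. Likewise, 1,000 targets exposing 500 series each produce 2.88 billion samples per day at a 15-second interval, against 43.2 billion at 1 second.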

Practice Questions

What are the three pillars of observability? Answer: Metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).
How does Prometheus collect metrics? Answer: Prometheus uses a pull model — it scrapes HTTP endpoints (targets) at configured intervals. Exporters translate third-party system metrics into Prometheus format.
What is the difference between a counter and a gauge in Prometheus? Answer: A counter only increases (request count, errors). A gauge goes up and down (CPU usage, memory). Use rate() on counters, not gauges.
When should you choose Datadog over Prometheus + Grafana? Answer: When you want an all-in-one SaaS solution with 600+ integrations, built-in APM, and less operational overhead managing your own monitoring stack.

Challenge

Set up a complete monitoring stack: install Prometheus and node_exporter on three servers, configure Grafana with a dashboard showing CPU, memory, and disk usage across all servers, create alerting rules for high CPU (>80% for 10m) and low disk space (<10% free), and route alerts to Slack via Alertmanager.

FAQ

What is the difference between monitoring and observability?

Monitoring tells you when something is wrong (known unknowns). Observability lets you ask why it’s wrong (unknown unknowns), which requires metrics, logs, and traces together.

Can Prometheus monitor Docker containers?

Yes. Use cAdvisor for container-level metrics, or Prometheus’s built-in Docker service discovery (docker_sd_configs) to find running containers to scrape.

How does Datadog pricing work?

Datadog charges per host per month for infrastructure monitoring, plus additional costs for APM, logs, and advanced features. Pricing scales with volume.

What is the difference between Prometheus and InfluxDB?

Prometheus is pull-based with built-in alerting rules. InfluxDB is push-based and supports the SQL-like InfluxQL as well as the functional Flux language. Prometheus is dominant for Kubernetes monitoring.

How do I handle high availability for Prometheus?

Prometheus doesn’t natively cluster. For HA, run two identical Prometheus servers and use Thanos for global querying, long-term storage, and downsampling.

Mini Project: Monitor a Web Application

# custom_exporter.py
# Expose custom application metrics for Prometheus
from prometheus_client import start_http_server, Counter, Histogram
import time
import random

REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['endpoint', 'status'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request duration', ['endpoint'])

def process_request(endpoint):
    with REQUEST_DURATION.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.5))
        status = random.choice(['200', '200', '200', '500', '404'])
        REQUEST_COUNT.labels(endpoint=endpoint, status=status).inc()
        return status

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request('/api/users')
        process_request('/api/items')
        time.sleep(1)

Expected output: Prometheus scrapes http://localhost:8000/metrics for app_requests_total and app_request_duration_seconds. Grafana visualizes request rate (via rate(app_requests_total[5m])) and latency percentiles (via histogram_quantile(0.99, ...)).
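The latency percentiles come from the histogram's cumulative buckets rather than from raw observations. As a sketch of the idea behind histogram_quantile, the function below finds the bucket containing the requested rank and interpolates linearly inside it. It is a simplified model that assumes the buckets are already aggregated and sorted. The name estimate_quantile and the bucket values in the usage note are illustrative.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with (math.inf, total_count), like *_bucket series with le labels.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # In the +Inf bucket (or an empty one) report the lower bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

Given buckets [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)], the 99th percentile lands in the 0.5 to 1.0 bucket and is estimated at about 0.9 seconds. The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for latency SLOs.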

What’s Next

Topic	Description
Centralized Logging	Log aggregation with ELK and Loki
Prometheus & Grafana	Deeper dive into the Prometheus stack

Related topics: Prometheus, Grafana, Datadog, SLI

Congratulations on completing this Monitoring Tools tutorial! Here’s where to go from here:

Practice daily — Set up Prometheus monitoring on a personal project
Build a project — Create a Grafana dashboard for your application
Explore related topics — Check out centralized logging and SRE fundamentals

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
