Monitoring Tools: Prometheus, Grafana, Datadog & More
Monitoring tools collect metrics, logs, and traces from your infrastructure to detect issues, track performance, and provide observability into every layer of your stack.
What You’ll Learn
- Core monitoring concepts: metrics, logs, and traces
- Setting up Prometheus for time-series metrics collection
- Building Grafana dashboards for visualization
- Configuring Datadog for SaaS-based monitoring
- Writing alerting rules and defining SLI/SLO targets
Why Monitoring Tools Matter
Without monitoring, you’re flying blind. A server could be running at 100% CPU for hours before anyone notices — and by then, users have already left. Monitoring tools give you real-time visibility, historical trends, and automated alerts so you know about problems before your customers do. DodaTech uses Prometheus and Grafana to monitor Durga Antivirus Pro’s update servers — tracking request rates, error rates, and resource usage across a global fleet of distribution nodes.
flowchart LR
A[DevOps Basics] --> B[Monitoring Tools]
B --> C[Metrics - Prometheus]
B --> D[Logs - Elastic/Loki]
B --> E[Traces - Jaeger]
B --> F[Dashboards - Grafana]
B --> G[Alerting]
C --> H[Collect & Store]
D --> I[Aggregate & Search]
E --> J[Distributed Tracing]
style B fill:#e6522c,color:#fff
Core Monitoring Concepts
Monitoring is built on three pillars: metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).
Metrics with Prometheus
Prometheus scrapes metrics from targets at configured intervals and stores them in a time-series database.
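Each scrape is an HTTP GET that returns plain-text sample lines such as node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89. As a rough illustration of what the server reads from a target, the sketch below parses one such line into a name, a label set, and a value. The helper name parse_sample is hypothetical, and the regular expressions cover only the common case (escape sequences inside label values are left as-is), so treat it as a reading aid rather than a full parser.

```python
import re

# One sample line looks like: metric_name{label="value",...} 123.4 [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # "# HELP" and "# TYPE" lines carry metadata, not samples
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89'))
```

The name plus its label set identifies one time series; every scrape appends a new (timestamp, value) point to it.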
# Install and run Prometheus with node_exporter
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar -xzf prometheus-*.tar.gz && cd prometheus-*/
# Start Prometheus
./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d &
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-*.tar.gz && cd node_exporter-*/
./node_exporter &
# Verify metrics endpoint
curl http://localhost:9100/metrics | grep node_cpu_seconds
# Output:
# node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
# node_cpu_seconds_total{cpu="0",mode="system"} 23456.78
PromQL Queries
PromQL is Prometheus’s query language for slicing metric data:
# CPU usage percentage per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory used percentage (100% minus the available share)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
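# Request error rate as a percentage (5xx responses over all responses)
# sum() drops the status label so both sides of the division match
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100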
# Output: time-series data points at each scrape interval
# {instance="web-1:9100"} 23.5
Grafana Dashboards
Grafana visualizes Prometheus metrics in real-time dashboards:
{
  "title": "Production Overview",
  "panels": [
    {
      "title": "CPU by Instance",
      "type": "graph",
      "targets": [{
        "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }]
    },
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [{
        "expr": "rate(http_requests_total[5m])",
        "legendFormat": "requests/s"
      }]
    }
  ]
}
Output: Grafana renders panels as time-series graphs. You can add alert thresholds, annotations, and template variables for dynamic dashboards.
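Hand-writing dashboard JSON gets error-prone once PromQL expressions contain quotes. One possible approach, sketched here, is to build the same simplified structure as Python dictionaries and let json.dumps handle the escaping. The make_panel helper and the CPU_EXPR constant are hypothetical names, and the output mirrors only the trimmed-down JSON shown above, not the full Grafana dashboard model.

```python
import json

CPU_EXPR = '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'


def make_panel(title, expr, legend_format):
    """Build one panel dict in the simplified shape used above (hypothetical helper)."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": legend_format}],
    }


dashboard = {
    "title": "Production Overview",
    "panels": [
        make_panel("CPU by Instance", CPU_EXPR, "{{ instance }}"),
        make_panel("Request Rate", "rate(http_requests_total[5m])", "requests/s"),
    ],
}

# json.dumps escapes the quotes inside the PromQL expression for us.
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Generating panels from a helper also keeps panel settings consistent when the same layout is repeated for many services.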
Alerting Rules
groups:
  - name: production_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU over 80% on {{ $labels.instance }}"
      - alert: HighErrorRate
        # Aggregate by instance so the 5xx series and the total series share the same labels
        expr: sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.instance }}"
Datadog Integration
Datadog is a SaaS monitoring platform with built-in integrations for hundreds of services:
# Install Datadog agent on a server
DD_API_KEY=your_api_key DD_SITE="datadoghq.com" bash -c \
"$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
# Configure integrations in /etc/datadog-agent/conf.d/
# Example: postgres.yaml
init_config:
instances:
  - host: localhost
    port: 5432
    username: datadog
    password: your_password
Output: Datadog automatically discovers running services and begins collecting metrics. The web UI shows pre-built dashboards for PostgreSQL, NGINX, AWS, and 600+ integrations.
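Beyond the built-in integrations, applications can push their own metrics to the local agent over the DogStatsD protocol, which uses plain-text UDP datagrams of the form name:value|type|#tag:value. The sketch below formats and sends such a datagram with only the standard library. It assumes the agent's DogStatsD listener is enabled on its default UDP port 8125; the metric name, tags, and both helper names are hypothetical. In practice the official Datadog client library is the usual choice.

```python
import socket


def dogstatsd_datagram(name, value, metric_type="g", tags=None):
    """Format one DogStatsD datagram, e.g. app.queue_depth:42|g|#env:prod.

    metric_type is "g" for a gauge or "c" for a counter.
    """
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send the datagram over UDP; there is no reply and no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


print(dogstatsd_datagram("app.queue_depth", 42, "g", {"env": "prod", "service": "updates"}))
```

Calling send_metric(dogstatsd_datagram("app.queue_depth", 42)) from application code would submit one gauge value. Because UDP is fire-and-forget, a stopped agent does not block or crash the application.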
Choosing the Right Tool
| Tool | Type | Best For | Cost |
|---|---|---|---|
| Prometheus | Open-source metrics | Kubernetes, custom metrics, on-prem | Free |
| Grafana | Visualization | Dashboards, multi-source, alerting | Free / Cloud paid |
| Datadog | SaaS full-stack | All-in-one, teams, 600+ integrations | Per-host pricing |
| New Relic | SaaS APM | Application performance monitoring | Per-GB ingestion |
| Nagios | Legacy monitoring | Simple check-based monitoring | Free |
Common Mistakes
- Alert fatigue from poorly tuned thresholds: Alerts that fire too often train teams to ignore them. Use for: clauses to require sustained violations, and set distinct severity levels.
- Not using recording rules for expensive queries: Queries like histogram_quantile(0.99, ...) are slow to evaluate over many series. Pre-compute them with recording rules, which run at each rule evaluation interval.
- Scraping too frequently: A 1-second scrape interval on thousands of targets generates enormous data volume. 15-30 seconds is sufficient for most infrastructure metrics.
- Ignoring label cardinality: Labels with many unique values (user IDs, email addresses) cause Prometheus to consume excessive memory. As a rough guideline, keep cardinality under 100,000 series per metric.
- Not setting retention limits: Prometheus keeps 15 days of data by default and applies no size limit, so disk usage can grow unchecked. Set --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=50GB.
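To see why a for: duration cuts down on noisy alerts, it helps to trace how an alert moves from pending to firing. The sketch below is a simplified model, not Prometheus code: it assumes one rule evaluation per sample, and the function name firing_timestamps is hypothetical. An alert becomes pending when the condition first holds, fires once it has held for the full duration, and resets on any dip below the threshold.

```python
def firing_timestamps(samples, threshold, for_seconds):
    """Return the evaluation timestamps at which the alert would be firing.

    samples: list of (timestamp_seconds, value) pairs in time order.
    """
    pending_since = None
    firing = []
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= for_seconds:
                firing.append(timestamp)  # held long enough: firing
        else:
            pending_since = None  # any dip resets the timer
    return firing


# CPU percentage evaluated once a minute, with a brief dip at t=120.
cpu = [(0, 90), (60, 90), (120, 50), (180, 90), (240, 90), (300, 90), (360, 90)]
print(firing_timestamps(cpu, threshold=80, for_seconds=120))  # [300, 360]
```

Without the for: duration, the same data would alert at t=0 and again at t=180, which is the flapping behavior that leads to alert fatigue.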
Practice Questions
What are the three pillars of observability? Answer: Metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).
How does Prometheus collect metrics? Answer: Prometheus uses a pull model — it scrapes HTTP endpoints (targets) at configured intervals. Exporters translate third-party system metrics into Prometheus format.
What is the difference between a counter and a gauge in Prometheus? Answer: A counter only increases, apart from resetting to zero on restart (request count, errors). A gauge goes up and down (CPU usage, memory). Use rate() on counters, not gauges.
When should you choose Datadog over Prometheus + Grafana? Answer: When you want an all-in-one SaaS solution with 600+ integrations, built-in APM, and less operational overhead than managing your own monitoring stack.
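The counter versus gauge distinction becomes clearer when you see why rate() only makes sense on counters. The sketch below approximates a per-second rate from raw counter samples and treats any drop in value as a restart. It is a simplified model under stated assumptions: the function name counter_rate is hypothetical, and the real rate() also extrapolates to the edges of the time window, which is omitted here.

```python
def counter_rate(samples):
    """Approximate the per-second increase of a counter.

    samples: list of (timestamp_seconds, value) pairs in time order.
    A drop in value is treated as a counter reset (for example a process
    restart), so the value after the reset counts as new increase.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += (current - previous) if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


steady = [(0, 100), (15, 130), (30, 160)]
print(counter_rate(steady))  # 2.0 requests per second
```

Applying the same logic to a gauge would misread every normal decrease as a reset, which is why gauges are queried directly or with functions such as avg_over_time instead.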
Challenge
Set up a complete monitoring stack: install Prometheus and node_exporter on three servers, configure Grafana with a dashboard showing CPU/memory/disk across all servers, create alerting rules for high CPU (>80% for 10m) and low disk (<10%), route alerts to Slack via Alertmanager.
Mini Project: Monitor a Web Application
# custom_exporter.py
# Expose custom application metrics for Prometheus
from prometheus_client import start_http_server, Counter, Histogram
import time
import random

REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['endpoint', 'status'])
REQUEST_DURATION = Histogram('app_request_duration_seconds', 'Request duration', ['endpoint'])

def process_request(endpoint):
    # Time the simulated work and record it in the histogram
    with REQUEST_DURATION.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.5))
    # Pick a simulated status code; most requests succeed
    status = random.choice(['200', '200', '200', '500', '404'])
    REQUEST_COUNT.labels(endpoint=endpoint, status=status).inc()
    return status

if __name__ == '__main__':
    # Serve metrics at http://localhost:8000/metrics
    start_http_server(8000)
    while True:
        process_request('/api/users')
        process_request('/api/items')
        time.sleep(1)
Expected output: Prometheus scrapes http://localhost:8000/metrics for app_requests_total and app_request_duration_seconds. Grafana visualizes request rate (via rate(app_requests_total[5m])) and latency percentiles (via histogram_quantile(0.99, ...)).
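The latency percentiles mentioned above are estimates, because a histogram stores only cumulative bucket counts rather than individual durations. The sketch below shows the general idea behind histogram_quantile: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified model, not the Prometheus implementation. The function name estimate_quantile and the sample bucket data are hypothetical, and it rejects q values outside (0, 1] instead of reproducing every edge case.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    ending with the math.inf bucket, like the le="..." series of a histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this sketch")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Beyond the last finite bucket: report that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# 100 requests: 50 finished within 0.1s, 90 within 0.5s, all within 1.0s.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, latency_buckets))  # 0.75
```

The result depends on the bucket boundaries, so boundaries placed near your latency targets give more useful percentiles.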
What’s Next
- Log aggregation with ELK and Loki
- Deeper dive into the Prometheus stack
Related topics: Prometheus, Grafana, Datadog, SLI
Congratulations on completing this Monitoring Tools tutorial! Here’s where to go from here:
- Practice daily — Set up Prometheus monitoring on a personal project
- Build a project — Create a Grafana dashboard for your application
- Explore related topics — Check out centralized logging and SRE fundamentals
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.