Prometheus and Grafana Guide — Monitoring and Observability
Prometheus is an open-source monitoring system that collects time-series metrics from your infrastructure, while Grafana turns those metrics into dashboards and alerts — together forming the industry-standard stack for observability.
What You’ll Learn
- Installing and configuring Prometheus and node_exporter
- Collecting system metrics with Prometheus exporters
- Querying metrics with PromQL for insights
- Building Grafana dashboards from Prometheus data
- Setting up alerting rules in Prometheus and Grafana
Why Prometheus and Grafana Matter
When a server goes down or an API starts responding slowly, you need to know immediately — and you need data to find the root cause. Prometheus scrapes metrics on a schedule and stores them efficiently, while Grafana visualizes those metrics in real-time dashboards. Together, they replace expensive proprietary monitoring tools with a flexible, open-source stack that scales from a single server to a global infrastructure. DodaTech uses Prometheus and Grafana to monitor Durga Antivirus Pro’s update servers — tracking request rates, error rates, and server resource usage across a global fleet of update distribution nodes.
flowchart LR
A[Bash & Linux Basics] --> B[Prometheus & Grafana]
B --> C[Prometheus Server]
B --> D[Exporters]
B --> E[PromQL]
B --> F[Grafana Dashboards]
C --> G[Time-Series DB]
D --> H[Metrics Collection]
E --> I[Querying & Aggregation]
F --> J[Visualization & Alerting]
style B fill:#e6522c,color:#fff
Core Concepts
Installing the Stack
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar -xzf prometheus-2.47.0.linux-amd64.tar.gz
# Use the explicit directory name: prometheus-* would also match the tarball and make cd fail
cd prometheus-2.47.0.linux-amd64
./prometheus --config.file=prometheus.yml &
cd ..
# Install node_exporter (system metrics)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
./node_exporter &
# node_exporter exposes metrics at http://localhost:9100/metrics
curl -s http://localhost:9100/metrics | head -20
# Output:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode
# TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s # How often to scrape metrics
  evaluation_interval: 15s # How often to evaluate rules
# Targets to scrape
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets:
          - "server1:9100"
          - "server2:9100"
          - "server3:9100"
  - job_name: "api"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["api.example.com:3000"]
PromQL Queries
PromQL is the query language for Prometheus:
# CPU usage (percentage) — rate of non-idle CPU time
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Output:
# {instance="server1:9100"} 23.45
# {instance="server2:9100"} 12.78
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk space remaining (percentage)
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# HTTP request rate (over last 5 minutes)
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
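# API error rate (percentage); sum() both sides so 5xx requests are divided by all requests,
# otherwise each 5xx series is only matched against itself and the result is always 100
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Output: PromQL returns time-series data that Grafana visualizes as graphs, gauges, and tables. The queries aggregate across instances and time windows.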
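To make the behaviour of rate() more concrete, here is a minimal Python sketch of the idea behind it: add up the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. It is a simplified model, not Prometheus's implementation (the real function also extrapolates to the edges of the window), and the function name simple_rate and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second average increase of a counter across the given samples.

    Simplified model of PromQL's rate(): sums the increases between
    consecutive samples and treats a drop in value as a counter reset.
    Assumes timestamps are strictly increasing. Prometheus additionally
    extrapolates to the window boundaries, which this sketch leaves out.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")

    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter reset (e.g. a process restart): the counter started
            # again from zero, so the whole current value is new increase.
            increase += curr

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical http_requests_total samples scraped every 15s,
# with a restart between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]
print(simple_rate(window))  # 110 requests over 60 seconds
```

Without the reset handling, the drop from 160 to 20 would show up as a large negative rate, which is why counters should be queried through rate() rather than by subtracting raw values.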
Grafana Dashboards
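# Install Grafana (Debian/Ubuntu) from the Grafana APT repository
sudo apt-get install -y apt-transport-https software-properties-common wget
# Import the repository signing key; without it apt refuses the repository
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
# Start Grafana
sudo systemctl start grafana-server
# Access at http://localhost:3000 (default login admin/admin; change the password on first login)
// Grafana dashboard JSON model (simplified)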
{
  "title": "Server Overview",
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [{
        "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }]
    },
    {
      "title": "Memory Available",
      "type": "gauge",
      "targets": [{
        "expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Disk Space",
      "type": "stat",
      "targets": [{
        "expr": "node_filesystem_avail_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024"
      }]
    }
  ]
}
Output: Grafana renders panels from PromQL queries. Common panel types: time-series graphs, gauges (single values), stats (big numbers), tables, and heatmaps.
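Writing the dashboard model by hand gets repetitive once there are many panels. As a rough sketch, the simplified model above could be generated from a list of queries with Python's standard json module. The helper names make_panel and make_dashboard are made up for illustration, and a real Grafana dashboard export carries more fields (panel ids, grid positions, data source references) than this sketch emits.

```python
import json
from typing import Dict, List


def make_panel(title: str, panel_type: str, expr: str) -> Dict:
    """Build one panel in the simplified shape used in the model above."""
    return {"title": title, "type": panel_type, "targets": [{"expr": expr}]}


def make_dashboard(title: str, panels: List[Dict]) -> str:
    """Serialize a dashboard title and its panels to pretty-printed JSON."""
    return json.dumps({"title": title, "panels": panels}, indent=2)


# The same three panels as in the hand-written model.
panels = [
    make_panel(
        "CPU Usage",
        "graph",
        '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
    ),
    make_panel(
        "Memory Available",
        "gauge",
        "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
    ),
    make_panel(
        "Disk Space",
        "stat",
        'node_filesystem_avail_bytes{mountpoint="/"} / 1024 / 1024 / 1024',
    ),
]

dashboard_json = make_dashboard("Server Overview", panels)
print(dashboard_json)
```

Letting json.dumps handle the escaping avoids the hand-written \" sequences inside the PromQL expressions, which are easy to get wrong.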
Alerting
# prometheus-alert-rules.yml (Prometheus only loads it if prometheus.yml lists it under rule_files)
groups:
  - name: server_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
          description: "CPU has been above 80% for more than 5 minutes (value: {{ $value }})"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
# Configure alertmanager.yml
route:
  receiver: "team-email"
  routes:
    - match:
        severity: critical
      receiver: "team-pager"
receivers:
  - name: "team-email"
    email_configs:
      - to: "ops@example.com"
        from: "alert@example.com"
        smarthost: "smtp.example.com:587" # placeholder SMTP relay; email delivery needs a smarthost
  - name: "team-pager"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"
Common Mistakes
- Scraping too frequently: Setting scrape_interval to 1s generates enormous data volume and may overload targets. 15-30s is sufficient for most infrastructure metrics.
- Not creating recording rules for complex queries: Running histogram_quantile(0.95, ...) on every dashboard refresh is slow. Create recording rules that pre-compute expensive queries at every evaluation interval.
- Storing high-cardinality labels: Labels with many unique values (user_id, IP address, email) cause Prometheus to consume excessive memory. As a rule of thumb, keep label cardinality under 100,000.
- Forgetting to set retention and disk limits: Prometheus keeps data for 15 days by default and enforces no size limit. Set --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=50GB so retention matches your needs and the disk cannot fill up.
- Ignoring alert fatigue: Too many alerts with unclear severity cause teams to ignore them. Use distinct severity levels (info, warning, critical) and only page for critical alerts.
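To pick sensible retention values, it helps to estimate how much disk the data will need. The Prometheus storage documentation gives a rough formula: retention time in seconds, times ingested samples per second, times bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; the function name and the fleet size (3 servers with about 1,000 series each) are assumptions for illustration, so treat the result as an order-of-magnitude estimate.

```python
def estimate_disk_bytes(
    series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical fleet: 3 servers exposing ~1,000 series each,
# scraped every 15s and kept for 30 days.
estimate = estimate_disk_bytes(series=3000, scrape_interval_s=15, retention_days=30)
print(f"{estimate / 1e9:.2f} GB")  # prints "1.04 GB"
```

Halving the scrape interval or doubling the number of series doubles the estimate, which is why scrape frequency and label cardinality are worth controlling before reaching for a bigger disk.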
Practice Questions
What is the difference between Prometheus and Grafana? Answer: Prometheus collects and stores time-series metrics and evaluates alert rules. Grafana visualizes metrics from Prometheus (and other data sources) in dashboards and manages alert notifications.
What is a Prometheus exporter? Answer: An exporter is a service that translates metrics from a third-party system into Prometheus format. node_exporter exposes Linux system metrics, and there are exporters for databases, web servers, and cloud services.
How does PromQL handle rate calculations? Answer: rate(counter[5m]) calculates the per-second average rate of increase over the last 5 minutes. It handles counter resets (e.g., server restart) automatically.
What is cardinality and why does it matter? Answer: Cardinality is the number of unique label combinations for a metric. High cardinality (millions of combinations) causes Prometheus to use excessive memory and slow down queries.
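The cardinality answer can be illustrated with a little arithmetic. In the worst case, the number of series for one metric is the product of the number of distinct values of each label. The label names and counts below are hypothetical, and real metrics usually produce fewer series than this upper bound because not every combination occurs.

```python
from math import prod
from typing import Dict


def series_upper_bound(label_values: Dict[str, int]) -> int:
    """Worst-case series count for one metric.

    label_values maps each label name to its number of distinct values;
    the bound is the product of those counts.
    """
    return prod(label_values.values())


# Hypothetical labels on http_requests_total.
bounded = {"instance": 3, "method": 4, "status": 5}
with_user_id = {**bounded, "user_id": 50_000}

print(series_upper_bound(bounded))       # 60
print(series_upper_bound(with_user_id))  # 3000000
```

A single unbounded label turns 60 potential series into 3 million, which is why identifiers such as user IDs belong in logs or traces rather than in metric labels.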
Challenge
Monitor a web application: set up Prometheus to scrape node_exporter and a custom HTTP metrics endpoint, create a Grafana dashboard showing CPU, memory, disk, and request rate, write alerting rules for high CPU (>80%) and low disk (<10%), configure Alertmanager to send Slack notifications, and stress-test the system to trigger alerts.
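When stress-testing the rules in this challenge, keep in mind that an alert with a for clause does not fire the moment the expression becomes true. It is pending first, and it fires only after the condition has held for the whole for duration. The sketch below models that state machine in Python. The function name, the CPU values, and the shortened timings (a 2-minute for with 1-minute evaluations) are assumptions chosen to keep the example short.

```python
from typing import List, Optional


def alert_states(
    values: List[float],
    threshold: float,
    for_seconds: int,
    eval_interval: int,
) -> List[str]:
    """Alert state after each rule evaluation: inactive, pending or firing."""
    states: List[str] = []
    active_since: Optional[int] = None  # time the condition first became true

    for i, value in enumerate(values):
        now = i * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # The condition stopped holding, so the timer starts over.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical CPU percentages, one per evaluation.
cpu = [70, 85, 90, 88, 60, 95, 96, 97, 98]
print(alert_states(cpu, threshold=80, for_seconds=120, eval_interval=60))
# ['inactive', 'pending', 'pending', 'firing', 'inactive',
#  'pending', 'pending', 'firing', 'firing']
```

The dip to 60 at the fifth evaluation resets the timer, so a stress test has to keep the load above the threshold continuously for at least the for duration before an alert reaches Alertmanager.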
Try It Yourself
# Run Prometheus and node_exporter locally
# Terminal 1:
./node_exporter &
# Terminal 2:
./prometheus --config.file=prometheus.yml &
# Terminal 3: Install Grafana and add Prometheus datasource
# http://localhost:3000 -> Add data source -> Prometheus -> http://localhost:9090
# Import a pre-built dashboard (Node Exporter Full - ID 1860)
# Explore PromQL in the Explore tab
What’s Next
| Topic | Description |
|---|---|
| Linux | The OS you’ll monitor with this stack |
| Docker | Run the monitoring stack in containers |
Related topics: Bash, Linux, Docker, AWS
Congratulations on completing this Prometheus & Grafana tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.