Learn DevOps: Prometheus and Grafana Guide — Monitoring and Observability

DodaTech Updated Jun 7, 2026 6 min read

Prometheus is an open-source monitoring system that collects time-series metrics from your infrastructure, and Grafana turns those metrics into dashboards and alerts. Together they form one of the most widely used open-source stacks for observability.

What You’ll Learn

Installing and configuring Prometheus and node_exporter
Collecting system metrics with Prometheus exporters
Querying metrics with PromQL for insights
Building Grafana dashboards from Prometheus data
Setting up alerting rules in Prometheus and Grafana

Why Prometheus and Grafana Matter

When a server goes down or an API starts responding slowly, you need to know immediately — and you need data to find the root cause. Prometheus scrapes metrics on a schedule and stores them efficiently, while Grafana visualizes those metrics in real-time dashboards. Together, they replace expensive proprietary monitoring tools with a flexible, open-source stack that scales from a single server to a global infrastructure. DodaTech uses Prometheus and Grafana to monitor Durga Antivirus Pro’s update servers — tracking request rates, error rates, and server resource usage across a global fleet of update distribution nodes.

    flowchart LR
    A[Bash & Linux Basics] --> B[Prometheus & Grafana]
    B --> C[Prometheus Server]
    B --> D[Exporters]
    B --> E[PromQL]
    B --> F[Grafana Dashboards]
    C --> G[Time-Series DB]
    D --> H[Metrics Collection]
    E --> I[Querying & Aggregation]
    F --> J[Visualization & Alerting]
    style B fill:#e6522c,color:#fff

Prerequisites: Basic Bash and Linux command-line skills. Familiarity with server administration and networking concepts.

Core Concepts

Installing the Stack

# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar -xzf prometheus-2.47.0.linux-amd64.tar.gz
# Use the full directory name: a bare prometheus-* glob also matches the tarball and makes cd fail
cd prometheus-2.47.0.linux-amd64
./prometheus --config.file=prometheus.yml &
cd ..

# Install node_exporter (system metrics)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
./node_exporter &

# node_exporter exposes metrics at http://localhost:9100/metrics
curl http://localhost:9100/metrics | head -20

# Output:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode
# TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 1234567.89
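
The lines served at /metrics follow the Prometheus text exposition format: a HELP line, a TYPE line, then one sample per label set. As a rough illustration of what an exporter produces, the sketch below renders a counter in that format using only the Python standard library. The names `render_counter` and `escape_label_value` are hypothetical helpers, not part of any Prometheus library, and a real exporter would normally use an official client library instead of building strings by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to the
    current counter value. An empty tuple means "no labels".
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only, mirroring the node_exporter sample above.
    print(
        render_counter(
            "node_cpu_seconds_total",
            "Seconds the CPUs spent in each mode.",
            {(("cpu", "0"), ("mode", "idle")): 1234567.89},
        ),
        end="",
    )
```

Each distinct label tuple becomes its own line, which is also why every label combination counts as a separate time series once Prometheus scrapes it.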

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape metrics
  evaluation_interval: 15s  # How often to evaluate rules

# Targets to scrape
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets:
          - "server1:9100"
          - "server2:9100"
          - "server3:9100"

  - job_name: "api"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["api.example.com:3000"]

PromQL Queries

PromQL is the query language for Prometheus:

# CPU usage (percentage) — rate of non-idle CPU time
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Output: 
# {instance="server1:9100"} 23.45
# {instance="server2:9100"} 12.78

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space remaining (percentage)
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# HTTP request rate (over last 5 minutes)
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

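# API error rate (percentage of requests answered with a 5xx status)
# Aggregate both sides with sum(); otherwise each 5xx series is divided only by itself and the result is always 100
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100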

Output: PromQL returns time-series data that Grafana visualizes as graphs, gauges, and tables. The queries aggregate across instances and time windows.
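
To make the 95th percentile query more concrete, here is a simplified Python sketch of the linear interpolation that histogram_quantile applies to classic bucket data. It assumes the buckets are already cumulative counts sorted by upper bound and ending with the +Inf bucket, as the `le` label does. The function and the bucket values are illustrative, and edge cases of the real implementation (NaN inputs, negative bounds, native histograms) are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total

    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    if idx == len(buckets) - 1:
        # The quantile lies in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[idx]
    # The first bucket is assumed to start at 0 (true for latency-style metrics).
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: le=0.1, 0.5, 1, +Inf.
    latency = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, latency))
```

With these sample buckets the estimate is 0.8125 seconds: the 95th observation falls in the 0.5 to 1 second bucket, five eighths of the way through it. The result can never be more precise than the bucket boundaries, so bucket layout matters as much as the query.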

Grafana Dashboards

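# Install Grafana (Debian/Ubuntu): add the signing key first, or apt rejects the repository
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana

# Start Grafana
sudo systemctl start grafana-server

# Access at http://localhost:3000 (default login admin/admin; change the password on first login)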

// Grafana dashboard JSON model (simplified)
{
  "title": "Server Overview",
  "panels": [
    {
      "title": "CPU Usage",
      "type": "timeseries",
      "targets": [{
        "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "legendFormat": "{{ instance }}"
      }]
    },
    {
      "title": "Memory Available",
      "type": "gauge",
      "targets": [{
        "expr": "(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Disk Space",
      "type": "stat",
      "targets": [{
        "expr": "node_filesystem_avail_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024"
      }]
    }
  ]
}

Output: Grafana renders panels from PromQL queries. Common panel types: time-series graphs, gauges (single values), stats (big numbers), tables, and heatmaps.

Alerting

# prometheus-alert-rules.yml
# Load this file by listing it under rule_files: in prometheus.yml
groups:
  - name: server_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
          description: "CPU has been above 80% for more than 5 minutes (value: {{ $value }})"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

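# alertmanager.yml
# Prometheus finds Alertmanager through the alerting.alertmanagers section of prometheus.yml
route:
  receiver: "team-email"
  routes:
    - matchers:
        - severity="critical"
      receiver: "team-pager"

receivers:
  - name: "team-email"
    email_configs:
      - to: "ops@example.com"
        from: "alert@example.com"
        smarthost: "smtp.example.com:587"  # placeholder; email needs an SMTP relay
  - name: "team-pager"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"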
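
The `for:` field in the alerting rules is easy to misread, so here is a small Python sketch of how an alert moves from inactive to pending to firing. It assumes one sample per rule evaluation and a simple "value above threshold" condition. The function name and the sample data are hypothetical, and details of the real rule engine (missing series, keep_firing_for, resolved notifications) are left out.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each rule evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    The alert condition is `value > threshold`.
    """
    states = []
    active_since = None  # when the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            # Pending until the condition has held for the full `for` duration.
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # A single false evaluation resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # CPU percentages evaluated once a minute, with a 2-minute `for` window.
    cpu = [(0, 70), (60, 85), (120, 90), (180, 95), (240, 60), (300, 85)]
    for (ts, value), state in zip(cpu, alert_states(cpu, threshold=80, for_seconds=120)):
        print(f"t={ts:>3}s cpu={value}% -> {state}")
```

In this run the alert turns pending at t=60s, fires at t=180s, and drops back to inactive at t=240s. A short dip below the threshold restarts the wait, which is what keeps brief spikes from paging anyone.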

Common Mistakes

Scraping too frequently: Setting scrape_interval to 1s generates enormous data volume and may overload targets. 15-30s is sufficient for most infrastructure metrics.
Not creating recording rules for complex queries: Running histogram_quantile(0.95, ...) on every dashboard refresh is slow. Create recording rules that pre-compute expensive queries at each rule evaluation (evaluation_interval), and point dashboards at the recorded series.
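Storing high-cardinality labels: Every unique combination of label values becomes its own time series, so labels with many unique values (user_id, IP address, email) cause Prometheus to consume excessive memory. Keep such values out of labels and, as a rule of thumb, keep label cardinality under 100,000.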
Forgetting to set retention and disk limits: Prometheus keeps 15 days of data by default and applies no size limit, so a large or growing set of series can still fill the disk. Set --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=50GB to match your needs and prevent disk exhaustion.
Ignoring alert fatigue: Too many alerts with unclear severity cause teams to ignore them. Use distinct severity levels (info, warning, critical) and only page for critical alerts.
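
The scrape interval and retention settings above interact, and a back-of-the-envelope estimate helps when choosing them. The sketch below uses the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample (about 1 to 2 bytes after compression). The function name and the example numbers are assumptions for illustration, and real usage varies with churn and label sizes.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 100,000 active series kept for 30 days.
    for interval in (15, 1):
        size_gb = estimate_disk_bytes(100_000, interval, 30) / 1e9
        print(f"scrape_interval={interval}s -> about {size_gb:.1f} GB")
```

Under these assumptions a 15s interval needs about 35 GB for 30 days, while a 1s interval needs fifteen times as much. This is the arithmetic behind the advice to avoid very short scrape intervals.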

Practice Questions

What is the difference between Prometheus and Grafana? Answer: Prometheus collects and stores time-series metrics and evaluates alert rules. Grafana visualizes metrics from Prometheus (and other data sources) in dashboards and manages alert notifications.
What is a Prometheus exporter? Answer: An exporter is a service that translates metrics from a third-party system into Prometheus format. node_exporter exposes Linux system metrics, and there are exporters for databases, web servers, and cloud services.
How does PromQL handle rate calculations? Answer: rate(counter[5m]) calculates the per-second average rate of increase over the last 5 minutes. It handles counter resets (e.g., server restart) automatically.
What is cardinality and why does it matter? Answer: Cardinality is the number of unique label combinations for a metric. High cardinality (millions of combinations) causes Prometheus to use excessive memory and slow down queries.
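
To see how rate() copes with counter resets, consider this simplified Python sketch. It sums the increases between consecutive samples and treats any drop in value as a restart from zero. This is a deliberate simplification: the real rate() also extrapolates to the edges of the range window, which this sketch skips by dividing by the time between the first and last sample. The function name and sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second rate of increase for a counter.

    `samples` is a list of (timestamp_seconds, value) pairs inside the
    range window, oldest first. Returns None when fewer than two samples
    are available, since no rate can be computed from a single point.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None

    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: the process restarted and counted up from 0 again,
            # so the whole current value is new increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / elapsed


if __name__ == "__main__":
    # The counter drops from 130 to 10 between t=15s and t=30s (a restart).
    window = [(0, 100), (15, 130), (30, 10), (45, 40)]
    print(simple_rate(window))
```

Here the increases are 30, 10, and 30, giving 70 requests over 45 seconds, or roughly 1.56 per second. Subtracting the first value from the last would give a negative number, which is why rate() should be used on counters instead of raw differences.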

Challenge

Monitor a web application: set up Prometheus to scrape node_exporter and a custom HTTP metrics endpoint, create a Grafana dashboard showing CPU, memory, disk, and request rate, write alerting rules for high CPU (>80%) and low disk (<10%), configure Alertmanager to send Slack notifications, and stress-test the system to trigger alerts.

FAQ

Is Prometheus suitable for log data?

Prometheus is designed for metrics (numbers), not logs. Use Elasticsearch or Loki for log aggregation and Prometheus for metrics.

Can Prometheus monitor Docker containers?

Yes. Use cAdvisor for container-level metrics, and Prometheus's built-in Docker service discovery (docker_sd_configs) to find containers to scrape.

How does Prometheus handle high availability?

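Prometheus doesn’t natively support clustering. For HA, run two identical Prometheus servers in parallel, each scraping the same targets, and put a load balancer in front so Grafana can query either. Point both servers at the same Alertmanager, which deduplicates the alerts they send.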

What is Thanos?

Thanos extends Prometheus with global querying across multiple Prometheus instances, long-term retention in object storage, and downsampling of old data. It’s used for large-scale deployments.

Can I use Grafana without Prometheus?

Yes. Grafana supports many data sources: InfluxDB, Graphite, Elasticsearch, MySQL, PostgreSQL, CloudWatch, Azure Monitor, and more.

Try It Yourself

# Run Prometheus and node_exporter locally
# Terminal 1:
./node_exporter &

# Terminal 2:
./prometheus --config.file=prometheus.yml &

# Terminal 3: Install Grafana and add Prometheus datasource
# http://localhost:3000 -> Add data source -> Prometheus -> http://localhost:9090

# Import a pre-built dashboard (Node Exporter Full - ID 1860)
# Explore PromQL in the Explore tab

What’s Next

Topic	Description
Linux	The OS you’ll monitor with this stack
Docker Compose	Run the monitoring stack in containers

Related topics: Bash, Linux, Docker, AWS
