Monitoring & Alerting Automation — Build Smart Notification Systems

DodaTech Updated 2026-06-22 8 min read

This tutorial covers monitoring and alerting automation: the core concepts, working Python examples, and the mistakes that most often undermine an alerting setup.

Build automated monitoring and alerting systems: configure health checks, set up intelligent notifications, integrate with Slack and email, and reduce alert fatigue with actionable examples.

What You'll Learn

You will learn to design monitoring systems that check application health, database connectivity, and resource usage, then route alerts through the right channels with appropriate severity levels -- ensuring your team gets notified without being overwhelmed.

Why It Matters

Without automated monitoring, you discover outages when users email you. Alerting automation detects problems within seconds, notifies the right person, and provides context for faster remediation. A well-designed alerting system reduces mean time to detection (MTTD) from hours to seconds.

Real-World Use

The Doda Browser backend infrastructure uses a multi-layered monitoring stack. If the search API response time exceeds 500ms for more than 30 seconds, the system sends a Slack alert with the current latency graph and the affected region. If the error rate exceeds 5%, it pages the on-call engineer via PagerDuty. This automation has reduced outage detection time from 15 minutes to under 30 seconds.
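
A rule such as "above 500ms for more than 30 seconds" needs state, because a single slow sample should not fire. The following is a minimal sketch of one way to express it, not the actual Doda implementation; the class name `SustainedThreshold` and its parameters are hypothetical.

```python
class SustainedThreshold:
    """Fires only after a value has stayed above a limit for a hold period."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp of the first sample over the limit

    def observe(self, timestamp, value):
        """Record one sample; return True once the breach has lasted long enough."""
        if value <= self.limit:
            # A healthy sample ends the breach and resets the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started > self.hold_seconds


# Hypothetical latency samples: (seconds since start, latency in ms)
latency_rule = SustainedThreshold(limit=500, hold_seconds=30)
for second, latency_ms in [(0, 620), (15, 640), (45, 700)]:
    if latency_rule.observe(second, latency_ms):
        print(f"t={second}s: latency above 500ms for more than 30 seconds")
```

Only the sample at 45 seconds triggers the message, because that is the first point at which the breach has lasted longer than the hold period.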

Your Learning Path

flowchart LR
  A[Python Automation Scripts] --> B[Monitoring Alerting]
  B --> C[Infrastructure Automation]
  C --> D[AI Code Generation]
  B --> F{You Are Here}
  style F fill:#f90,color:#fff

ℹ Info

Prerequisites: Familiarity with Python scripting and HTTP APIs. Understanding of basic system administration concepts like processes, ports, and resource usage.

Health Check Automation

The foundation of monitoring is regularly checking that services are alive and responding correctly.

import requests
import time
from datetime import datetime

SERVICES = {
    "web-app": "https://app.example.com/health",
    "api": "https://api.example.com/health",
    "database": "https://db.internal.example.com/health",
}

def check_service(name, url, timeout=10):
    """Check a single service health endpoint."""
    start = time.time()
    try:
        response = requests.get(url, timeout=timeout)
        elapsed = time.time() - start
        status = "UP" if response.status_code == 200 else "DEGRADED"
        return {
            "service": name,
            "status": status,
            "status_code": response.status_code,
            "response_time_ms": round(elapsed * 1000),
            "timestamp": datetime.utcnow().isoformat(),
        }
    except requests.exceptions.RequestException as e:
        return {
            "service": name,
            "status": "DOWN",
            "error": str(e),
            "timestamp": datetime.utcnow().isoformat(),
        }

def run_health_checks():
    results = []
    for name, url in SERVICES.items():
        result = check_service(name, url)
        results.append(result)
        icon = {"UP": "OK", "DEGRADED": "WARN", "DOWN": "FAIL"}
        print(f"[{icon[result['status']]}] {name}: {result['status']}")
    return results

results = run_health_checks()

Example output (actual results depend on the state of your services):

[OK] web-app: UP
[OK] api: UP
[FAIL] database: DOWN
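
Health check results become useful once they are turned into alerts. The sketch below converts result dictionaries of the shape returned by `check_service` into alert dictionaries with `severity`, `service`, `title`, and `body` keys. The mapping of DOWN to critical and DEGRADED to warning is an assumption, and `results_to_alerts` is a hypothetical helper name.

```python
# Assumed mapping from health status to alert severity; tune to your needs.
STATUS_SEVERITY = {"DOWN": "critical", "DEGRADED": "warning"}

def results_to_alerts(results):
    """Convert health check results into alert dicts; UP services are skipped."""
    alerts = []
    for result in results:
        severity = STATUS_SEVERITY.get(result["status"])
        if severity is None:
            continue  # status is UP, nothing to report
        # DOWN results carry an "error"; DEGRADED results carry a status code.
        detail = result.get("error") or f"HTTP {result.get('status_code')}"
        alerts.append({
            "severity": severity,
            "service": result["service"],
            "title": f"{result['service']} is {result['status']}",
            "body": f"{detail} at {result['timestamp']}",
        })
    return alerts


# Hand-written sample results in the same shape check_service produces
sample_results = [
    {"service": "web-app", "status": "UP", "status_code": 200,
     "response_time_ms": 42, "timestamp": "2026-06-22T10:00:00"},
    {"service": "database", "status": "DOWN", "error": "connection refused",
     "timestamp": "2026-06-22T10:00:00"},
]
for alert in results_to_alerts(sample_results):
    print(alert["severity"], "-", alert["title"])
```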

Intelligent Alert Routing

Different problems need different notification channels. A critical outage pages the on-call engineer. A minor warning sends a Slack message. Informational messages go to email.

import smtplib
from datetime import datetime
from email.message import EmailMessage

import requests

ALERT_CONFIG = {
    "critical": {"channels": ["pagerduty", "slack", "email"], "repeat_interval": 300},
    "warning": {"channels": ["slack", "email"], "repeat_interval": 1800},
    "info": {"channels": ["email"], "repeat_interval": 86400},
}

def send_slack_alert(webhook_url, message, severity):
    """Send alert to Slack via webhook."""
    color = {"critical": "danger", "warning": "warning", "info": "good"}
    payload = {
        "attachments": [{
            "color": color.get(severity, "good"),
            "title": f"[{severity.upper()}] {message['title']}",
            "text": message["body"],
            "fields": [
                {"title": "Service", "value": message.get("service", "N/A"), "short": True},
                {"title": "Time", "value": datetime.utcnow().isoformat(), "short": True},
            ],
        }]
    }
    response = requests.post(
        webhook_url,
        json=payload,
        timeout=10
    )
    return response.status_code == 200

def send_email_alert(recipient, subject, body):
    """Send alert email."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {subject}"
    msg["From"] = "alerts@example.com"
    msg["To"] = recipient
    msg.set_content(body)

    # Placeholder credentials: load real ones from the environment or a secrets manager.
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("alerts@example.com", "password")
        server.send_message(msg)

def route_alert(alert_data):
    """Route alert to appropriate channels based on severity."""
    severity = alert_data.get("severity", "info")
    config = ALERT_CONFIG.get(severity, ALERT_CONFIG["info"])

    for channel in config["channels"]:
        if channel == "slack":
            send_slack_alert(
                "https://hooks.slack.com/services/xxx",
                alert_data,
                severity
            )
        elif channel == "email":
            send_email_alert(
                "team@example.com",
                alert_data["title"],
                alert_data["body"]
            )
        elif channel == "pagerduty":
            # PagerDuty API integration
            print(f"PAGERDUTY: {alert_data['title']}")

    print(f"Alert routed: severity={severity}, channels={config['channels']}")

# Example: route a critical alert
route_alert({
    "severity": "critical",
    "title": "Database connection pool exhausted",
    "body": "Connection pool at 98% capacity on db-primary. "
            "Active connections: 196/200. Consider scaling up.",
    "service": "database-primary",
})

Expected behavior: The critical alert is routed to PagerDuty (a print placeholder in this example; a real integration would page the on-call engineer), posted to a Slack channel with a red indicator, and emailed to the team. Less severe alerts skip the pager.
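
The PagerDuty branch in the routing example only prints a line. The sketch below shows roughly what a real trigger could look like using the PagerDuty Events API v2, which accepts a JSON event at `https://events.pagerduty.com/v2/enqueue`. The helper names and the dedup key format are assumptions, the routing key is a placeholder for your own integration key, and the field names should be checked against the current PagerDuty documentation before use.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_pagerduty_event(routing_key, alert_data):
    """Build an Events API v2 trigger event from an alert dict."""
    service = alert_data.get("service", "unknown")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumed key format: repeated alerts with the same key join one incident.
        "dedup_key": f"{service}:{alert_data['title']}",
        "payload": {
            "summary": alert_data["title"],
            "source": service,
            "severity": alert_data.get("severity", "info"),
            "custom_details": {"body": alert_data.get("body", "")},
        },
    }

def send_pagerduty_event(event, timeout=10):
    """POST the event; the API answers 202 when the event is accepted."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 202


# Build (but do not send) an event for inspection; the key is a placeholder.
event = build_pagerduty_event("YOUR_INTEGRATION_KEY", {
    "severity": "critical",
    "title": "Database connection pool exhausted",
    "service": "database-primary",
})
print(json.dumps(event, indent=2))
```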

Resource Monitoring Script

import psutil
import json
from datetime import datetime

def collect_system_metrics():
    """Collect key system health metrics."""
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage("/")

    metrics = {
        "timestamp": datetime.utcnow().isoformat(),
        "cpu": {
            "percent": cpu_percent,
            "count": psutil.cpu_count(),
        },
        "memory": {
            "total_gb": round(memory.total / (1024**3), 2),
            "available_gb": round(memory.available / (1024**3), 2),
            "percent": memory.percent,
        },
        "disk": {
            "total_gb": round(disk.total / (1024**3), 2),
            "free_gb": round(disk.free / (1024**3), 2),
            "percent": disk.percent,
        },
        "load_avg": psutil.getloadavg(),
    }

    print(json.dumps(metrics, indent=2))
    return metrics

def check_thresholds(metrics):
    """Evaluate metrics against alert thresholds."""
    alerts = []

    if metrics["cpu"]["percent"] > 90:
        alerts.append({
            "severity": "warning",
            "title": "High CPU usage",
            "body": f"CPU at {metrics['cpu']['percent']}%",
        })

    if metrics["memory"]["percent"] > 90:
        alerts.append({
            "severity": "critical",
            "title": "Critical memory usage",
            "body": f"Memory at {metrics['memory']['percent']}%",
        })

    if metrics["disk"]["percent"] > 85:
        alerts.append({
            "severity": "warning",
            "title": "Disk space running low",
            "body": f"Disk at {metrics['disk']['percent']}%",
        })

    return alerts

metrics = collect_system_metrics()
alerts = check_thresholds(metrics)
# route_alert() is defined in the alert routing example above.
for alert in alerts:
    route_alert(alert)

Expected output: The script prints the full metrics JSON, then routes any alerts that cross the defined thresholds.
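
As the number of metrics grows, one `if` block per threshold becomes repetitive. One possible alternative, sketched here with the same limits as the script above, is to keep the thresholds as data. `THRESHOLD_RULES` and `check_thresholds_from_rules` are hypothetical names, and the function assumes the metrics dictionary has the shape produced by `collect_system_metrics`.

```python
THRESHOLD_RULES = [
    # (metric group, limit in percent, severity, title)
    ("cpu", 90, "warning", "High CPU usage"),
    ("memory", 90, "critical", "Critical memory usage"),
    ("disk", 85, "warning", "Disk space running low"),
]

def check_thresholds_from_rules(metrics, rules=THRESHOLD_RULES):
    """Return one alert for each rule whose metric exceeds its limit."""
    alerts = []
    for group, limit, severity, title in rules:
        percent = metrics[group]["percent"]
        if percent > limit:
            alerts.append({
                "severity": severity,
                "title": title,
                "body": f"{group} at {percent}% (limit {limit}%)",
            })
    return alerts


# Hand-written sample metrics; only the CPU value is over its limit.
sample_metrics = {
    "cpu": {"percent": 95.0},
    "memory": {"percent": 50.0},
    "disk": {"percent": 85.0},
}
for alert in check_thresholds_from_rules(sample_metrics):
    print(alert["severity"], "-", alert["body"])
```

Adding a new threshold then means adding one tuple rather than another conditional.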

De-duplication and Alert Fatigue Prevention

Alert fatigue happens when teams receive too many notifications and start ignoring them. De-duplication solves this.

import time
from collections import defaultdict

class AlertManager:
    """Manages alert de-duplication and throttling."""

    def __init__(self):
        self.recent_alerts = defaultdict(list)
        self.cooldown = {
            "critical": 300,
            "warning": 1800,
            "info": 86400,
        }

    def should_send(self, alert):
        """Check if an alert should be sent or suppressed."""
        # Fall back to "unknown" so alerts without a service key do not raise KeyError.
        key = f"{alert.get('service', 'unknown')}:{alert['title']}"
        now = time.time()
        cooldown = self.cooldown.get(alert.get("severity", "info"), 3600)

        # Remove old entries
        self.recent_alerts[key] = [
            t for t in self.recent_alerts[key]
            if now - t < cooldown
        ]

        if self.recent_alerts[key]:
            return False

        self.recent_alerts[key].append(now)
        return True

manager = AlertManager()

# Simulate repeated alerts
for i in range(5):
    alert = {
        "severity": "warning",
        "service": "web-app",
        "title": "Response time > 2s",
        "body": f"Check #{i + 1}",
    }
    if manager.should_send(alert):
        print(f"SENT: {alert['title']}")
    else:
        print(f"SUPPRESSED: {alert['title']} (duplicate)")
    time.sleep(2)

Expected output:

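SENT: Response time > 2s
SUPPRESSED: Response time > 2s (duplicate)
SUPPRESSED: Response time > 2s (duplicate)
SUPPRESSED: Response time > 2s (duplicate)
SUPPRESSED: Response time > 2s (duplicate)

Only the first of the five alerts is sent. The other four arrive within the 1800-second cooldown for warnings and are suppressed.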

Common Monitoring and Alerting Mistakes

1. Alerting Without Context

A message like "Server down" gives no information about what is affected or how to fix it. Include the service name, error details, affected users, timestamps, and a link to the runbook.
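
One way to make context hard to forget is to build every alert through a helper that requires it. This is a minimal sketch; `build_alert`, its parameters, and the runbook URL are hypothetical.

```python
from datetime import datetime, timezone

def build_alert(severity, service, title, error, affected, runbook_url):
    """Assemble an alert dict whose body carries the context a responder needs."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    body = "\n".join([
        f"Service: {service}",
        f"Error: {error}",
        f"Affected: {affected}",
        f"Detected: {timestamp}",
        f"Runbook: {runbook_url}",
    ])
    return {"severity": severity, "service": service, "title": title, "body": body}


alert = build_alert(
    severity="critical",
    service="database-primary",
    title="Database connection pool exhausted",
    error="196/200 connections in use",
    affected="all API requests that hit the database",  # hypothetical impact
    runbook_url="https://wiki.example.com/runbooks/db-pool",  # hypothetical URL
)
print(alert["body"])
```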

2. No De-duplication

Without de-duplication, a single flapping service can send hundreds of alerts per minute. Implement cooldown periods and group related alerts.
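
Grouping can be as simple as merging all pending alerts for a service into one summary. The sketch below assumes alerts are dictionaries with `severity`, `service`, and `title` keys; `group_alerts` and `SEVERITY_RANK` are hypothetical names.

```python
from collections import defaultdict

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

def group_alerts(alerts):
    """Merge alerts for the same service into one summary alert each."""
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert.get("service", "unknown")].append(alert)

    grouped = []
    for service, items in by_service.items():
        # The summary takes the highest severity found in the group.
        worst = max(items, key=lambda a: SEVERITY_RANK.get(a.get("severity"), 0))
        grouped.append({
            "severity": worst.get("severity", "info"),
            "service": service,
            "title": f"{len(items)} alert(s) for {service}",
            "body": "\n".join(f"- {a['title']}" for a in items),
        })
    return grouped


pending = [
    {"severity": "warning", "service": "web-app", "title": "Response time > 2s"},
    {"severity": "critical", "service": "web-app", "title": "Error rate > 5%"},
    {"severity": "info", "service": "cache", "title": "Cache restarted"},
]
for summary in group_alerts(pending):
    print(summary["severity"], "-", summary["title"])
```

Here three pending alerts collapse into two notifications, one per service.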

3. Alerting on Symptoms Instead of Causes

High CPU is a symptom, not a cause. The real cause might be a memory leak, a traffic surge, or a problematic cron job. Include root cause investigation in your runbooks.

4. Too Many Low-Severity Alerts

If every minor warning goes to the same channel, critical alerts get buried. Route by severity and let the team configure their notification preferences.

5. No Escalation Path

What happens if the on-call engineer does not acknowledge an alert within 15 minutes? Define an escalation path: primary -> secondary -> manager -> incident commander.
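
The escalation path above can be sketched as a lookup on how long an alert has gone unacknowledged. This sketch assumes every level gets the same 15-minute window, which the text does not specify; the names are hypothetical.

```python
ESCALATION_PATH = ["primary", "secondary", "manager", "incident commander"]
ACK_TIMEOUT_MINUTES = 15  # assumed to apply to every level

def current_escalation_level(minutes_unacknowledged,
                             path=ESCALATION_PATH,
                             timeout=ACK_TIMEOUT_MINUTES):
    """Return who should be notified after the given unacknowledged time."""
    step = int(minutes_unacknowledged // timeout)
    # Stay on the last level once the path is exhausted.
    return path[min(step, len(path) - 1)]


for minutes in (0, 15, 30, 45):
    print(f"{minutes} min unacknowledged -> notify {current_escalation_level(minutes)}")
```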

6. Testing Only on Healthy Systems

Monitoring that has never been tested during an actual outage will fail when you need it most. Conduct chaos engineering exercises and scheduled failover drills.

7. Hardcoding Alert Recipients

When team members join or leave, hardcoded email addresses become stale. Use a team roster from your HR system or a configuration file.
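
A configuration file can be as small as a JSON object that maps each severity to a list of addresses. The sketch below assumes that file shape, for example `{"critical": ["oncall@example.com"], "default": ["team@example.com"]}`; the function names and the `default` fallback key are hypothetical.

```python
import json
from pathlib import Path

def load_roster(path):
    """Read the roster file; assumed shape: {"severity": ["address", ...]}."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def recipients_for(roster, severity):
    """Return recipients for a severity, falling back to the "default" entry."""
    return roster.get(severity) or roster.get("default", [])


# In-memory roster with the same shape the file is assumed to have
roster = {
    "critical": ["oncall@example.com", "team@example.com"],
    "default": ["team@example.com"],
}
print(recipients_for(roster, "critical"))
print(recipients_for(roster, "info"))
```

Changing who receives alerts then means editing the roster file, not the code.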

Practice Questions

1. What is alert fatigue and how do you prevent it? Alert fatigue occurs when teams receive too many notifications and start ignoring them. Prevent it with de-duplication, severity-based routing, appropriate cooldown periods, and eliminating noisy alerts.

2. What three channels should a critical alert use? PagerDuty (or similar) for immediate notification, Slack for team visibility, and email for documentation. Critical alerts must wake someone up.

3. Why should you include context in alerts? Context tells the responder what is affected, why it matters, and where to start fixing it. Without context, responders waste time investigating before they can act.

4. What is an escalation path in alerting? A defined sequence of people to notify if the primary responder does not acknowledge an alert within a specified time. Typically: primary -> secondary -> manager.

5. Challenge: Build a monitoring script that checks three services, sends a Slack alert if any are down, implements 5-minute de-duplication, and logs all checks to a file with timestamps.

Mini Project: Multi-Service Health Dashboard

Build a Python script that monitors at least five services (web, API, database, cache, queue), checks them every 60 seconds, maintains an in-memory status history, sends Slack alerts with severity routing, logs every check to a JSON file, and exposes a simple HTTP endpoint returning the current status for an external dashboard.

import time
import json
from datetime import datetime
from pathlib import Path

class HealthMonitor:
    def __init__(self, check_interval=60):
        self.services = {}
        self.history = []
        self.interval = check_interval

    def register(self, name, check_fn, severity="warning"):
        self.services[name] = {"check": check_fn, "severity": severity}

    def check_all(self):
        results = {}
        for name, config in self.services.items():
            try:
                status = config["check"]()
                results[name] = {"status": "UP", "data": status}
            except Exception as e:
                results[name] = {"status": "DOWN", "error": str(e)}
        return results

    def run(self, duration=None):
        start = time.time()
        while True:
            if duration and time.time() - start > duration:
                break
            results = self.check_all()
            entry = {
                "timestamp": datetime.utcnow().isoformat(),
                "results": results,
            }
            self.history.append(entry)
            for name, result in results.items():
                print(f"[{result['status']}] {name}")
            time.sleep(self.interval)

# Example usage
monitor = HealthMonitor(check_interval=10)
# Register service checks...
monitor.run(duration=30)

Expected behavior: The monitor runs for 30 seconds, checking all registered services every 10 seconds, and printing the status of each.
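
The starter class leaves the JSON log and the HTTP endpoint to you. The sketch below shows one possible approach using only the standard library: results are appended to a JSON Lines file, and a background thread serves the latest status at `/status`. The function names, the `/status` path, and the default port are assumptions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def append_check_log(path, entry):
    """Append one check result to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")

def make_status_handler(get_status):
    """Build a handler class that serves get_status() as JSON on /status."""
    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            body = json.dumps(get_status()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep per-request logging out of the monitor's output

    return StatusHandler

def start_status_server(get_status, host="127.0.0.1", port=8080):
    """Serve the status endpoint from a daemon thread and return the server."""
    server = HTTPServer((host, port), make_status_handler(get_status))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With the `HealthMonitor` above, you could call `start_status_server(lambda: monitor.history[-1] if monitor.history else {})` before `monitor.run()`, and call `append_check_log("checks.jsonl", entry)` inside the loop after each check.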

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
