Learn DevOps: Incident Response: Handling Production Outages

Incident Response: Handling Production Outages

DodaTech Updated Jun 20, 2026 8 min read

Incident response is the structured process of detecting, responding to, and recovering from production outages — minimizing downtime, communicating clearly, and preventing recurrence through blameless learning.

What You’ll Learn

Incident response process: detection, triage, mitigation, resolution, follow-up
Severity levels, on-call rotations, and escalation policies
Writing effective runbooks and conducting blameless postmortems
Setting up PagerDuty and Opsgenie for alerting and on-call scheduling

Why Incident Response Matters

Without a structured incident response process, an outage becomes chaos: multiple people SSH into the same server, no one knows who is coordinating, communication is fragmented across Slack DMs, and the fix can take 3x longer than necessary. A good incident response process can cut Mean Time to Recovery (MTTR) by 50-80%. DodaTech uses PagerDuty for on-call scheduling and incident routing for Durga Antivirus Pro’s production infrastructure, so every alert reaches the right person with context about the affected system.

    flowchart LR
    A[SRE Principles] --> B[Incident Response]
    B --> C[Detection]
    B --> D[Triage]
    B --> E[Mitigation]
    B --> F[Resolution]
    B --> G[Postmortem]
    C --> H[Alerts / Monitoring]
    D --> I[Severity Assessment]
    E --> J[Rollback / Fix]
    style B fill:#e53e3e,color:#fff

Prerequisites: Understanding of SRE principles and monitoring basics. Familiarity with Linux and Bash.

Incident Response Process

Severity Levels

Severity	Definition	Response Time	Example
SEV-1	Complete outage, all users affected	5 min	Database primary down
SEV-2	Major feature degraded, partial outage	15 min	API latency > 5s
SEV-3	Minor issue, workaround available	1 hour	UI display bug
SEV-4	Cosmetic, no user impact	Next business day	Documentation error
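
The table above can be encoded as data so that alert routing and reporting share one definition. The following is a minimal sketch, not a prescribed implementation: the `SeverityPolicy` class and `route` function are hypothetical names, and the page-versus-ticket split assumes the rule from this article's FAQ (page only for SEV-1 and SEV-2).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response_time: Optional[timedelta]  # None stands for "next business day"
    page_on_call: bool                  # True pages a human, False opens a ticket


# Definitions and response times copied from the severity table.
# The page/ticket column is an assumption based on the FAQ's paging rule.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy("Complete outage, all users affected", timedelta(minutes=5), True),
    "SEV-2": SeverityPolicy("Major feature degraded, partial outage", timedelta(minutes=15), True),
    "SEV-3": SeverityPolicy("Minor issue, workaround available", timedelta(hours=1), False),
    "SEV-4": SeverityPolicy("Cosmetic, no user impact", None, False),
}


def route(severity: str) -> str:
    """Return "page" or "ticket" for a severity label; unknown labels raise KeyError."""
    policy = SEVERITY_POLICIES[severity]
    return "page" if policy.page_on_call else "ticket"


print(route("SEV-1"))  # page
print(route("SEV-3"))  # ticket
```

Raising on an unknown label is a deliberate choice here: a mistyped severity should fail loudly rather than silently become a ticket.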

Incident Commander Role

The Incident Commander (IC) coordinates the response and does NOT debug:

# incident_commander.py
class Incident:
    def __init__(self, severity, service, description):
        self.severity = severity
        self.service = service
        self.description = description
        self.timeline = []
        self.ic = None
        self.responders = []
        self.status = "detected"

    def assign_ic(self, name):
        self.ic = name
        self.timeline.append(f"IC assigned: {name}")

    def add_responder(self, name, role):
        self.responders.append({"name": name, "role": role})
        self.timeline.append(f"Responder added: {name} ({role})")

    def update_status(self, status, message):
        self.status = status
        self.timeline.append(f"[{status.upper()}] {message}")

    def summary(self):
        print("=== Incident Summary ===")
        print(f"Severity: {self.severity}")
        print(f"Service: {self.service}")
        print(f"Description: {self.description}")
        print(f"IC: {self.ic}")
        print(f"Status: {self.status}")
        for entry in self.timeline:
            print(f"  {entry}")

incident = Incident("SEV-1", "api-gateway", "All requests returning 502")
incident.assign_ic("Alice")
incident.add_responder("Bob", "Networking")
incident.add_responder("Carol", "Backend")
incident.update_status("mitigating", "Rolling back last deployment")
incident.update_status("resolved", "Rollback complete, traffic restored")
incident.summary()

Expected output:

=== Incident Summary ===
Severity: SEV-1
Service: api-gateway
Description: All requests returning 502
IC: Alice
Status: resolved
  IC assigned: Alice
  Responder added: Bob (Networking)
  Responder added: Carol (Backend)
  [MITIGATING] Rolling back last deployment
  [RESOLVED] Rollback complete, traffic restored

Communication Templates

# INCIDENT NOTIFICATION
# --------------------
# Severity: SEV-1
# Service: api-gateway
# Start Time: 2024-06-20 10:30 UTC
# Summary: All requests returning 502
# Impact: 100% of users affected
# IC: @alice
# Responders: @bob (networking), @carol (backend)
# Status: Investigation ongoing
# Updates: Every 15 minutes in #incident-api-gateway

# RESOLUTION POSTMORTEM
# --------------------
# What happened: v2.3.1 deployment included a misconfigured upstream
# Root cause: Config change in nginx.conf removed proxy_pass directive
# Detection: Datadog alert "Gateway Error Rate > 1%" fired
# Resolution: Rolled back to v2.3.0
# Time to detection: 2 minutes
# Time to resolution: 12 minutes
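
Filling the notification template by hand during an outage invites missing fields. As a rough sketch, the template can be rendered with Python's standard `string.Template`; the `render_notification` helper and its field names are assumptions chosen to mirror the template above, and the responders line is left out for brevity.

```python
from string import Template

# Placeholders mirror the fields of the incident notification template.
NOTIFICATION = Template(
    "INCIDENT NOTIFICATION\n"
    "--------------------\n"
    "Severity: $severity\n"
    "Service: $service\n"
    "Start Time: $start_time\n"
    "Summary: $summary\n"
    "Impact: $impact\n"
    "IC: $ic\n"
    "Status: $status\n"
    "Updates: Every $update_minutes minutes in $channel\n"
)


def render_notification(**fields: object) -> str:
    """Fill the template; a missing field raises KeyError instead of
    producing a partial update."""
    return NOTIFICATION.substitute(**fields)


print(render_notification(
    severity="SEV-1",
    service="api-gateway",
    start_time="2024-06-20 10:30 UTC",
    summary="All requests returning 502",
    impact="100% of users affected",
    ic="@alice",
    status="Investigation ongoing",
    update_minutes=15,
    channel="#incident-api-gateway",
))
```

`substitute` is used rather than `safe_substitute` so that an incomplete notification fails before it is posted.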

On-Call Rotations

Setting Up PagerDuty

# PagerDuty best practices:
# 1. Create schedules with follow-the-sun rotation
#    - US team: 9am-5pm EST
#    - EU team: 9am-5pm CET
#    - APAC team: 9am-5pm JST
# 2. Set up escalation policies:
#    - Level 1: Primary on-call (15 min)
#    - Level 2: Secondary on-call (15 min)
#    - Level 3: Engineering manager
# 3. Configure service integrations:
#    - Prometheus Alertmanager → PagerDuty webhook
#    - Datadog → PagerDuty direct integration
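
A follow-the-sun schedule only works if the shifts cover all 24 hours, so it is worth checking the shift times before relying on them. The sketch below converts the three shifts listed above to UTC and reports uncovered hours. It assumes fixed standard-time offsets (EST = UTC-5, CET = UTC+1, JST = UTC+9) and ignores daylight saving time and weekends; the function names are hypothetical.

```python
# Each shift: (team, local start hour, local end hour, fixed UTC offset in hours).
# Offsets assume standard time; daylight saving is ignored in this sketch.
SHIFTS = [
    ("US", 9, 17, -5),    # 9am-5pm EST
    ("EU", 9, 17, +1),    # 9am-5pm CET
    ("APAC", 9, 17, +9),  # 9am-5pm JST
]


def covered_utc_hours(shifts):
    """Return the set of UTC hours (0-23) during which some team is on shift."""
    covered = set()
    for _team, start, end, offset in shifts:
        for local_hour in range(start, end):
            covered.add((local_hour - offset) % 24)
    return covered


def uncovered_utc_hours(shifts):
    """Return the sorted UTC hours during which nobody is on shift."""
    return sorted(set(range(24)) - covered_utc_hours(shifts))


print(uncovered_utc_hours(SHIFTS))  # [22, 23]
```

Under these assumptions the three 9am-5pm shifts leave 22:00-24:00 UTC unstaffed. Extending one shift (for example, US until 7pm EST) or adding an overnight on-call layer would close the gap.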

# Alertmanager config for PagerDuty
route:
  receiver: pagerduty-prod
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
        severity: "{{ .CommonLabels.severity }}"
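
The `severity` field in the config above is passed through to PagerDuty, whose Events API v2 accepts only `critical`, `error`, `warning`, or `info`. If alert labels carry this article's SEV levels instead, they need translating first. The sketch below builds (but does not send) a trigger event; the SEV-to-PagerDuty mapping and the `dedup_key` scheme are assumptions for illustration, not part of either tool.

```python
import json

# Hypothetical mapping from this article's SEV labels to PagerDuty severities.
PAGERDUTY_SEVERITY = {
    "SEV-1": "critical",
    "SEV-2": "error",
    "SEV-3": "warning",
    "SEV-4": "info",
}


def build_trigger_event(routing_key: str, sev: str, service: str, summary: str) -> dict:
    """Build an Events API v2 trigger payload as a plain dict (nothing is sent)."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumed scheme: one open PagerDuty incident per service and severity,
        # so repeated alerts for the same problem are deduplicated.
        "dedup_key": f"{service}:{sev}",
        "payload": {
            "summary": summary,
            "source": service,
            "severity": PAGERDUTY_SEVERITY[sev],
        },
    }


event = build_trigger_event(
    "<your-pagerduty-integration-key>", "SEV-1", "api-gateway", "All requests returning 502"
)
print(json.dumps(event, indent=2))
```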

Runbook Structure

Every service must have a runbook:

# Runbook: api-gateway

## Symptoms
- 502 Bad Gateway
- Latency > 5s
- Connection refused

## Initial Checks
1. Check Datadog dashboard: "API Gateway Overview"
2. Check upstream service health
3. Check recent deployments (last 1 hour)

## Common Fixes
### Issue: All 502 errors
1. Identify last deployment: `kubectl rollout history deployment/api-gateway`
2. Rollback: `kubectl rollout undo deployment/api-gateway`
3. Verify: `curl -I https://api.example.com/health`

### Issue: High latency
1. Check upstream response times
2. Scale replicas: `kubectl scale deployment/api-gateway --replicas=10`
3. Check database connection pool

## Escalation
- Primary: Slack #incident-api-gateway
- PagerDuty: page the API on-call (phone number)
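
Runbook steps that are typed from memory under pressure are easy to get wrong. One option is to keep the commands in a small script that prints them for review before anyone runs them. This is a dry-run sketch only: `rollback_plan` is a hypothetical helper, and it prints the runbook's commands rather than executing them.

```python
import shlex


def rollback_plan(deployment: str, health_url: str) -> list:
    """Return the "All 502 errors" runbook steps as argument lists."""
    return [
        ["kubectl", "rollout", "history", f"deployment/{deployment}"],
        ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
        ["curl", "-I", health_url],
    ]


# Print each step for review; executing them is left to the responder.
for command in rollback_plan("api-gateway", "https://api.example.com/health"):
    print(shlex.join(command))
```

Keeping each command as an argument list means it could later be passed to `subprocess.run` without shell quoting issues, if the team decides to automate the step.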

Blameless Postmortem Culture

A blameless postmortem focuses on system failures and process gaps, not individual mistakes.

# Postmortem Template

## Incident Summary
- Date: 2024-06-20
- Duration: 12 minutes
- Severity: SEV-1
- Services affected: api-gateway

## Timeline
- 10:30 UTC - Alert fired: Gateway error rate > 1%
- 10:31 UTC - IC assigned: @alice
- 10:32 UTC - Identified as deployment rollback candidate
- 10:38 UTC - Rollback initiated
- 10:42 UTC - Traffic restored, all healthy

## Root Cause
A config change in nginx.conf (deployed in v2.3.1) removed the proxy_pass
directive. All requests hit NGINX but had no upstream to forward to.

## Action Items
| Action | Owner | Priority |
|--------|-------|----------|
| Add automated nginx config validation to CI pipeline | @bob | P0 |
| Add "proxy_pass present" check to deployment health check | @carol | P0 |
| Runbook: add step for verifying upstream after deploy | @alice | P1 |

## What Went Well
- Alert fired within 2 minutes of deployment
- Rollback completed 4 minutes after it was initiated
- Communication stayed in the designated channel

## What Went Wrong
- No validation step in CI for nginx configs
- Deployment health check didn't verify upstream connectivity
- On-call had to manually identify the problematic deployment
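
Postmortem durations are easy to misstate when computed by eye, so it can help to derive them from the timeline itself. The sketch below uses the timeline entries from the template above and assumes all of them fall on the same day; the function names are hypothetical.

```python
from datetime import datetime

# (UTC time, event) pairs taken from the postmortem timeline.
TIMELINE = [
    ("10:30", "Alert fired"),
    ("10:31", "IC assigned"),
    ("10:32", "Identified as deployment rollback candidate"),
    ("10:38", "Rollback initiated"),
    ("10:42", "Traffic restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Minutes elapsed between each consecutive pair of timeline entries."""
    return [
        (f"{a_label} -> {b_label}", minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(timeline, timeline[1:])
    ]


for phase, minutes in phase_durations(TIMELINE):
    print(f"{minutes:>2} min  {phase}")
print("Total:", minutes_between(TIMELINE[0][0], TIMELINE[-1][0]), "min")
```

For this timeline the longest phase is the 6 minutes between identifying the rollback candidate and starting the rollback, which is where the action items are likely to pay off most.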

Common Mistakes

Not declaring an incident early: Teams hesitate to call an incident, wasting time trying to fix it quietly. Declare early — you can always downgrade. Time wasted hesitating is time the system stays down.
No incident commander: Without a designated IC, multiple people independently try fixes, sometimes conflicting. One person coordinates; everyone else executes.
Poor communication: Updates in private DMs instead of a public channel. Stakeholders (support, management) not informed. The incident channel should be public and linked from the status page.
Blaming individuals in postmortems: If a person made a mistake, the system allowed it — there were no guardrails. Fix the process, not the person.
Not updating runbooks after incidents: The same issue recurs because the fix process was never documented. After every incident, update the relevant runbook with the resolution steps.

Practice Questions

What are the five stages of incident response? Answer: Detection (alert fires), Triage (assess severity), Mitigation (stop the bleeding), Resolution (apply permanent fix), Postmortem (learn and prevent).
What is the role of the Incident Commander? Answer: The IC coordinates the response, communicates status, and triages incoming information. They do NOT debug — their job is to enable others to debug effectively.
What is a blameless postmortem? Answer: A post-incident analysis focused on system and process failures, not individual mistakes. The assumption is that people made the best decisions with the information they had at the time.
How does severity classification affect response? Answer: Higher severity (SEV-1) requires immediate response (5 min), 24/7 IC assignment, and executive notification. Lower severity (SEV-4) can wait until the next business day.

Challenge

Design an incident response system for a three-service application (web, API, database): define severity levels for each service, write runbooks for the top 3 failure scenarios per service, set up PagerDuty schedules with follow-the-sun rotation, create a blameless postmortem template, and conduct a tabletop exercise simulating a database failure.

FAQ

What is the difference between a runbook and a postmortem?

: A runbook is written before an incident and explains how to diagnose and fix known issues. A postmortem is written after an incident and records what happened, why, and how to prevent recurrence.

How do I start an on-call rotation with a small team?

: Start with a primary and secondary on-call rotation. Use the PagerDuty or Opsgenie free tier. Rotate weekly. Make sure an unacknowledged page escalates to the backup within 15 minutes.
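
To make the escalation timing concrete, here is a small sketch that answers "who is being paged after N minutes without acknowledgement?". It reuses the three-level policy from the PagerDuty section (15 minutes per level, then the engineering manager); `current_target` is a hypothetical helper, not a PagerDuty API.

```python
# (level name, minutes before escalating); None means this is the last level.
ESCALATION_LEVELS = [
    ("Primary on-call", 15),
    ("Secondary on-call", 15),
    ("Engineering manager", None),
]


def current_target(minutes_unacknowledged: int) -> str:
    """Return who holds the page after the given minutes without an ack."""
    elapsed = 0
    for name, timeout in ESCALATION_LEVELS:
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return name
        elapsed += timeout
    return ESCALATION_LEVELS[-1][0]


for minute in (0, 14, 15, 30):
    print(minute, current_target(minute))
```

A two-person team can drop the last level and point the final escalation at whoever owns the service.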

What tools do I need for incident response?

: Essential: alerting (PagerDuty/Opsgenie), communication (Slack/Discord), runbooks (Confluence/GitHub), status page (Statuspage.io). Nice to have: war room bridge (Zoom), incident management platform (FireHydrant).

How do I prevent alert fatigue?

: Use severity levels, add for: durations to alert rules so that brief spikes do not fire, deduplicate related alerts, and suppress known issues. Page only for SEV-1 and SEV-2. Create tickets for SEV-3 and SEV-4.
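
The effect of a `for:` duration can be illustrated with a simplified model: the alert fires only once the condition has held for several consecutive samples. This is a sketch of the idea, not Prometheus's exact evaluation logic; `fires_after_for` and the sample values are made up for illustration.

```python
def fires_after_for(samples, threshold, for_samples):
    """Return the index of the sample at which the alert fires, or None.

    The alert fires once `for_samples` consecutive samples exceed `threshold`;
    any sample at or below the threshold resets the count.
    """
    consecutive = 0
    for index, value in enumerate(samples):
        if value > threshold:
            consecutive += 1
            if consecutive >= for_samples:
                return index
        else:
            consecutive = 0
    return None


# Error rate (%) sampled once per minute: one brief spike, then a sustained problem.
error_rates = [0.2, 6.0, 0.3, 5.5, 5.8, 6.1, 7.0]

print(fires_after_for(error_rates, threshold=5.0, for_samples=1))  # 1 (pages on the spike)
print(fires_after_for(error_rates, threshold=5.0, for_samples=3))  # 5 (only the sustained problem)
```

The trade-off is detection delay: a longer duration filters more noise but adds the same amount of time to every real incident.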

What metrics measure incident response effectiveness?

: MTTD (Mean Time to Detect), MTTR (Mean Time to Recovery), MTBF (Mean Time Between Failures), incident frequency, and action item closure rate.
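
These metrics can be computed from a simple incident log. The sketch below uses made-up incident records and one common set of definitions: MTTD and MTTR are measured from the start of the incident, and MTBF is the mean gap between one incident's resolution and the next incident's start. Teams define these differently, so treat the definitions as assumptions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical records: (started, detected, resolved).
INCIDENTS = [
    (datetime(2024, 6, 1, 10, 0), datetime(2024, 6, 1, 10, 4), datetime(2024, 6, 1, 10, 30)),
    (datetime(2024, 6, 11, 8, 0), datetime(2024, 6, 11, 8, 2), datetime(2024, 6, 11, 8, 20)),
    (datetime(2024, 6, 21, 9, 0), datetime(2024, 6, 21, 9, 6), datetime(2024, 6, 21, 9, 40)),
]


def _mean_minutes(deltas):
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents):
    """Mean minutes from incident start to detection."""
    return _mean_minutes([detected - started for started, detected, _ in incidents])


def mttr_minutes(incidents):
    """Mean minutes from incident start to resolution."""
    return _mean_minutes([resolved - started for started, _, resolved in incidents])


def mtbf_hours(incidents):
    """Mean hours between one incident's resolution and the next one's start."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[2] for prev, nxt in zip(ordered, ordered[1:])]
    return mean(g.total_seconds() for g in gaps) / 3600


print(f"MTTD: {mttd_minutes(INCIDENTS):.1f} min")
print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")
print(f"MTBF: {mtbf_hours(INCIDENTS):.1f} h")
```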