Incident Response: Handling Production Outages
Incident response is the structured process of detecting, responding to, and recovering from production outages — minimizing downtime, communicating clearly, and preventing recurrence through blameless learning.
What You’ll Learn
- Incident response process: detection, triage, mitigation, resolution, follow-up
- Severity levels, on-call rotations, and escalation policies
- Writing effective runbooks and conducting blameless postmortems
- Setting up PagerDuty and Opsgenie for alerting and on-call scheduling
Why Incident Response Matters
Without a structured incident response process, an outage becomes chaos: multiple people SSH into the same server, no one knows who’s coordinating, communication is fragmented across Slack DMs, and the fix can take up to 3x longer than necessary. A good incident response process can cut Mean Time to Recovery (MTTR) by as much as 50-80%. DodaTech uses PagerDuty for on-call scheduling and incident routing for Durga Antivirus Pro’s production infrastructure, so every alert reaches the right person with context about the affected system.
flowchart LR
A[SRE Principles] --> B[Incident Response]
B --> C[Detection]
B --> D[Triage]
B --> E[Mitigation]
B --> F[Resolution]
B --> G[Postmortem]
C --> H[Alerts / Monitoring]
D --> I[Severity Assessment]
E --> J[Rollback / Fix]
style B fill:#e53e3e,color:#fff
Incident Response Process
Severity Levels
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Complete outage, all users affected | 5 min | Database primary down |
| SEV-2 | Major feature degraded, partial outage | 15 min | API latency > 5s |
| SEV-3 | Minor issue, workaround available | 1 hour | UI display bug |
| SEV-4 | Cosmetic, no user impact | Next business day | Documentation error |
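The response times in the table can be encoded so that tooling can tell when a response is overdue. The following is a minimal sketch under that assumption: `RESPONSE_TARGETS`, `response_deadline`, and `is_overdue` are illustrative names, and the SEV-4 target of "next business day" is approximated as 24 hours.

```python
from datetime import datetime, timedelta, timezone

# Response targets taken from the severity table above.
# Assumption: "next business day" for SEV-4 is approximated as 24 hours.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=5),
    "SEV-2": timedelta(minutes=15),
    "SEV-3": timedelta(hours=1),
    "SEV-4": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which someone should have responded."""
    return detected_at + RESPONSE_TARGETS[severity]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """Return True if the response target for this severity has passed."""
    return now > response_deadline(severity, detected_at)


detected = datetime(2024, 6, 20, 10, 30, tzinfo=timezone.utc)
print(response_deadline("SEV-1", detected).isoformat())  # 2024-06-20T10:35:00+00:00
print(is_overdue("SEV-2", detected, detected + timedelta(minutes=20)))  # True
```

Keeping the targets in one mapping means the table and the alerting logic cannot drift apart silently.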
Incident Commander Role
The Incident Commander (IC) coordinates the response and does NOT debug:
# incident_commander.py
class Incident:
    def __init__(self, severity, service, description):
        self.severity = severity
        self.service = service
        self.description = description
        self.timeline = []
        self.ic = None
        self.responders = []
        self.status = "detected"

    def assign_ic(self, name):
        self.ic = name
        self.timeline.append(f"IC assigned: {name}")

    def add_responder(self, name, role):
        self.responders.append({"name": name, "role": role})
        self.timeline.append(f"Responder added: {name} ({role})")

    def update_status(self, status, message):
        self.status = status
        self.timeline.append(f"[{status.upper()}] {message}")

    def summary(self):
        print("=== Incident Summary ===")
        print(f"Severity: {self.severity}")
        print(f"Service: {self.service}")
        print(f"Description: {self.description}")
        print(f"IC: {self.ic}")
        print(f"Status: {self.status}")
        for entry in self.timeline:
            print(f" {entry}")


incident = Incident("SEV-1", "api-gateway", "All requests returning 502")
incident.assign_ic("Alice")
incident.add_responder("Bob", "Networking")
incident.add_responder("Carol", "Backend")
incident.update_status("mitigating", "Rolling back last deployment")
incident.update_status("resolved", "Rollback complete, traffic restored")
incident.summary()
Expected output:
=== Incident Summary ===
Severity: SEV-1
Service: api-gateway
Description: All requests returning 502
IC: Alice
Status: resolved
 IC assigned: Alice
 Responder added: Bob (Networking)
 Responder added: Carol (Backend)
 [MITIGATING] Rolling back last deployment
 [RESOLVED] Rollback complete, traffic restored
Communication Templates
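Generating updates from structured fields keeps every notification in the same shape, whoever writes it. The sketch below is one possible way to do that, assuming the field set used in this section's notification template. `IncidentNotice` and `render` are hypothetical names, not part of PagerDuty or any other tool mentioned here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNotice:
    """Fields for a first incident notification (hypothetical structure)."""

    severity: str
    service: str
    start_time: str
    summary: str
    impact: str
    ic: str
    responders: List[str] = field(default_factory=list)
    status: str = "Investigation ongoing"
    update_minutes: int = 15  # assumed default update cadence

    def render(self) -> str:
        """Return the notification as plain text, one field per line."""
        lines = [
            "INCIDENT NOTIFICATION",
            "---------------------",
            f"Severity: {self.severity}",
            f"Service: {self.service}",
            f"Start Time: {self.start_time}",
            f"Summary: {self.summary}",
            f"Impact: {self.impact}",
            f"IC: {self.ic}",
            f"Responders: {', '.join(self.responders)}",
            f"Status: {self.status}",
            f"Updates: Every {self.update_minutes} minutes in #incident-{self.service}",
        ]
        return "\n".join(lines)


notice = IncidentNotice(
    severity="SEV-1",
    service="api-gateway",
    start_time="2024-06-20 10:30 UTC",
    summary="All requests returning 502",
    impact="100% of users affected",
    ic="@alice",
    responders=["@bob (networking)", "@carol (backend)"],
)
print(notice.render())
```

Because required fields have no defaults, a notice cannot be created without a severity, an impact statement, or an IC.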
# INCIDENT NOTIFICATION
# --------------------
# Severity: SEV-1
# Service: api-gateway
# Start Time: 2024-06-20 10:30 UTC
# Summary: All requests returning 502
# Impact: 100% of users affected
# IC: @alice
# Responders: @bob (networking), @carol (backend)
# Status: Investigation ongoing
# Updates: Every 15 minutes in #incident-api-gateway
# RESOLUTION POSTMORTEM
# --------------------
# What happened: v2.3.1 deployment included a misconfigured upstream
# Root cause: Config change in nginx.conf removed proxy_pass directive
# Detection: Datadog alert "Gateway Error Rate > 1%" fired
# Resolution: Rolled back to v2.3.0
# Time to detection: 2 minutes
# Time to resolution: 12 minutes
On-Call Rotations
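A follow-the-sun rotation only works if the regional shifts cover all 24 hours between them, so it is worth checking a schedule before relying on it. The sketch below checks three 9am-5pm shifts like those listed in this section. It assumes fixed UTC offsets (EST = -5, CET = +1, JST = +9) and ignores daylight saving. `SHIFTS` and the helper names are illustrative.

```python
# Each shift is (team, local start hour, local end hour, UTC offset in hours).
# Assumption: fixed offsets (EST=-5, CET=+1, JST=+9), ignoring daylight saving.
SHIFTS = [
    ("US", 9, 17, -5),
    ("EU", 9, 17, 1),
    ("APAC", 9, 17, 9),
]


def covered_utc_hours(shifts):
    """Return the set of UTC hours (0-23) with at least one team on call."""
    covered = set()
    for _team, start, end, offset in shifts:
        for local_hour in range(start, end):
            # Convert the local hour to UTC and wrap around midnight.
            covered.add((local_hour - offset) % 24)
    return covered


def uncovered_utc_hours(shifts):
    """Return the sorted UTC hours with nobody on call."""
    return sorted(set(range(24)) - covered_utc_hours(shifts))


print(uncovered_utc_hours(SHIFTS))  # [22, 23]
```

Under these assumed offsets the check reports UTC hours 22 and 23 as uncovered, so a real schedule would need a longer shift or an overnight layer.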
Setting Up PagerDuty
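# PagerDuty best practices:
# 1. Create schedules with follow-the-sun rotation
#    - US team: 9am-5pm EST
#    - EU team: 9am-5pm CET
#    - APAC team: 9am-5pm JST
#    Note: as listed, these shifts leave 5pm-7pm EST (22:00-24:00 UTC) uncovered;
#    extend a shift or add an overnight layer to close the gap.
# 2. Set up escalation policies:
#    - Level 1: Primary on-call (escalate after 15 min without acknowledgement)
#    - Level 2: Secondary on-call (escalate after 15 min without acknowledgement)
#    - Level 3: Engineering manager
# 3. Configure service integrations:
#    - Prometheus Alertmanager → PagerDuty Events API v2 integration (routing key)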
#    - Datadog → PagerDuty direct integration
# Alertmanager config for PagerDuty
route:
  receiver: pagerduty-prod
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
        severity: "{{ .CommonLabels.severity }}"
Runbook Structure
Every service must have a runbook:
# Runbook: api-gateway
## Symptoms
- 502 Bad Gateway
- Latency > 5s
- Connection refused
## Initial Checks
1. Check Datadog dashboard: "API Gateway Overview"
2. Check upstream service health
3. Check recent deployments (last 1 hour)
## Common Fixes
### Issue: All 502 errors
1. Identify last deployment: `kubectl rollout history deployment/api-gateway`
2. Rollback: `kubectl rollout undo deployment/api-gateway`
3. Verify: `curl -I https://api.example.com/health`
### Issue: High latency
1. Check upstream response times
2. Scale replicas: `kubectl scale deployment/api-gateway --replicas=10`
3. Check database connection pool
## Escalation
- Primary: Slack #incident-api-gateway
- PagerDuty API on-call phone number
Blameless Postmortem Culture
A blameless postmortem focuses on system failures and process gaps, not individual mistakes.
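Postmortem figures such as duration are less error-prone when derived from the timeline than when typed by hand. The sketch below shows one way to do that, assuming every timeline entry carries an `HH:MM` UTC timestamp on the same day. The event labels and `minutes_between` are hypothetical.

```python
from datetime import datetime

# Hypothetical timeline: event label -> "HH:MM" UTC, all on the same day.
timeline = {
    "alert_fired": "10:30",
    "ic_assigned": "10:31",
    "rollback_initiated": "10:38",
    "traffic_restored": "10:42",
}


def minutes_between(events, start_event, end_event):
    """Return whole minutes elapsed between two timeline events."""
    fmt = "%H:%M"
    start = datetime.strptime(events[start_event], fmt)
    end = datetime.strptime(events[end_event], fmt)
    return int((end - start).total_seconds() // 60)


print(minutes_between(timeline, "alert_fired", "traffic_restored"))  # 12
print(minutes_between(timeline, "rollback_initiated", "traffic_restored"))  # 4
```

Deriving the numbers this way also makes it easy to spot a summary that disagrees with its own timeline.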
# Postmortem Template
## Incident Summary
- Date: 2024-06-20
- Duration: 12 minutes
- Severity: SEV-1
- Services affected: api-gateway
## Timeline
- 10:30 UTC - Alert fired: Gateway error rate > 1%
- 10:31 UTC - IC assigned: @alice
- 10:32 UTC - Identified as deployment rollback candidate
- 10:38 UTC - Rollback initiated
- 10:42 UTC - Traffic restored, all healthy
## Root Cause
A config change in nginx.conf (deployed in v2.3.1) removed the proxy_pass
directive. All requests hit NGINX but had no upstream to forward to.
## Action Items
| Action | Owner | Priority |
|--------|-------|----------|
| Add automated nginx config validation to CI pipeline | @bob | P0 |
| Add "proxy_pass present" check to deployment health check | @carol | P0 |
| Runbook: add step for verifying upstream after deploy | @alice | P1 |
## What Went Well
- Alert fired within 2 minutes of deployment
- Rollback completed within 4 minutes of being initiated
- Communication stayed in the designated channel
## What Went Wrong
- No validation step in CI for nginx configs
- Deployment health check didn't verify upstream connectivity
- On-call had to manually identify the problematic deployment
Common Mistakes
Not declaring an incident early: Teams hesitate to call an incident, wasting time trying to fix it quietly. Declare early — you can always downgrade. Time wasted hesitating is time the system stays down.
No incident commander: Without a designated IC, multiple people independently try fixes, sometimes conflicting. One person coordinates; everyone else executes.
Poor communication: Updates in private DMs instead of a public channel. Stakeholders (support, management) not informed. The incident channel should be public and linked from the status page.
Blaming individuals in postmortems: If a person made a mistake, the system allowed it — there were no guardrails. Fix the process, not the person.
Not updating runbooks after incidents: The same issue recurs because the fix process was never documented. After every incident, update the relevant runbook with the resolution steps.
Practice Questions
What are the five stages of incident response? Answer: Detection (alert fires), Triage (assess severity), Mitigation (stop the bleeding), Resolution (apply permanent fix), Postmortem (learn and prevent).
What is the role of the Incident Commander? Answer: The IC coordinates the response, communicates status, and triages incoming information. They do NOT debug — their job is to enable others to debug effectively.
What is a blameless postmortem? Answer: A post-incident analysis focused on system and process failures, not individual mistakes. The assumption is that people made the best decisions with the information they had at the time.
How does severity classification affect response? Answer: Higher severity (SEV-1) requires immediate response (5 min), 24/7 IC assignment, and executive notification. Lower severity (SEV-4) can wait until the next business day.
Challenge
Design an incident response system for a three-service application (web, API, database): define severity levels for each service, write runbooks for the top 3 failure scenarios per service, set up PagerDuty schedules with follow-the-sun rotation, create a blameless postmortem template, and conduct a tabletop exercise simulating a database failure.
What’s Next
- Testing resilience proactively
- Measuring reliability
Related topics: SRE, Prometheus, Grafana, PagerDuty
Congratulations on completing this Incident Response tutorial! Here’s where to go from here:
- Practice daily — Review your team’s incident response process for gaps
- Build a project — Set up PagerDuty alerts for your monitoring stack
- Explore related topics — Check out chaos engineering and SRE fundamentals
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.