Site Reliability Engineering (SRE): Complete Guide
Site Reliability Engineering (SRE) applies software engineering principles to operations problems — treating infrastructure as code, automating away toil, and using data-driven decisions to balance reliability with feature velocity.
What You’ll Learn
- Core SRE principles: error budgets, SLIs/SLOs, and toil reduction
- Implementing automation for operational tasks
- Incident management and capacity planning practices
- How SRE differs from DevOps and how to adopt SRE
Why SRE Matters
Traditional operations teams react to problems. SRE teams proactively design systems for reliability. Google, which introduced SRE, reportedly found that automating operational work reduced incidents by 80% while allowing the same team to manage 10x more infrastructure. DodaTech implements SRE practices for Durga Antivirus Pro’s update infrastructure: measuring SLIs, tracking error budgets, and automating deployment pipelines to reduce manual toil.
flowchart LR
A[DevOps & Monitoring] --> B[SRE]
B --> C[SLIs & SLOs]
B --> D[Error Budgets]
B --> E[Toil Reduction]
B --> F[Incident Response]
C --> G[Measure Reliability]
D --> H[Balance Velocity vs Stability]
E --> I[Automate Operations]
style B fill:#526cf7,color:#fff
SRE vs DevOps
SRE and DevOps share goals but differ in approach:
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Industry movement | Google’s internal practice |
| Focus | Culture, collaboration, automation | Reliability, SLIs, error budgets |
| Key metric | Deployment frequency | Uptime / Error budget |
| Failure handling | Blameless culture | Blameless + error budget burn |
| Team structure | Dev + Ops merged | Software engineers writing operations code |
SLIs, SLOs, and Error Budgets
An SLI (Service Level Indicator) measures a specific aspect of reliability. An SLO (Service Level Objective) sets a target for that SLI. The error budget is the acceptable amount of unreliability — 100% minus the SLO target.
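One way to make the error budget concrete is to convert it into time. The sketch below is a minimal illustration, assuming the whole budget is spent as full outages; the function name `allowed_downtime_minutes` is hypothetical and not part of this guide's own code.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full outage the error budget permits over the window."""
    window_minutes = window_days * 24 * 60
    budget_fraction = (100 - slo_percent) / 100  # e.g. 0.001 for a 99.9% SLO
    return window_minutes * budget_fraction


for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo, window_days=30)
    print(f"SLO {slo}% over 30 days -> {minutes:.1f} minutes of downtime")
# Output:
# SLO 99.0% over 30 days -> 432.0 minutes of downtime
# SLO 99.9% over 30 days -> 43.2 minutes of downtime
# SLO 99.99% over 30 days -> 4.3 minutes of downtime
```

Each extra nine cuts the budget by a factor of ten, which is why the target should be chosen deliberately rather than by default.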
# error_budget.py
class ErrorBudget:
    """Track request outcomes against an availability SLO."""

    def __init__(self, slo_percent, window_days):
        self.slo = slo_percent
        self.window = window_days  # stored for reference; not used in the math below
        self.total_requests = 0
        self.failed_requests = 0

    def record_success(self):
        self.total_requests += 1

    def record_failure(self):
        self.total_requests += 1
        self.failed_requests += 1

    def availability(self):
        # With no traffic recorded yet, report full availability.
        if self.total_requests == 0:
            return 100.0
        return (1 - self.failed_requests / self.total_requests) * 100

    def budget_remaining(self):
        """Percentage of the error budget still unspent (0-100)."""
        avail = self.availability()
        error_rate = 100 - avail
        max_error = 100 - self.slo
        remaining = max(0, max_error - error_rate)
        return remaining / max_error * 100 if max_error > 0 else 0

    def is_budget_exhausted(self):
        return self.budget_remaining() <= 0


budget = ErrorBudget(slo_percent=99.9, window_days=30)
for i in range(10000):
    if i % 50 == 0:  # Simulate 2% failure rate
        budget.record_failure()
    else:
        budget.record_success()

print(f"Current availability: {budget.availability():.3f}%")
print(f"SLO target: {budget.slo}%")
print(f"Error budget remaining: {budget.budget_remaining():.1f}%")
# Output:
# Current availability: 98.000%
# SLO target: 99.9%
# Error budget remaining: 0.0%

Defining SLIs
# Availability SLI: percentage of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100
# Latency SLI: p99 response time in seconds
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Freshness SLI: age in seconds of the last successful data sync
time() - max(last_sync_timestamp_seconds)
# Output: each query returns a single value per evaluation.
# sum() and max() aggregate away labels such as instance, for example:
# {} 99.87

Multi-Window Multi-Burn-Rate Alerts
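Burn-rate alerting compares the observed error rate with the rate the SLO allows. The rules below pair a 1-hour window with a high threshold (page) and a 6-hour window with a lower threshold (ticket), so both fast and slow error budget burn are detected early: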
groups:
  - name: sre_alerts
    rules:
      # Page if burn rate is high (14x faster than budget allows)
      - alert: PageHighBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))))
          > 14 * (1 - 0.999)  # 14x the allowed error rate
        for: 5m
        labels:
          severity: critical
      # Ticket if burn rate is moderate
      - alert: TicketMediumBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))))
          > 3 * (1 - 0.999)
        for: 30m
        labels:
          severity: warning

Toil Reduction
Toil is manual, repetitive, automatable work. SRE teams aim to keep toil under 50% of their time.
# toil_tracker.py
class ToilTracker:
    """Log operational tasks and report how much of the time is toil."""

    def __init__(self):
        self.activities = []

    def log(self, task, category, duration_minutes, automatable):
        self.activities.append({
            "task": task,
            "category": category,
            "minutes": duration_minutes,
            "automatable": automatable,
        })

    def report(self):
        total = sum(a["minutes"] for a in self.activities)
        if total == 0:  # nothing logged yet; avoid dividing by zero
            print("No activities logged")
            return
        # Automatable work counts as toil; everything else is engineering work.
        toil = sum(a["minutes"] for a in self.activities if a["automatable"])
        print(f"Total ops time: {total} minutes")
        print(f"Toil (automatable): {toil} minutes ({toil/total*100:.0f}%)")
        print(f"Engineering work: {total - toil} minutes ({(total-toil)/total*100:.0f}%)")


tracker = ToilTracker()
tracker.log("Restart crashed service", "incident", 5, True)
tracker.log("Deploy new release", "deploy", 15, True)
tracker.log("Review architecture doc", "design", 45, False)
tracker.log("Answer team Slack question", "support", 10, True)
tracker.report()
# Output:
# Total ops time: 75 minutes
# Toil (automatable): 30 minutes (40%)
# Engineering work: 45 minutes (60%)

Capacity Planning
SRE teams forecast resource needs based on traffic growth:
# capacity_planner.py
def forecast_capacity(current_usage, growth_rate_pct, months):
    """Project usage for months 0..months under compound monthly growth."""
    result = []
    for m in range(months + 1):
        projected = current_usage * (1 + growth_rate_pct / 100) ** m
        result.append({"month": m, "usage": round(projected, 1)})
    return result


plan = forecast_capacity(1000, 15, 12)
print("=== Capacity Forecast (15% monthly growth) ===")
for p in plan:
    arrow = " ← NOW" if p["month"] == 0 else ""
    print(f" Month {p['month']:>2}: {p['usage']:>8.1f} units{arrow}")

Expected output (abridged; the script prints every month from 0 to 12):
=== Capacity Forecast (15% monthly growth) ===
 Month  0:   1000.0 units ← NOW
 Month  1:   1150.0 units
 Month  2:   1322.5 units
 ...
 Month  6:   2313.1 units
 ...
 Month 12:   5350.3 units

SRE Implementation Roadmap
- Identify the most critical service — Start with one service, not all
- Define 2-3 SLIs — Availability, latency, and a service-specific SLI
- Set an SLO target — 99.9% for most services, 99.99% for critical ones
- Measure error budget — Track remaining budget in a dashboard
- Automate the top toil source — Identify the most time-consuming manual task
- Implement multi-window alerts — Page only when error budget is at risk
- Conduct blameless postmortems — Learn from incidents without blame
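The "define SLIs, then set an SLO target" steps above can be sketched as a good-events ratio. This is a minimal illustration, not this guide's own method: it expresses latency as the share of requests under a threshold rather than as a p99 value, and the names `good_event_ratio` and `latency_sli`, the sample durations, and the 0.5-second threshold are all hypothetical.

```python
def good_event_ratio(good: int, total: int) -> float:
    """Return the SLI as a percentage; an empty window counts as fully good."""
    return 100.0 if total == 0 else good / total * 100


def latency_sli(durations_s: list[float], threshold_s: float) -> float:
    """Percentage of requests that finished within threshold_s seconds."""
    fast = sum(1 for d in durations_s if d <= threshold_s)
    return good_event_ratio(fast, len(durations_s))


# Hypothetical sample: nine fast requests and one slow one.
sample = [0.12, 0.08, 0.30, 0.25, 0.11, 0.09, 0.45, 0.20, 0.15, 1.80]
sli = latency_sli(sample, threshold_s=0.5)
slo_target = 99.9  # hypothetical target, matching the roadmap's default
status = "met" if sli >= slo_target else "missed"
print(f"Latency SLI: {sli:.1f}% (SLO {slo_target}%) -> {status}")
# Output:
# Latency SLI: 90.0% (SLO 99.9%) -> missed
```

Expressing every SLI as good events over total events keeps availability and latency on the same scale, so one error budget calculation works for both.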
Common Mistakes
Setting SLOs too high (99.999%): Four nines (99.99%) allows about 52 minutes of downtime per year. Five nines allows about 5 minutes. Most services don’t need five nines. Start with 99.9%.
Not tracking error budget for decision making: If the error budget is full, teams should feel empowered to deploy. If it’s exhausted, deployments should slow down. Without this feedback loop, SLOs are just numbers.
Measuring SLIs of non-user-facing systems: Database CPU usage is an infrastructure metric, not an SLI, so it makes a poor basis for an SLO. Measure what users experience: request success rate, latency, availability.
Automating everything immediately: Some manual processes should stay manual if they happen rarely. Focus automation on high-frequency, high-toil tasks first.
Treating SRE as a separate team that “fixes things”: SRE is a practice, not a silo. Every team should own its reliability. SRE teams enable and consult rather than take over operations.
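The feedback loop described under "Not tracking error budget for decision making" can be sketched as a simple release gate. This is a minimal sketch under stated assumptions: the function name `release_decision` and the 25% caution threshold are hypothetical, and a real policy would be agreed between the product and reliability owners.

```python
def release_decision(budget_remaining_pct: float,
                     caution_below_pct: float = 25.0) -> str:
    """Map remaining error budget (0-100) to a release posture."""
    if budget_remaining_pct <= 0:
        return "freeze"      # budget spent: reliability work only
    if budget_remaining_pct < caution_below_pct:
        return "slow down"   # budget low: smaller, less frequent releases
    return "ship"            # budget healthy: normal release cadence


for remaining in (80.0, 10.0, 0.0):
    print(f"{remaining:>5.1f}% budget left -> {release_decision(remaining)}")
# Output:
#  80.0% budget left -> ship
#  10.0% budget left -> slow down
#   0.0% budget left -> freeze
```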
Practice Questions
What is the difference between SRE and DevOps? Answer: DevOps is a cultural movement about collaboration between dev and ops. SRE is a specific implementation of DevOps principles using error budgets, SLIs, and software engineering for operations.
What is an error budget and how is it used? Answer: Error budget = 100% - SLO target. It represents the acceptable amount of unreliability. When the budget is exhausted, teams stop shipping new features and focus on reliability.
What is toil and why should SRE teams reduce it? Answer: Toil is manual, repetitive, automatable work (restarting services, manual deployments, answering repetitive questions). It burns out engineers and doesn’t scale. SRE teams should keep toil under 50% of time.
How do multi-window multi-burn-rate alerts work? Answer: They use short windows (1h) to detect fast burn rates and long windows (6h) to detect slow burn rates. Fast burn pages immediately; slow burn creates tickets. This catches both sudden outages and gradual degradation.
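The arithmetic behind the burn-rate answer above can be sketched as follows. The 14x and 3x thresholds and the 1h and 6h windows come from the alert rules earlier in this guide; the 30-day SLO window and the helper names are assumptions made for illustration.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate holds."""
    return slo_window_days * 24 / rate


def budget_spent_pct(rate: float, alert_window_hours: float,
                     slo_window_days: int = 30) -> float:
    """Share of the window's budget consumed during one alert window."""
    return rate * alert_window_hours / (slo_window_days * 24) * 100


for rate, window_h in ((14, 1), (3, 6)):
    print(f"{rate}x over {window_h}h: "
          f"{budget_spent_pct(rate, window_h):.1f}% of budget spent, "
          f"exhausted in {hours_to_exhaustion(rate):.0f}h at this pace")
# Output:
# 14x over 1h: 1.9% of budget spent, exhausted in 51h at this pace
# 3x over 6h: 2.5% of budget spent, exhausted in 240h at this pace
```

A 1.4% error rate against a 99.9% SLO gives `burn_rate(0.014, 0.999)` of roughly 14, which is the level the paging rule watches for.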
Challenge
Implement SRE for a sample web service: define availability and latency SLIs, set a 99.9% SLO, create a Grafana dashboard showing error budget burn rate, write Prometheus alerting rules for high and medium burn rates, identify and automate the top toil source, and simulate an incident that exhausts the error budget.
Related topics: Prometheus, Grafana, SLI, Incident Response
What’s Next
Congratulations on completing this SRE tutorial! Here’s where to go from here:
- Practice daily — Define SLIs for one of your services today
- Build a project — Create an error budget dashboard for your team
- Explore related topics — Check out incident response and chaos engineering
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.