Learn DevOps: Site Reliability Engineering (SRE): Complete Guide

DodaTech Updated Jun 20, 2026 7 min read

Site Reliability Engineering (SRE) applies software engineering principles to operations problems — treating infrastructure as code, automating away toil, and using data-driven decisions to balance reliability with feature velocity.

What You’ll Learn

Core SRE principles: error budgets, SLIs/SLOs, and toil reduction
Implementing automation for operational tasks
Incident management and capacity planning practices
How SRE differs from DevOps and how to adopt SRE

Why SRE Matters

Traditional operations teams react to problems. SRE teams proactively design systems for reliability. When Google introduced SRE, they found that automating operational work reduced incidents by 80% while allowing the same team to manage 10x more infrastructure. DodaTech implements SRE practices for Durga Antivirus Pro’s update infrastructure — measuring SLIs, tracking error budgets, and automating deployment pipelines to reduce manual toil.

    flowchart LR
    A[DevOps & Monitoring] --> B[SRE]
    B --> C[SLIs & SLOs]
    B --> D[Error Budgets]
    B --> E[Toil Reduction]
    B --> F[Incident Response]
    C --> G[Measure Reliability]
    D --> H[Balance Velocity vs Stability]
    E --> I[Automate Operations]
    style B fill:#526cf7,color:#fff

Prerequisites: Familiarity with DevOps practices, CI/CD pipelines, and monitoring concepts. Experience with Linux and Bash.

SRE vs DevOps

SRE and DevOps share goals but differ in approach:

Aspect	DevOps	SRE
Origin	Industry movement	Google’s internal practice
Focus	Culture, collaboration, automation	Reliability, SLIs, error budgets
Key metric	Deployment frequency	Uptime / Error budget
Failure handling	Blameless culture	Blameless + error budget burn
Team structure	Dev + Ops merged	Software engineers writing operations code

SLIs, SLOs, and Error Budgets

An SLI (Service Level Indicator) measures a specific aspect of reliability. An SLO (Service Level Objective) sets a target for that SLI. The error budget is the acceptable amount of unreliability — 100% minus the SLO target.

# error_budget.py
class ErrorBudget:
    def __init__(self, slo_percent, window_days):
        self.slo = slo_percent
        self.window = window_days
        self.total_requests = 0
        self.failed_requests = 0

    def record_success(self):
        self.total_requests += 1

    def record_failure(self):
        self.total_requests += 1
        self.failed_requests += 1

    def availability(self):
        if self.total_requests == 0:
            return 100.0
        return (1 - self.failed_requests / self.total_requests) * 100

    def budget_remaining(self):
        avail = self.availability()
        error_rate = 100 - avail
        max_error = 100 - self.slo
        remaining = max(0, max_error - error_rate)
        return remaining / max_error * 100 if max_error > 0 else 0

    def is_budget_exhausted(self):
        return self.budget_remaining() <= 0

budget = ErrorBudget(slo_percent=99.9, window_days=30)
for i in range(10000):
    if i % 50 == 0:  # Simulate 2% failure rate
        budget.record_failure()
    else:
        budget.record_success()

print(f"Current availability: {budget.availability():.3f}%")
print(f"SLO target: {budget.slo}%")
print(f"Error budget remaining: {budget.budget_remaining():.1f}%")

# Output:
# Current availability: 98.000%
# SLO target: 99.9%
# Error budget remaining: 0.0%
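The class above stores `window_days` but tracks the budget only in requests. A time-based view is often easier to reason about. The following is a minimal sketch, assuming the budget is spent as full-outage minutes over the window; the function name `allowed_downtime_minutes` is illustrative and not part of the original script.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo, 30)
    print(f"{slo}% over 30 days: {minutes:.1f} minutes")

# Output:
# 99.0% over 30 days: 432.0 minutes
# 99.9% over 30 days: 43.2 minutes
# 99.99% over 30 days: 4.3 minutes
```

Each extra nine cuts the allowance by a factor of ten, which is why the SLO target is a cost decision as much as a reliability one.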

Defining SLIs

# Availability SLI: percentage of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Latency SLI: p99 response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Freshness SLI: age of last successful data sync
time() - max(last_sync_timestamp_seconds)

# Output: each query returns a single value; sum() without a by clause drops all labels
# {} 99.87
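To make the arithmetic behind the availability and latency queries concrete, here is a rough Python sketch that computes the same two SLIs from raw samples. The sample data and function names are hypothetical. It uses a nearest-rank percentile, whereas PromQL's `histogram_quantile` interpolates within histogram buckets, so real results will differ slightly.

```python
import math


def availability_sli(status_codes):
    """Percentage of requests that did not return a 5xx status."""
    if not status_codes:
        return 100.0
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / len(status_codes) * 100


def latency_percentile(latencies_s, pct):
    """Nearest-rank percentile of latency samples, in seconds."""
    ordered = sorted(latencies_s)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical samples: 100 requests, two of them server errors.
codes = [200] * 97 + [404] + [500, 503]
latencies = [0.05] * 98 + [0.40, 1.20]

print(f"Availability: {availability_sli(codes):.2f}%")
print(f"p99 latency: {latency_percentile(latencies, 99):.2f}s")

# Output:
# Availability: 98.00%
# p99 latency: 0.40s
```

Note that the 404 counts as a good event here, matching the `status!~"5.."` filter in the query above: client errors do not spend the error budget.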

Multi-Window Multi-Burn-Rate Alerts

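Advanced SRE alerting uses multiple windows to detect error budget burn early. A burn rate of 1 means the budget would last exactly the SLO window; a burn rate of 14 means it is being spent 14 times faster. The rules below are a simplified version: each one checks a single window and relies on `for:` to damp noise. A full multi-window rule additionally requires a shorter window (for example, 5m alongside 1h) to exceed the same threshold, so the alert clears quickly once the problem stops.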

groups:
  - name: sre_alerts
    rules:
      # Page if burn rate is high (14x faster than budget allows)
      - alert: PageHighBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                / sum(rate(http_requests_total[1h]))))
          > 14 * (1 - 0.999)  # 14x the allowed error rate
        for: 5m
        labels:
          severity: critical

      # Ticket if burn rate is moderate
      - alert: TicketMediumBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[6h]))
                / sum(rate(http_requests_total[6h]))))
          > 3 * (1 - 0.999)
        for: 30m
        labels:
          severity: warning
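The thresholds in these rules are easier to judge once the burn rate is translated into budget spent. The sketch below is illustrative only: it assumes a 30-day SLO window, and the function names are hypothetical. `should_page` shows the two-window condition, in which both a long and a short window must exceed the threshold.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget spent after `hours` at `rate`."""
    return rate * hours / (window_days * 24)


def should_page(long_err: float, short_err: float, slo: float, threshold: float) -> bool:
    """Multi-window check: long and short windows must both exceed the threshold."""
    return (burn_rate(long_err, slo) > threshold
            and burn_rate(short_err, slo) > threshold)


SLO = 0.999
print(f"{budget_consumed(14, 1):.1%} of budget spent in 1h at 14x")
print(f"{budget_consumed(3, 6):.1%} of budget spent in 6h at 3x")
print(should_page(long_err=0.02, short_err=0.03, slo=SLO, threshold=14))
print(should_page(long_err=0.02, short_err=0.001, slo=SLO, threshold=14))

# Output:
# 1.9% of budget spent in 1h at 14x
# 2.5% of budget spent in 6h at 3x
# True
# False
```

The last call returns False because the short window has already recovered, which is the case where a single-window rule would keep paging after the incident is over.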

Toil Reduction

Toil is manual, repetitive, automatable work. SRE teams aim to keep toil under 50% of their time.

# toil_tracker.py
class ToilTracker:
    def __init__(self):
        self.activities = []

    def log(self, task, category, duration_minutes, automatable):
        self.activities.append({
            "task": task,
            "category": category,
            "minutes": duration_minutes,
            "automatable": automatable,
        })

    def report(self):
        total = sum(a["minutes"] for a in self.activities)
        toil = sum(a["minutes"] for a in self.activities if a["automatable"])
        print(f"Total ops time: {total} minutes")
        print(f"Toil (automatable): {toil} minutes ({toil/total*100:.0f}%)")
        print(f"Engineering work: {total - toil} minutes ({(total-toil)/total*100:.0f}%)")

tracker = ToilTracker()
tracker.log("Restart crashed service", "incident", 5, True)
tracker.log("Deploy new release", "deploy", 15, True)
tracker.log("Review architecture doc", "design", 45, False)
tracker.log("Answer team slack question", "support", 10, True)
tracker.report()

# Output:
# Total ops time: 75 minutes
# Toil (automatable): 30 minutes (40%)
# Engineering work: 45 minutes (60%)
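The tracker reports total toil but not where it comes from. One possible extension, sketched here with a hypothetical helper `top_toil_sources`, groups automatable minutes by category so the largest source can be automated first. It takes the same entries as plain dictionaries rather than modifying the class above.

```python
from collections import defaultdict


def top_toil_sources(activities):
    """Sum automatable minutes per category, largest first."""
    minutes = defaultdict(int)
    for activity in activities:
        if activity["automatable"]:
            minutes[activity["category"]] += activity["minutes"]
    return sorted(minutes.items(), key=lambda item: item[1], reverse=True)


# Same sample entries as the tracker example, as plain dicts.
activities = [
    {"task": "Restart crashed service", "category": "incident", "minutes": 5, "automatable": True},
    {"task": "Deploy new release", "category": "deploy", "minutes": 15, "automatable": True},
    {"task": "Review architecture doc", "category": "design", "minutes": 45, "automatable": False},
    {"task": "Answer team slack question", "category": "support", "minutes": 10, "automatable": True},
]

for category, mins in top_toil_sources(activities):
    print(f"{category}: {mins} minutes")

# Output:
# deploy: 15 minutes
# support: 10 minutes
# incident: 5 minutes
```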

Capacity Planning

SRE teams forecast resource needs based on traffic growth:

# capacity_planner.py
def forecast_capacity(current_usage, growth_rate_pct, months):
    result = []
    for m in range(months + 1):
        projected = current_usage * (1 + growth_rate_pct / 100) ** m
        result.append({"month": m, "usage": round(projected, 1)})
    return result

plan = forecast_capacity(1000, 15, 12)
print("=== Capacity Forecast (15% monthly growth) ===")
for p in plan:
    arrow = " ← NOW" if p["month"] == 0 else ""
    print(f"  Month {p['month']:>2}: {p['usage']:>8.1f} units{arrow}")

Expected output (abridged; the script prints every month from 0 to 12):

=== Capacity Forecast (15% monthly growth) ===
  Month  0:   1000.0 units ← NOW
  Month  1:   1150.0 units
  Month  2:   1322.5 units
  ...
  Month  6:   2313.1 units
  ...
  Month 12:   5350.3 units
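A forecast is most useful when it answers "when do we run out?". The following sketch inverts the compound-growth formula to estimate how many months remain before usage reaches a capacity limit. The function name and the limit of 3000 units are assumptions chosen for illustration.

```python
import math


def months_until_limit(current_usage, growth_rate_pct, capacity_limit):
    """Whole months until compound growth first reaches capacity_limit."""
    if current_usage >= capacity_limit:
        return 0
    growth = 1 + growth_rate_pct / 100
    if growth <= 1:
        return None  # flat or shrinking usage never reaches the limit
    return math.ceil(math.log(capacity_limit / current_usage) / math.log(growth))


print(months_until_limit(1000, 15, 3000))

# Output:
# 8
```

At 15% monthly growth, usage is about 2660 units in month 7 and about 3059 in month 8, so procurement would need to land before month 8.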

SRE Implementation Roadmap

1. Identify the most critical service: start with one service, not all of them.
2. Define 2-3 SLIs: availability, latency, and a service-specific SLI.
3. Set an SLO target: 99.9% for most services, 99.99% for critical ones.
4. Measure error budget: track the remaining budget in a dashboard.
5. Automate the top toil source: identify the most time-consuming manual task.
6. Implement multi-window alerts: page only when the error budget is at risk.
7. Conduct blameless postmortems: learn from incidents without blame.

Common Mistakes

Setting SLOs too high (99.999%): Four nines (99.99%) allows about 52.6 minutes of downtime per year. Five nines allows about 5.3 minutes. Most services don’t need five nines. Start with 99.9%.
Not tracking error budget for decision making: If the error budget is full, teams should feel empowered to deploy. If it’s exhausted, deployments should slow down. Without this feedback loop, SLOs are just numbers.
Measuring SLIs of non-user-facing systems: Database CPU usage is an infrastructure metric, not an SLI. Measure what users experience: request success rate, latency, and availability.
Automating everything immediately: Some manual processes should stay manual if they happen rarely. Focus automation on high-frequency, high-toil tasks first.
Treating SRE as a separate team that “fixes things”: SRE is a practice, not a silo. Every team should own its reliability. SRE teams enable and consult; they do not take over operations.
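The feedback loop described in the second mistake above can be sketched as a small release gate. This is only one possible policy: the function name and the 25% caution threshold are assumptions, and a real team would set its own thresholds in an error budget policy.

```python
def release_decision(budget_remaining_pct: float, caution_below: float = 25.0) -> str:
    """Map remaining error budget (0-100%) to a release posture."""
    if budget_remaining_pct <= 0:
        return "freeze"      # budget exhausted: reliability work only
    if budget_remaining_pct < caution_below:
        return "slow down"   # budget nearly spent: fewer, smaller releases
    return "ship"            # healthy budget: deploy normally


for remaining in (80.0, 10.0, 0.0):
    print(f"{remaining:>5.1f}% remaining -> {release_decision(remaining)}")

# Output:
#  80.0% remaining -> ship
#  10.0% remaining -> slow down
#   0.0% remaining -> freeze
```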

Practice Questions

What is the difference between SRE and DevOps? Answer: DevOps is a cultural movement about collaboration between dev and ops. SRE is a specific implementation of DevOps principles using error budgets, SLIs, and software engineering for operations.
What is an error budget and how is it used? Answer: Error budget = 100% - SLO target. It represents the acceptable amount of unreliability. When the budget is exhausted, teams stop shipping new features and focus on reliability.
What is toil and why should SRE teams reduce it? Answer: Toil is manual, repetitive, automatable work (restarting services, manual deployments, answering repetitive questions). It burns out engineers and doesn’t scale. SRE teams should keep toil under 50% of time.
How do multi-window multi-burn-rate alerts work? Answer: They use short windows (1h) to detect fast burn rates and long windows (6h) to detect slow burn rates. Fast burn pages immediately; slow burn creates tickets. This catches both sudden outages and gradual degradation.

Challenge

Implement SRE for a sample web service: define availability and latency SLIs, set a 99.9% SLO, create a Grafana dashboard showing error budget burn rate, write Prometheus alerting rules for high and medium burn rates, identify and automate the top toil source, and simulate an incident that exhausts the error budget.

FAQ

Do I need an SRE team to practice SRE?

No. Start by applying SRE principles (SLIs, SLOs, error budgets, toil reduction) within your existing DevOps team. Dedicated SRE teams are for organizations with enough scale to justify them.

Is SRE only for large companies?

No. Google created SRE for its own scale, but the principles apply at any scale. A startup can define SLIs, set SLOs, and track error budgets with free tools.

What is the difference between an SLA and an SLO?

An SLO is an internal target for your team. An SLA is a contractual commitment to customers. SLAs are typically looser than SLOs so that the team has a safety margin: for example, a 99.9% SLA might be backed by a 99.95% internal SLO.

How much automation is enough?

Aim to keep toil at or below 50% of the team’s time. If your team spends more than half its time on manual operations, prioritize automation. If it is under 50%, focus on engineering improvements.

What is the most important SRE metric?

Error budget burn rate. It tells you whether you are spending your reliability budget too fast and need to act immediately.