SLIs, SLOs, and SLAs: Measuring Reliability

DodaTech Updated Jun 20, 2026 8 min read

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) form the measurement framework for reliability — quantifying what “healthy” means for your systems and creating accountability.

What You’ll Learn

  • Defining SLIs: latency, availability, error rate, throughput, freshness
  • Setting SLO targets and managing error budgets
  • Understanding SLAs and their relationship to SLOs
  • Multi-window multi-burn-rate alerting for early detection
  • Building reliability dashboards and reporting

Why SLIs, SLOs, and SLAs Matter

Without SLIs, you can’t measure reliability. Without SLOs, you don’t know if it’s good enough. Without SLAs, there’s no accountability. This framework transforms reliability from a vague aspiration into a measurable, actionable practice. DodaTech tracks SLIs for Durga Antivirus Pro’s signature update service — measuring update latency (p99 < 5s), availability (> 99.9%), and freshness (signatures updated within 1 hour of release) to ensure users receive timely protection.

    flowchart LR
    A[SRE & Monitoring] --> B[SLIs / SLOs / SLAs]
    B --> C[SLIs - What We Measure]
    B --> D[SLOs - Our Targets]
    B --> E[SLAs - Customer Contracts]
    C --> F[Latency, Availability, Error Rate]
    D --> G[Error Budgets]
    E --> H[Penalties / Commitments]
    G --> I[Release Decisions]
    style B fill:#38a169,color:#fff
  
Prerequisites: Understanding of SRE principles and monitoring. Familiarity with Prometheus and PromQL.

Service Level Indicators (SLIs)

An SLI is a specific metric that measures one aspect of service quality. Each SLI must be measurable, meaningful to users, and actionable.

Common SLIs

| Category     | SLI                         | Measurement               |
|--------------|-----------------------------|---------------------------|
| Availability | Request success rate        | successful / total * 100  |
| Latency      | Response time at percentile | p50, p95, p99 in ms       |
| Error Rate   | Failed requests             | (5xx + 4xx) / total * 100 |
| Throughput   | Requests per second         | requests / second         |
| Freshness    | Age of last data update     | time() - last_update_time |
| Durability   | Data loss rate              | objects_lost / total_objects |

Measuring SLIs in Prometheus

# Availability SLI: percentage of non-5xx responses
# (the ratio of two sums is already a single value, so no outer avg() is needed)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Latency SLI: p99 response time in seconds
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Error rate SLI: percentage of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
* 100

# Output: Single values representing current reliability
# Availability: 99.87%
# p99 Latency: 0.345s
# Error Rate: 0.13%
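
Freshness, listed among the common SLIs, follows the same "good events / total events" pattern as availability. The sketch below is a minimal illustration in Python, assuming each update is recorded as a hypothetical (released_at, delivered_at) pair and using the 1-hour threshold from the signature update example; the function name and sample data are not from any real service.

```python
from datetime import datetime, timedelta

def freshness_sli(updates, threshold=timedelta(hours=1)):
    """Percentage of updates delivered within `threshold` of their release.

    `updates` is a list of (released_at, delivered_at) datetime pairs.
    """
    if not updates:
        return 100.0  # no events in the window: report as fully fresh
    fresh = sum(1 for released, delivered in updates
                if delivered - released <= threshold)
    return fresh / len(updates) * 100

# Hypothetical sample data: three updates, one delivered late.
t0 = datetime(2026, 1, 1, 12, 0)
updates = [
    (t0, t0 + timedelta(minutes=10)),
    (t0, t0 + timedelta(minutes=55)),
    (t0, t0 + timedelta(minutes=90)),  # exceeds the 1-hour threshold
]
print(f"Freshness SLI: {freshness_sli(updates):.2f}%")  # 66.67%
```

Counting "fresh" events against a threshold, rather than averaging the age, keeps the SLI in the same percentage form as the availability SLI, so it can share the same SLO and error budget machinery.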

Defining Custom SLIs in Code

# sli_calculator.py
from datetime import datetime, timedelta
import random

class SLICalculator:
    def __init__(self, window_minutes=5):
        self.window = timedelta(minutes=window_minutes)
        self.requests = []

    def record(self, success, latency_ms):
        self.requests.append({
            "timestamp": datetime.now(),
            "success": success,
            "latency": latency_ms,
        })

    def availability(self):
        recent = [r for r in self.requests
                  if datetime.now() - r["timestamp"] < self.window]
        if not recent:
            return 100.0
        successes = sum(1 for r in recent if r["success"])
        return successes / len(recent) * 100

    def p99_latency(self):
        recent = sorted(
            [r["latency"] for r in self.requests
             if datetime.now() - r["timestamp"] < self.window]
        )
        if not recent:
            return 0
        idx = int(len(recent) * 0.99)
        return recent[min(idx, len(recent) - 1)]

sli = SLICalculator()
for i in range(1000):
    success = random.random() > 0.005  # 0.5% failure rate
    latency = max(0.0, random.gauss(150, 50))  # mean 150ms, stddev 50ms, clamped at 0
    sli.record(success, latency)

print(f"Availability SLI: {sli.availability():.2f}%")
print(f"p99 Latency SLI: {sli.p99_latency():.0f}ms")

# Example output (values vary between runs because the input is random):
# Availability SLI: 99.50%
# p99 Latency SLI: 267ms

Service Level Objectives (SLOs)

An SLO is a target value for an SLI over a time window. It drives engineering decisions.

| SLI          | SLO Target | Window  |
|--------------|------------|---------|
| Availability | ≥ 99.9%    | 30 days |
| p99 Latency  | ≤ 500ms    | 7 days  |
| Error Rate   | ≤ 1%       | 30 days |

Error Budget

Error budget = 100% - SLO target. It represents the acceptable amount of unreliability.

# error_budget_dashboard.py
class ErrorBudgetDashboard:
    def __init__(self, sli_name, slo_pct, window_days):
        self.sli_name = sli_name
        self.slo = slo_pct
        self.window = window_days
        self.budget = 100 - slo_pct  # Total budget
        self.consumed = 0

    def record_period(self, sli_value_pct):
        error_rate = 100 - sli_value_pct
        self.consumed += error_rate
        # Not clamped at zero: a negative value shows how far the budget is overspent
        remaining = self.budget - self.consumed
        print(f"Period SLI: {sli_value_pct:.2f}%")
        print(f"  Error this period: {error_rate:.2f}%")
        print(f"  Budget remaining: {remaining:.2f}% (consumed {self.consumed:.2f}% of {self.budget:.2f}%)")
        return remaining

dash = ErrorBudgetDashboard("availability", 99.9, 30)
dash.record_period(99.95)  # 0.05% error
dash.record_period(99.80)  # 0.20% error
dash.record_period(99.50)  # 0.50% error - budget critical

Expected output:

Period SLI: 99.95%
  Error this period: 0.05%
  Budget remaining: 0.05% (consumed 0.05% of 0.10%)
Period SLI: 99.80%
  Error this period: 0.20%
  Budget remaining: -0.15% (consumed 0.25% of 0.10%)
Period SLI: 99.50%
  Error this period: 0.50%
  Budget remaining: -0.65% (consumed 0.75% of 0.10%)
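
An error budget expressed as a percentage is often easier to reason about as time. The following sketch converts an availability SLO into allowed "bad" minutes, assuming a time-based SLO where the budget is spread evenly over the window; the function name is illustrative.

```python
def error_budget_minutes(slo_pct, window_days):
    """Allowed unavailable minutes in the window for a time-based availability SLO."""
    budget_fraction = (100 - slo_pct) / 100
    return budget_fraction * window_days * 24 * 60

for slo in (99.0, 99.9, 99.99):
    minutes = error_budget_minutes(slo, 30)
    print(f"SLO {slo}% over 30 days allows {minutes:.1f} minutes of downtime")
```

For a 99.9% SLO this gives about 43.2 minutes per 30 days, which is the same budget as roughly 8.76 hours per year.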

Multi-Window Multi-Burn-Rate Alerts

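A widely used SLO alerting approach evaluates several burn-rate thresholds, each over its own time window. In the rules below, a fast burn measured over 1 hour pages, and a slower burn measured over 6 hours opens a ticket: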

groups:
  - name: slo_alerts
    rules:
      # Page: burning budget fast (> 14x allowed burn rate)
      - alert: SLOHighBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                / sum(rate(http_requests_total[1h]))))
          * 100 > 14 * (100 - 99.9)
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error budget burning > 14x allowed rate"

      # Ticket: burning budget moderately (> 3x allowed burn rate)
      - alert: SLOMediumBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[6h]))
                / sum(rate(http_requests_total[6h]))))
          * 100 > 3 * (100 - 99.9)
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "Error budget burning > 3x allowed rate"
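
Each rule above evaluates one window. A common refinement pairs every long window with a much shorter one and fires only when both exceed the threshold, so the alert clears soon after the service recovers. The Python sketch below shows that condition; the function names and sample error ratios are hypothetical, and burn rate here means the observed error ratio divided by the ratio the SLO allows.

```python
def burn_rate(error_ratio, slo_pct):
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = (100 - slo_pct) / 100
    return error_ratio / allowed

def should_alert(long_error_ratio, short_error_ratio, slo_pct, factor):
    """Fire only when both the long and the short window burn faster than `factor`."""
    return (burn_rate(long_error_ratio, slo_pct) > factor
            and burn_rate(short_error_ratio, slo_pct) > factor)

# Hypothetical measurements for a 99.9% SLO (error ratios, not percentages).
print(should_alert(0.02, 0.03, 99.9, 14))    # both windows burn > 14x -> True
print(should_alert(0.02, 0.0005, 99.9, 14))  # short window has recovered -> False
```

In PromQL the same idea is usually written by joining the long-window and short-window expressions with `and`.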

Burn Rate Explained

A burn rate of 1x means you’ll exhaust the error budget over the full window. A burn rate of 14x means you’d exhaust the budget in 1/14th of the window — roughly 2 days for a 30-day window. This triggers a page.
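
The arithmetic behind these numbers can be checked with a short sketch. It assumes a constant burn rate and a 30-day SLO window; the function names are illustrative.

```python
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the whole budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / burn_rate

def budget_spent_pct(burn_rate, alert_window_hours, window_days=30):
    """Share of the total budget consumed during one alert window."""
    return burn_rate * alert_window_hours / (window_days * 24) * 100

for rate, hours in ((1, 1), (3, 6), (14, 1)):
    print(f"{rate:>2}x burn: budget gone in {days_to_exhaustion(rate):.1f} days, "
          f"{budget_spent_pct(rate, hours):.2f}% of budget spent per {hours}h window")
```

A 14x burn sustained for one hour consumes just under 2% of the 30-day budget, and a 3x burn sustained for six hours consumes 2.5%, which is why both thresholds are worth a human's attention.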

SLAs

An SLA is a contractual commitment to customers, typically looser than your internal SLOs:

| Commitment       | SLO (internal) | SLA (customer) |
|------------------|----------------|----------------|
| Availability     | 99.95%         | 99.9%          |
| Latency          | p99 < 300ms    | p99 < 1s       |
| Support response | 15 min SEV-1   | 1 hour SEV-1   |

Rule: Always set SLOs tighter than SLAs. If your SLA is 99.9%, set your SLO to 99.95%. The buffer (0.05%) is your safety margin.
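
To see what that margin means in practice, the sketch below converts the gap between an availability SLO and SLA into minutes of downtime. It assumes both are availability percentages over the same window; the function name is hypothetical.

```python
def sla_buffer_minutes(slo_pct, sla_pct, window_days=30):
    """Extra downtime the SLA tolerates beyond the internal SLO, in minutes."""
    if slo_pct <= sla_pct:
        raise ValueError("the internal SLO should be stricter than the SLA")
    return (slo_pct - sla_pct) / 100 * window_days * 24 * 60

# SLO 99.95% vs. SLA 99.9% over 30 days
print(f"Buffer: {sla_buffer_minutes(99.95, 99.9):.1f} minutes")  # 21.6 minutes
```

With these targets, the team breaches its internal SLO after about 21.6 minutes of downtime in 30 days and still has another 21.6 minutes before the customer-facing SLA is violated.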

Reporting

# slo_report.py
def generate_slo_report(services):
    print("=== SLO Compliance Report ===")
    print(f"{'Service':<20} {'SLI':<15} {'SLO':<10} {'Actual':<10} {'Status':<10}")
    print("-" * 65)
    for s in services:
        # Latency-style SLIs pass when the actual value is at or below the target;
        # availability-style SLIs pass when it is at or above the target.
        if s.get("lower_is_better", False):
            compliant = s["actual"] <= s["slo"]
        else:
            compliant = s["actual"] >= s["slo"]
        status = "PASS" if compliant else "FAIL"
        print(f"{s['name']:<20} {s['sli']:<15} {s['slo']:<10} "
              f"{s['actual']:<10} {status:<10}")

services = [
    {"name": "api-gateway", "sli": "Availability", "slo": 99.95, "actual": 99.97},
    {"name": "api-gateway", "sli": "p99 Latency", "slo": 500, "actual": 320,
     "lower_is_better": True},
    {"name": "user-db", "sli": "Availability", "slo": 99.99, "actual": 99.98},
    {"name": "cdn", "sli": "Availability", "slo": 99.9, "actual": 99.95},
]
generate_slo_report(services)

Expected output:

=== SLO Compliance Report ===
Service              SLI             SLO        Actual     Status
-----------------------------------------------------------------
api-gateway          Availability    99.95      99.97      PASS
api-gateway          p99 Latency     500        320        PASS
user-db              Availability    99.99      99.98      FAIL
cdn                  Availability    99.9       99.95      PASS

Common Mistakes

  1. Setting SLOs too tight (99.999%): Five nines means 5 minutes of downtime per year. Unless you’re a critical infrastructure provider, 99.9% (8.7 hours/year) is realistic. Start there.

  2. Measuring the wrong SLIs: Internal metrics like database CPU usage are not SLIs. SLIs must measure what users experience — request success rate and response time.

  3. Not using error budgets for deployment decisions: If the budget is full, teams should feel safe deploying. If it’s exhausted, they should focus on stability. Without this feedback, SLOs are just numbers on a dashboard.

  4. Only using one time window for alerts: A single 30-day window doesn’t catch fast burn rates. Multi-window alerts catch both sudden spikes (1h window) and gradual degradation (6h window).

  5. Confusing SLAs with SLOs: SLAs are legal contracts with penalties. SLOs are internal targets. Don’t commit to SLAs that are tighter than your SLOs — you need buffer room.
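
Mistake 3 can be avoided by turning error budget consumption into an explicit release policy. The sketch below is one possible mapping; the 75% and 100% thresholds and the policy names are illustrative assumptions, not values prescribed by this tutorial.

```python
def release_decision(budget_pct, consumed_pct, freeze_at=100.0, caution_at=75.0):
    """Map error budget consumption to a release policy.

    budget_pct   -- total error budget, e.g. 0.10 for a 99.9% SLO
    consumed_pct -- error already accumulated in the current window
    """
    if budget_pct <= 0:
        raise ValueError("budget_pct must be positive")
    used = consumed_pct / budget_pct * 100  # share of the budget already spent
    if used >= freeze_at:
        return "freeze"   # budget exhausted: reliability work only
    if used >= caution_at:
        return "caution"  # slow down: smaller, lower-risk releases
    return "ship"         # plenty of budget left

print(release_decision(0.10, 0.02))  # ship
print(release_decision(0.10, 0.08))  # caution
print(release_decision(0.10, 0.25))  # freeze
```

A check like this could run in a deployment pipeline so that the decision follows the budget automatically instead of being renegotiated for every release.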

Practice Questions

  1. What is an SLI and what are the most common types? Answer: An SLI is a Service Level Indicator — a measurement of service quality. Common types: availability (success rate), latency (response time), error rate, throughput, freshness.

  2. How do SLOs relate to error budgets? Answer: Error budget = 100% - SLO target. It represents how much unreliability is acceptable. If SLO is 99.9%, the error budget is 0.1% (about 43 minutes in 30 days, or roughly 8.7 hours per year).

  3. What is multi-window multi-burn-rate alerting? Answer: An alerting strategy using multiple time windows (1h, 6h) to detect different burn rates. Fast burn (14x) triggers a page. Slow burn (3x) creates a ticket. This catches both sudden and gradual budget consumption.

  4. Why should SLOs be tighter than SLAs? Answer: The SLO-SLA buffer protects against measurement errors, edge cases, and reporting delays. If SLA is 99.9% and you barely meet it, you risk violations during outage recovery.

Challenge

Define SLIs, SLOs, and error budgets for an e-commerce platform: identify 3 user-facing SLIs (checkout success rate, product page latency, search availability), set realistic SLO targets, create Grafana dashboard panels for each SLI, configure multi-window burn-rate alerts in Prometheus, and build a weekly SLO compliance report.

FAQ

What is the difference between SLI and SLO?
: SLI is the measurement (what). SLO is the target (how good). Example: “p99 latency” is an SLI. “p99 latency < 500ms” is an SLO.
How many SLIs should each service have?
: 3-5 per service. Too few misses important aspects. Too many creates overhead and alert fatigue. Focus on availability, latency, and one service-specific SLI.
What happens when an error budget is exhausted?
: Teams should stop shipping new features and focus on reliability improvements. Some organizations automatically block deployments when the budget is exhausted.
Can SLOs change over time?
: Yes. Start with looser SLOs and tighten as reliability improves. Revisit SLOs quarterly. Don’t change them reactively during incidents.
What is the cost of tracking SLIs?
: Minimal with Prometheus and Grafana — they’re free. The cost is the engineering time to define, measure, and respond to SLO breaches. This is an investment, not an expense.

Mini Project: SLO Dashboard Generator

# slo_dashboard_generator.py
import json

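def generate_grafana_panel(title, expr, slo_target, unit="percent",
                           higher_is_better=True, margin=0.1):
    # Simplified, illustrative panel structure; not the full Grafana panel schema.
    # Green marks values that meet the SLO; yellow marks values that miss it
    # by less than `margin`.
    if higher_is_better:
        op, warn = "gt", slo_target - margin
    else:
        op, warn = "lt", slo_target + margin
    return {
        "title": title,
        "type": "graph",
        "unit": unit,
        # The queries aggregate away per-instance labels, so use the title as legend
        "targets": [{"expr": expr, "legendFormat": title}],
        "thresholds": [
            {"value": slo_target, "color": "green", "op": op},
            {"value": warn, "color": "yellow", "op": op},
        ],
    }

dashboard = {
    "title": "SLO Overview",
    "panels": [
        generate_grafana_panel("Availability",
            'sum(rate(http_requests_total{status!~"5.."}[5m])) '
            '/ sum(rate(http_requests_total[5m])) * 100', 99.9),
        generate_grafana_panel("p99 Latency",
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
            0.5, "seconds", higher_is_better=False),
    ]
}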
print(json.dumps(dashboard, indent=2))

What’s Next

| Topic             | Description                      |
|-------------------|----------------------------------|
| SRE Guide         | SRE practices and implementation |
| Chaos Engineering | Testing resilience               |

Related topics: SRE, Prometheus, Grafana, SLI


Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
