SLIs, SLOs, and SLAs: Measuring Reliability
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) form the measurement framework for reliability — quantifying what “healthy” means for your systems and creating accountability.
What You’ll Learn
- Defining SLIs: latency, availability, error rate, throughput, freshness
- Setting SLO targets and managing error budgets
- Understanding SLAs and their relationship to SLOs
- Multi-window multi-burn-rate alerting for early detection
- Building reliability dashboards and reporting
Why SLIs, SLOs, and SLAs Matter
Without SLIs, you can’t measure reliability. Without SLOs, you don’t know if it’s good enough. Without SLAs, there’s no accountability. This framework transforms reliability from a vague aspiration into a measurable, actionable practice. DodaTech tracks SLIs for Durga Antivirus Pro’s signature update service — measuring update latency (p99 < 5s), availability (> 99.9%), and freshness (signatures updated within 1 hour of release) to ensure users receive timely protection.
flowchart LR
A[SRE & Monitoring] --> B[SLIs / SLOs / SLAs]
B --> C[SLIs - What We Measure]
B --> D[SLOs - Our Targets]
B --> E[SLAs - Customer Contracts]
C --> F[Latency, Availability, Error Rate]
D --> G[Error Budgets]
E --> H[Penalties / Commitments]
G --> I[Release Decisions]
style B fill:#38a169,color:#fff
Service Level Indicators (SLIs)
An SLI is a specific metric that measures one aspect of service quality. Each SLI must be measurable, meaningful to users, and actionable.
Common SLIs
| Category | SLI | Measurement |
|---|---|---|
| Availability | Request success rate | successful / total * 100 |
| Latency | Response time at percentile | p50, p95, p99 in ms |
| Error Rate | Failed requests | 5xx / total * 100 |
| Throughput | Requests per second | requests / second |
| Freshness | Age of last data update | time() - last_update_time |
| Durability | Data loss rate | objects_lost / total_objects |
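The freshness and durability rows are the two formulas in the table that request counters do not cover, so a minimal Python sketch of them may help. The function names and sample values here are illustrative assumptions, not part of any monitoring library.

```python
from datetime import datetime, timedelta, timezone


def freshness_seconds(last_update: datetime, now: datetime) -> float:
    """Freshness SLI: age of the most recent data update, in seconds."""
    return (now - last_update).total_seconds()


def durability_loss_rate(objects_lost: int, total_objects: int) -> float:
    """Durability SLI: fraction of stored objects that were lost."""
    if total_objects == 0:
        return 0.0  # nothing stored, so nothing can be lost
    return objects_lost / total_objects


# Hypothetical sample values, for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_update = now - timedelta(minutes=42)

age = freshness_seconds(last_update, now)
print(f"Freshness: {age:.0f}s old")      # 2520s
print(f"Within 1 hour: {age <= 3600}")   # True
print(f"Loss rate: {durability_loss_rate(3, 1_000_000):.6f}")  # 0.000003
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the calculation easy to test.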
Measuring SLIs in Prometheus
# Availability SLI: percentage of non-5xx responses
avg(
  sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) * 100

# Latency SLI: p99 response time in seconds
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Error rate SLI: percentage of 5xx responses
avg(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) * 100

# Output: Single values representing current reliability
# Availability: 99.87%
# p99 Latency: 0.345s
# Error Rate: 0.13%

Defining Custom SLIs in Code
# sli_calculator.py
from datetime import datetime, timedelta
import random


class SLICalculator:
    def __init__(self, window_minutes=5):
        self.window = timedelta(minutes=window_minutes)
        self.requests = []

    def record(self, success, latency_ms):
        self.requests.append({
            "timestamp": datetime.now(),
            "success": success,
            "latency": latency_ms,
        })

    def availability(self):
        # Only requests inside the rolling window count toward the SLI
        recent = [r for r in self.requests
                  if datetime.now() - r["timestamp"] < self.window]
        if not recent:
            return 100.0
        successes = sum(1 for r in recent if r["success"])
        return successes / len(recent) * 100

    def p99_latency(self):
        recent = sorted(
            [r["latency"] for r in self.requests
             if datetime.now() - r["timestamp"] < self.window]
        )
        if not recent:
            return 0
        idx = int(len(recent) * 0.99)
        return recent[min(idx, len(recent) - 1)]


sli = SLICalculator()
for i in range(1000):
    success = random.random() > 0.005  # 0.5% failure rate
    latency = random.gauss(150, 50)    # mean 150ms, stddev 50ms
    sli.record(success, latency)

print(f"Availability SLI: {sli.availability():.2f}%")
print(f"p99 Latency SLI: {sli.p99_latency():.0f}ms")

# Example output (varies per run because the input is random):
# Availability SLI: 99.50%
# p99 Latency SLI: 266ms

Service Level Objectives (SLOs)
An SLO is a target value for an SLI over a time window. It drives engineering decisions.
| SLI | SLO Target | Window |
|---|---|---|
| Availability | ≥ 99.9% | 30 days |
| p99 Latency | ≤ 500ms | 7 days |
| Error Rate | ≤ 1% | 30 days |
Error Budget
Error budget = 100% - SLO target. It represents the acceptable amount of unreliability.
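To make the formula concrete, the sketch below converts an availability SLO into minutes of allowed full outage over a window. It is a simplified illustration that assumes every minute of downtime fails all requests; the function name is hypothetical.

```python
def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Allowed full-outage minutes for an availability SLO over a window."""
    budget_fraction = (100 - slo_pct) / 100
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes


for slo in (99.0, 99.9, 99.99):
    minutes = error_budget_minutes(slo, window_days=30)
    print(f"SLO {slo}% over 30 days -> {minutes:.1f} minutes of budget")

# SLO 99.0% over 30 days -> 432.0 minutes of budget
# SLO 99.9% over 30 days -> 43.2 minutes of budget
# SLO 99.99% over 30 days -> 4.3 minutes of budget
```

Each extra nine cuts the budget by a factor of ten, which is why small changes to the target have large operational consequences.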
# error_budget_dashboard.py
class ErrorBudgetDashboard:
    def __init__(self, sli_name, slo_pct, window_days):
        self.sli_name = sli_name
        self.slo = slo_pct
        self.window = window_days
        self.budget = 100 - slo_pct  # Total budget, in percentage points
        self.consumed = 0

    def record_period(self, sli_value_pct):
        # Simplification: each period's error percentage is charged
        # directly against the budget.
        error_rate = 100 - sli_value_pct
        self.consumed += error_rate
        # Allowed to go negative so that overspend stays visible
        remaining = self.budget - self.consumed
        print(f"Period SLI: {sli_value_pct:.2f}%")
        print(f"  Error this period: {error_rate:.2f}%")
        print(f"  Budget remaining: {remaining:.2f}% (consumed {self.consumed:.2f}% of {self.budget:.2f}%)")
        return remaining


dash = ErrorBudgetDashboard("availability", 99.9, 30)
dash.record_period(99.95)  # 0.05% error
dash.record_period(99.80)  # 0.20% error - budget overspent
dash.record_period(99.50)  # 0.50% error - budget critical

Expected output:
Period SLI: 99.95%
  Error this period: 0.05%
  Budget remaining: 0.05% (consumed 0.05% of 0.10%)
Period SLI: 99.80%
  Error this period: 0.20%
  Budget remaining: -0.15% (consumed 0.25% of 0.10%)
Period SLI: 99.50%
  Error this period: 0.50%
  Budget remaining: -0.65% (consumed 0.75% of 0.10%)

Multi-Window Multi-Burn-Rate Alerts
A widely used SLO alerting approach combines several time windows, each with its own burn-rate threshold:
groups:
  - name: slo_alerts
    rules:
      # Page: burning budget fast (> 14x allowed burn rate)
      - alert: SLOHighBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))))
          * 100 > 14 * (100 - 99.9)
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error budget burning > 14x allowed rate"
      # Ticket: burning budget moderately (> 3x allowed burn rate)
      - alert: SLOMediumBurnRate
        expr: |
          (1 - (sum(rate(http_requests_total{status!~"5.."}[6h]))
          / sum(rate(http_requests_total[6h]))))
          * 100 > 3 * (100 - 99.9)
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "Error budget burning > 3x allowed rate"

Burn Rate Explained
A burn rate of 1x means you’ll exhaust the error budget over the full window. A burn rate of 14x means you’d exhaust the budget in 1/14th of the window — roughly 2 days for a 30-day window. This triggers a page.
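The arithmetic behind those numbers can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes the error rate stays constant for the rest of the window.

```python
def burn_rate(error_pct: float, slo_pct: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    budget_pct = 100 - slo_pct
    return error_pct / budget_pct


def days_to_exhaustion(rate: float, window_days: int) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days / rate


# Assumed example: 1.4% of requests failing against a 99.9% SLO.
rate = burn_rate(error_pct=1.4, slo_pct=99.9)
print(f"Burn rate: {rate:.1f}x")                                # 14.0x
print(f"Exhausted in {days_to_exhaustion(rate, 30):.1f} days")  # 2.1 days
```

This is the same comparison the alert expressions make: an observed error percentage above 14 times the 0.1% budget corresponds to a burn rate above 14x.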
SLAs
An SLA is a contractual commitment to customers, typically looser than your internal SLOs:
| Commitment | SLO (internal) | SLA (customer) |
|---|---|---|
| Availability | 99.95% | 99.9% |
| Latency | p99 < 300ms | p99 < 1s |
| Support response | 15 min SEV-1 | 1 hour SEV-1 |
Rule: Always set SLOs tighter than SLAs. If your SLA is 99.9%, set your SLO to 99.95%. The buffer (0.05%) is your safety margin.
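One way to picture the buffer is to classify a measured availability against both targets. The sketch below is an illustration only: the function name and the measured values are made up, and it applies to "higher is better" SLIs such as availability.

```python
def classify(actual_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Place a measured availability relative to the SLO and the SLA."""
    if sla_pct > slo_pct:
        raise ValueError("SLO should be at least as strict as the SLA")
    if actual_pct >= slo_pct:
        return "meets SLO"
    if actual_pct >= sla_pct:
        return "SLO missed, SLA still met (inside the buffer)"
    return "SLA breached"


# Targets from the availability row of the table; measured values are hypothetical.
for actual in (99.97, 99.92, 99.85):
    print(f"{actual}% -> {classify(actual, slo_pct=99.95, sla_pct=99.9)}")

# 99.97% -> meets SLO
# 99.92% -> SLO missed, SLA still met (inside the buffer)
# 99.85% -> SLA breached
```

The middle case is the one the buffer exists for: the team gets an internal signal to act before any contractual commitment is broken.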
Reporting
# slo_report.py
def generate_slo_report(services):
    print("=== SLO Compliance Report ===")
    print(f"{'Service':<20} {'SLI':<15} {'SLO':<10} {'Actual':<10} {'Status':<10}")
    print("-" * 65)
    for s in services:
        # Latency-style SLIs pass when the actual value is at or below
        # the target; availability-style SLIs pass at or above it.
        if s.get("lower_is_better", False):
            compliant = s["actual"] <= s["slo"]
        else:
            compliant = s["actual"] >= s["slo"]
        status = "PASS" if compliant else "FAIL"
        print(f"{s['name']:<20} {s['sli']:<15} {s['slo']:<10} "
              f"{s['actual']:<10} {status:<10}")


services = [
    {"name": "api-gateway", "sli": "Availability", "slo": 99.95, "actual": 99.97},
    {"name": "api-gateway", "sli": "p99 Latency", "slo": 500, "actual": 320,
     "lower_is_better": True},
    {"name": "user-db", "sli": "Availability", "slo": 99.99, "actual": 99.98},
    {"name": "cdn", "sli": "Availability", "slo": 99.9, "actual": 99.95},
]
generate_slo_report(services)

Expected output:
=== SLO Compliance Report ===
Service              SLI             SLO        Actual     Status
-----------------------------------------------------------------
api-gateway          Availability    99.95      99.97      PASS
api-gateway          p99 Latency     500        320        PASS
user-db              Availability    99.99      99.98      FAIL
cdn                  Availability    99.9       99.95      PASS

Common Mistakes
Setting SLOs too tight (99.999%): Five nines means 5 minutes of downtime per year. Unless you’re a critical infrastructure provider, 99.9% (8.7 hours/year) is realistic. Start there.
Measuring the wrong SLIs: Internal metrics like database CPU usage are not SLIs. SLIs must measure what users experience — request success rate and response time.
Not using error budgets for deployment decisions: If the budget is full, teams should feel safe deploying. If it’s exhausted, they should focus on stability. Without this feedback, SLOs are just numbers on a dashboard.
Only using one time window for alerts: A single 30-day window doesn’t catch fast burn rates. Multi-window alerts catch both sudden spikes (1h window) and gradual degradation (6h window).
Confusing SLAs with SLOs: SLAs are legal contracts with penalties. SLOs are internal targets. Don’t commit to SLAs that are tighter than your SLOs — you need buffer room.
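The point about using error budgets for deployment decisions can be sketched as a small release gate. This is a hypothetical policy, not a standard one: the function name and the 25% caution threshold are assumptions chosen for illustration.

```python
def release_decision(budget_pct: float, consumed_pct: float,
                     caution_below: float = 0.25) -> str:
    """Suggest a release posture from the remaining error budget.

    caution_below is the remaining-budget fraction under which only
    low-risk changes ship; 0.25 is an arbitrary value for this sketch.
    """
    if budget_pct <= 0:
        raise ValueError("budget_pct must be positive")
    remaining_fraction = (budget_pct - consumed_pct) / budget_pct
    if remaining_fraction <= 0:
        return "freeze: budget exhausted, prioritize stability work"
    if remaining_fraction < caution_below:
        return "caution: ship only low-risk changes"
    return "go: budget available, deploy normally"


# A 99.9% SLO gives a 0.1% budget; the consumed values are hypothetical.
for consumed in (0.02, 0.08, 0.12):
    print(f"consumed {consumed:.2f}% -> {release_decision(0.1, consumed)}")

# consumed 0.02% -> go: budget available, deploy normally
# consumed 0.08% -> caution: ship only low-risk changes
# consumed 0.12% -> freeze: budget exhausted, prioritize stability work
```

The exact thresholds matter less than agreeing on them in advance, so the decision is mechanical when the budget runs low.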
Practice Questions
What is an SLI and what are the most common types? Answer: An SLI is a Service Level Indicator — a measurement of service quality. Common types: availability (success rate), latency (response time), error rate, throughput, freshness.
How do SLOs relate to error budgets? Answer: Error budget = 100% - SLO target. It represents how much unreliability is acceptable. If SLO is 99.9%, the error budget is 0.1% (about 43 minutes in 30 days, or about 8.7 hours per year).
What is multi-window multi-burn-rate alerting? Answer: An alerting strategy using multiple time windows (1h, 6h) to detect different burn rates. Fast burn (14x) triggers a page. Slow burn (3x) creates a ticket. This catches both sudden and gradual budget consumption.
Why should SLOs be tighter than SLAs? Answer: The SLO-SLA buffer protects against measurement errors, edge cases, and reporting delays. If SLA is 99.9% and you barely meet it, you risk violations during outage recovery.
Challenge
Define SLIs, SLOs, and error budgets for an e-commerce platform: identify 3 user-facing SLIs (checkout success rate, product page latency, search availability), set realistic SLO targets, create Grafana dashboard panels for each SLI, configure multi-window burn-rate alerts in Prometheus, and build a weekly SLO compliance report.
Mini Project: SLO Dashboard Generator
# slo_dashboard_generator.py
import json


def generate_grafana_panel(title, expr, slo_target, unit="percent"):
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": "{{ instance }}"}],
        "thresholds": [
            {"value": slo_target - 0.1, "color": "green", "op": "gt"},
            {"value": slo_target, "color": "yellow", "op": "gt"},
        ],
    }


dashboard = {
    "title": "SLO Overview",
    "panels": [
        generate_grafana_panel("Availability",
            'avg(sum(rate(http_requests_total{status!~"5.."}[5m])) '
            '/ sum(rate(http_requests_total[5m]))) * 100', 99.9),
        generate_grafana_panel("p99 Latency",
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
            0.5, "seconds"),
    ]
}
print(json.dumps(dashboard, indent=2))

What’s Next
- SRE practices and implementation
- Testing resilience
Related topics: SRE, Prometheus, Grafana, SLI
Congratulations on completing this SLIs/SLOs/SLAs tutorial! Here’s where to go from here:
- Practice daily — Define one SLI for your most critical service today
- Build a project — Create an SLO dashboard with error budget tracking
- Explore related topics — Check out SRE and incident response practices
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.