SLI (Service Level Indicator): what you measure, e.g. p99 latency, error rate, availability. SLO (Service Level Objective): the target on that SLI, e.g. p99 latency < 200 ms, error rate < 0.1%, availability > 99.9%. Error budget = 1 - SLO: a 99.9% SLO leaves 0.1%, about 43 minutes of downtime per 30-day month. When the budget is exhausted, freeze features and focus on reliability. Tools: Nobl9, Grafana SLO, OpenSLO spec.
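To make the error-budget rule concrete, here is a minimal Python sketch for an event-based SLO (good requests out of total requests). The function names `budget_remaining` and `should_freeze_features` are illustrative, not part of any tool named above, and the freeze rule is the simple "budget spent means freeze" policy described in the text.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    slo   -- target success ratio, e.g. 0.999
    good  -- number of successful events in the window
    total -- number of all events in the window
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo) * total  # the error budget, in events
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def should_freeze_features(slo: float, good: int, total: int) -> bool:
    """Hypothetical policy: freeze feature work once the budget is used up."""
    return budget_remaining(slo, good, total) <= 0.0
```

With a 99.9% SLO and 1,000,000 requests, 500 failures leave about half the budget, while 2,000 failures overspend it and trigger the freeze.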
Below: step-by-step, working examples, common pitfalls, FAQ.
Example configurations by scenario:

Prometheus SLO rules:

    groups:
      - name: api_slo
        rules:
          # SLI: share of requests that did not return a 5xx, over 5 minutes
          - record: api_availability_sli
            expr: |
              sum(rate(http_requests_total{code!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
          # 14.4x burn rate against a 99.9% SLO, sustained for 5 minutes
          - alert: SLOFastBurn
            expr: (1 - api_availability_sli) > (14.4 * (1 - 0.999))
            for: 5m

OpenSLO spec (v1 layout; validate against the OpenSLO schema before use):

    apiVersion: openslo/v1
    kind: SLO
    metadata:
      name: api-availability
    spec:
      description: 99.9% success rate for API
      service: api-gateway
      indicator:
        metadata:
          name: api-availability-ratio
        spec:
          ratioMetric:
            counter: false  # the queries below already apply rate()
            good:
              metricSource:
                type: Prometheus
                spec:
                  query: sum(rate(http_requests_total{code!~"5.."}[5m]))
            total:
              metricSource:
                type: Prometheus
                spec:
                  query: sum(rate(http_requests_total[5m]))
      timeWindow:
        - duration: 28d
          isRolling: true
      budgetingMethod: Occurrences
      objectives:
        - displayName: 99.9% availability
          target: 0.999

Latency SLI (p99):

    # Prometheus: p99 across all series, compared against 200 ms
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) < 0.2

Burn-rate multi-window alert:

    # Fast burn: a 14.4x burn rate exhausts a 30-day budget in about 2 days.
    # Page SRE only when both the 1h and the 5m window exceed the threshold.
    (1 - sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))
      > 14.4 * (1 - 0.999)
    and
    (1 - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
      > 14.4 * (1 - 0.999)

Grafana SLO panel:

    # Using Grafana SLO plugin
    # Panel type: SLO
    # Time window: 28d rolling
    # Good events: rate(http_requests_total{code!~"5.."}[1h])
    # Total events: rate(http_requests_total[1h])
    # Shows: current SLI, error budget remaining, burn rate
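The multi-window burn-rate rule above can be sketched in plain Python to show the arithmetic behind it. This is an illustration only: it assumes the error rates for the long (1h) and short (5m) windows have already been computed elsewhere, and the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    return error_rate / (1.0 - slo)


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when BOTH windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_rate, slo) > threshold
            and burn_rate(short_window_error_rate, slo) > threshold)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate
```

For example, a 2% error rate in both windows against a 99.9% SLO is a burn rate of about 20 and would page, while a 2% long-window rate with a recovered short window would not. At 14.4x, `days_until_exhausted` gives roughly 2.1 days for a 30-day window.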
SLO: internal target. SLA: legal contract with penalties. Keep the SLO stricter than the SLA (e.g. a 99.9% SLO behind a 99% SLA), so an SLO breach warns you before the contract is at risk. Breach the SLO → postmortem. Breach the SLA → refund.
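The gap between the two thresholds can be shown with a small Python sketch. The function name `classify` and the default values (99.9% SLO, 99% SLA) are assumptions taken from the example figures in the text, not from any particular contract.

```python
def classify(availability: float, slo: float = 0.999, sla: float = 0.99) -> str:
    """Map a measured availability ratio to the consequence described above."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if availability < sla:
        return "sla_breach"  # contractual penalty, e.g. refund
    if availability < slo:
        return "slo_breach"  # internal target missed, run a postmortem
    return "ok"
```

A month at 99.5% availability lands in the `slo_breach` band: the internal target is missed, but the contract is still intact.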
AWS S3: 99.9%. Gmail: 99.97%. Stripe API: 99.99%. For a startup: 99% is enough for an MVP, 99.9% for B2B, 99.95%+ for critical infrastructure.
99.9% leaves a 0.1% budget, about 43 min/mo. 99.99% leaves about 4 min/mo. Each extra 9 multiplies cost ~10x. Be realistic.
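The downtime figures follow directly from the target percentage. A minimal sketch, assuming a 30-day month and a time-based availability target (the function name is illustrative):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime a time-based availability target permits."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


# Print the allowance for a few common targets.
for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min per 30 days")
```

This gives roughly 432, 43.2, 21.6, and 4.3 minutes respectively, so each extra nine cuts the allowance by a factor of ten.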
<a href="/en/monitors">Enterno Uptime Monitoring</a> tracks availability + latency + SSL. Alerts + SLA reports. Integrates with PagerDuty, Slack, Telegram.