
How to Measure SLI/SLO

Key idea:

SLI (Service Level Indicator) is what you measure: p99 latency < 200 ms, error rate < 0.1%, availability > 99.9%. SLO (Service Level Objective) is the target you set on an SLI. Error Budget = 1 - SLO (99.9% → 0.1% ≈ 43 min of downtime allowed per month). When the budget is exhausted, freeze features and focus on reliability. Tools: SLOconf (Nobl9), Grafana SLO, OpenSLO spec.
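The error-budget arithmetic can be sketched in a few lines of Python (the 30-day rolling window and the helper name are assumptions for illustration):

```python
# Minimal sketch: turn an SLO target into an allowed-downtime budget.
# Assumes a 30-day rolling window; adjust window_days to your own.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2 min/month
print(round(error_budget_minutes(0.9999), 1))  # 4.3 min/month
```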

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Identify critical user journeys (login, checkout, core API)
  2. Define SLI per journey: latency p99, error rate, throughput
  3. Set SLO realistically — start with 99% (3.65d downtime/yr), grow to 99.9%
  4. Instrument with Prometheus metrics or OpenTelemetry
  5. Calculate error budget consumption: actual errors / allowed
  6. Alerting: use burn-rate alerts, not static thresholds — page on "at this rate the budget is gone in 6h"
  7. Monthly review: miss SLO → postmortem + feature freeze
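Step 5's budget-consumption calculation is just a ratio of observed bad events to allowed bad events. A hedged sketch (function name is my own, not from any library):

```python
# Sketch of step 5: fraction of the error budget consumed.
# Allowed bad events = total events * (1 - SLO).

def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Return the fraction of error budget used (1.0 = budget exhausted)."""
    allowed_bad = total_events * (1 - slo)
    if allowed_bad == 0:
        return float("inf") if bad_events else 0.0
    return bad_events / allowed_bad

# 10M requests at a 99.9% SLO -> 10,000 bad requests allowed
print(budget_consumed(2_500, 10_000_000, 0.999))  # 0.25 -> 25% consumed
```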

Working Examples

Prometheus SLO rules

```yaml
groups:
  - name: api_slo
    rules:
      - record: api_availability_sli
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - alert: SLOBurnRate6h
        expr: (1 - api_availability_sli) > (14.4 * (1 - 0.999))
        for: 5m  # burn 6-hour budget
```

OpenSLO spec

```yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
spec:
  description: 99.9% success rate for API
  service: api-gateway
  indicator:
    metricSource:
      type: prometheus
      spec:
        good: sum(rate(http_requests_total{code!~"5.."}[5m]))
        total: sum(rate(http_requests_total[5m]))
  objectives:
    - displayName: 99.9%
      target: 0.999
  timeWindow:
    - rolling:
        count: 28
        unit: Day
```

Latency SLI (p99)

```promql
# Prometheus
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
) < 0.2  # 200ms
```

Burn-rate multi-window alert

```promql
# Fast burn (14.4x rate: exhausts the 30d budget in ~2 days)
error_rate > 14.4 * (1 - 0.999)  # for 1h
# AND
error_rate > 14.4 * (1 - 0.999)  # for 5m
# → page SRE
```

Grafana SLO panel

```text
# Using the Grafana SLO plugin
# Panel type: SLO
# Time window: 28d rolling
# Good events:  rate(http_requests{code!~"5.."}[1h])
# Total events: rate(http_requests[1h])
# Shows: current SLI, error budget remaining, burn rate
```

Common Pitfalls

  • Too-tight SLO (99.999% = 26s/mo) is impossible for a small team. Start at 99% and tighten
  • SLO without user impact is meaningless. "CPU < 80%" is not an SLO; "checkout success > 99.9%" is
  • Threshold-based alerts (error > 1%) are noisy; burn-rate alerts are better
  • Ignoring the error budget in planning: deploying big changes when the budget is exhausted invites an outage
  • Monitoring only uptime (UP/DOWN ping) misses latency and partial degradation
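The burn-rate point can be made concrete: burn rate is the observed error rate divided by the budgeted rate (1 - SLO), and it directly predicts when the budget runs out. A small sketch (helper name is my own, not a standard API):

```python
# Sketch: why a 14.4x burn rate pages immediately.
# burn_rate = observed error rate / budgeted error rate (1 - SLO)

def hours_until_budget_exhausted(error_rate: float, slo: float,
                                 window_days: int = 30) -> float:
    """Hours until the rolling-window error budget is fully spent
    if the current error rate holds."""
    burn_rate = error_rate / (1 - slo)
    return (window_days * 24) / burn_rate

# 1.44% errors against a 99.9% SLO = a 14.4x burn rate
print(round(hours_until_budget_exhausted(0.0144, 0.999), 1))  # 50.0 hours (~2 days)
```

A static "error > 1%" threshold would fire identically for a brief blip and a sustained burn; the time-to-exhaustion framing is what separates "watch it" from "page now".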


Frequently Asked Questions

SLO vs SLA?

SLO is an internal target; SLA is a legal contract with penalties. The SLO should always be stricter than the SLA (e.g. a 99.9% SLO backing a 99% SLA). Breach the SLO → postmortem. Breach the SLA → refund.

Realistic numbers?

AWS S3: 99.9%. Gmail: 99.97%. Stripe API: 99.99%. For a startup: 99% is enough for an MVP; 99.9% for B2B; 99.95%+ for critical infra.

Error budget ratio?

99.9% = 0.1% ≈ 43 min/mo. 99.99% ≈ 4 min/mo. Each additional nine multiplies the cost roughly 10x. Be realistic.

Enterno monitoring?

Enterno Uptime Monitoring (/en/monitors) tracks availability, latency, and SSL expiry. Alerts plus SLA reports. Integrates with PagerDuty, Slack, Telegram.