
How to Measure SLI/SLO

Key idea:

SLI (Service Level Indicator) is what you measure: p99 latency < 200 ms, error rate < 0.1%, availability > 99.9%. SLO (Service Level Objective) is the target you set on an SLI. Error Budget = 1 - SLO (99.9% → 0.1% ≈ 43 min of downtime allowed per month). When the budget is exhausted, freeze features and focus on reliability. Tools: SLOconf (Nobl9), Grafana SLO, OpenSLO spec.
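The error-budget arithmetic can be sketched in a few lines of Python (the 30-day rolling window and the helper name are assumptions for illustration):

```python
# Minimal sketch: turn an SLO target into an allowed-downtime budget.
# Assumes a 30-day rolling window; adjust window_days to your own.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2 min/month
print(round(error_budget_minutes(0.9999), 1))  # 4.3 min/month
```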

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Identify critical user journeys (login, checkout, core API)
  2. Define SLI per journey: latency p99, error rate, throughput
  3. Set SLO realistically — start with 99% (3.65d downtime/yr), grow to 99.9%
  4. Instrument with Prometheus metrics or OpenTelemetry
  5. Calculate error budget consumption: actual errors / allowed
  6. Alerting: use burn-rate alerts, not static thresholds — page on "at this rate the budget is gone in 6h"
  7. Monthly review: miss SLO → postmortem + feature freeze
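Step 5's budget-consumption calculation is just a ratio of observed bad events to allowed bad events. A hedged sketch (function name is my own, not from any library):

```python
# Sketch of step 5: fraction of the error budget consumed.
# Allowed bad events = total events * (1 - SLO).

def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Return the fraction of error budget used (1.0 = budget exhausted)."""
    allowed_bad = total_events * (1 - slo)
    if allowed_bad == 0:
        return float("inf") if bad_events else 0.0
    return bad_events / allowed_bad

# 10M requests at a 99.9% SLO -> 10,000 bad requests allowed
print(budget_consumed(2_500, 10_000_000, 0.999))  # 0.25 -> 25% consumed
```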

Working Examples

Prometheus SLO rules

```yaml
groups:
  - name: api_slo
    rules:
      - record: api_availability_sli
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - alert: SLOBurnRate6h
        expr: (1 - api_availability_sli) > (14.4 * (1 - 0.999))
        for: 5m  # burn 6-hour budget
```

OpenSLO spec

```yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
spec:
  description: 99.9% success rate for API
  service: api-gateway
  indicator:
    metricSource:
      type: prometheus
      spec:
        good: sum(rate(http_requests_total{code!~"5.."}[5m]))
        total: sum(rate(http_requests_total[5m]))
  objectives:
    - displayName: 99.9%
      target: 0.999
  timeWindow:
    - rolling:
        count: 28
        unit: Day
```

Latency SLI (p99)

```promql
# Prometheus
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
) < 0.2  # 200ms
```

Burn-rate multi-window alert

```promql
# Fast burn (14.4x rate: exhausts the 30d budget in ~2 days)
error_rate > 14.4 * (1 - 0.999)  # for 1h
# AND
error_rate > 14.4 * (1 - 0.999)  # for 5m
# → page SRE
```

Grafana SLO panel

```text
# Using the Grafana SLO plugin
# Panel type: SLO
# Time window: 28d rolling
# Good events:  rate(http_requests{code!~"5.."}[1h])
# Total events: rate(http_requests[1h])
# Shows: current SLI, error budget remaining, burn rate
```

Common Pitfalls

  • Too-tight SLO (99.999% = 26s/mo) is impossible for a small team. Start at 99% and tighten
  • SLO without user impact is meaningless. "CPU < 80%" is not an SLO; "checkout success > 99.9%" is
  • Threshold-based alerts (error > 1%) are noisy; burn-rate alerts are better
  • Ignoring the error budget in planning: deploying big changes when the budget is exhausted invites an outage
  • Monitoring only uptime (UP/DOWN ping) misses latency and partial degradation
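The burn-rate point can be made concrete: burn rate is the observed error rate divided by the budgeted rate (1 - SLO), and it directly predicts when the budget runs out. A small sketch (helper name is my own, not a standard API):

```python
# Sketch: why a 14.4x burn rate pages immediately.
# burn_rate = observed error rate / budgeted error rate (1 - SLO)

def hours_until_budget_exhausted(error_rate: float, slo: float,
                                 window_days: int = 30) -> float:
    """Hours until the rolling-window error budget is fully spent
    if the current error rate holds."""
    burn_rate = error_rate / (1 - slo)
    return (window_days * 24) / burn_rate

# 1.44% errors against a 99.9% SLO = a 14.4x burn rate
print(round(hours_until_budget_exhausted(0.0144, 0.999), 1))  # 50.0 hours (~2 days)
```

A static "error > 1%" threshold would fire identically for a brief blip and a sustained burn; the time-to-exhaustion framing is what separates "watch it" from "page now".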


Frequently Asked Questions

SLO vs SLA?

SLO is an internal target; SLA is a legal contract with penalties. The SLO should always be stricter than the SLA (e.g. a 99.9% SLO backing a 99% SLA). Breach the SLO → postmortem. Breach the SLA → refund.

Realistic numbers?

AWS S3: 99.9%. Gmail: 99.97%. Stripe API: 99.99%. For a startup: 99% is enough for an MVP; 99.9% for B2B; 99.95%+ for critical infra.

Error budget ratio?

99.9% = 0.1% ≈ 43 min/mo. 99.99% ≈ 4 min/mo. Each additional nine multiplies the cost roughly 10x. Be realistic.

Enterno monitoring?

Enterno Uptime Monitoring (/en/monitors) tracks availability, latency, and SSL expiry. Alerts plus SLA reports. Integrates with PagerDuty, Slack, Telegram.