
How to Measure SLI/SLO

In short:

SLI (Service Level Indicator) is what you measure: p99 latency, error rate, availability. SLO (Service Level Objective) is the target set on an SLI: p99 latency < 200 ms, error rate < 0.1%, availability > 99.9%. Error Budget = 1 - SLO (99.9% → 0.1% ≈ 43 min/month of allowed downtime). When the budget is exhausted: freeze features, focus on reliability. Tools: Nobl9 (SLOconf), Grafana SLO, the OpenSLO spec.

Below: step-by-step setup, working examples, common mistakes, FAQ.


Step-by-step setup

  1. Identify critical user journeys (login, checkout, core API)
  2. Define an SLI per journey: p99 latency, error rate, throughput
  3. Set a realistic SLO: start at 99% (3.65 days of downtime/yr), grow to 99.9%
  4. Instrument with Prometheus metrics or OpenTelemetry
  5. Calculate error budget consumption: actual errors / allowed errors
  6. Alerting: burn-rate alerts, not threshold-based ones: "at this rate the budget is gone in 6h"
  7. Monthly review: missed the SLO → postmortem + feature freeze
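Step 5 above is a single division; a minimal sketch for a request-based SLO (the function name and the request counts are illustrative, not from any library):

```python
def error_budget_report(total_requests: int, failed_requests: int, slo: float) -> dict:
    """Error-budget consumption for a request-based availability SLO."""
    allowed_failures = total_requests * (1 - slo)      # budget, in requests
    consumed = failed_requests / allowed_failures      # fraction of budget burned
    return {
        "sli": 1 - failed_requests / total_requests,   # observed success ratio
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }

# Hypothetical month: 10M requests, 4,000 failures, 99.9% SLO
report = error_budget_report(10_000_000, 4_000, 0.999)
print(report)  # SLI 0.9996, ~40% of the budget consumed
```

If `budget_consumed` crosses 1.0 before the window ends, the policy from step 7 kicks in: feature freeze until the budget recovers.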

Working examples

Scenario: Prometheus SLO rules

    groups:
      - name: api_slo
        rules:
          - record: api_availability_sli
            expr: |
              sum(rate(http_requests_total{code!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
          - alert: SLOBurnRate6h
            expr: (1 - api_availability_sli) > (14.4 * (1 - 0.999))
            for: 5m  # at this rate, a 6-hour budget window burns out

Scenario: OpenSLO spec

    apiVersion: openslo/v1
    kind: SLO
    metadata:
      name: api-availability
    spec:
      description: 99.9% success rate for the API
      service: api-gateway
      indicator:
        metricSource:
          type: prometheus
          spec:
            good: sum(rate(http_requests_total{code!~"5.."}[5m]))
            total: sum(rate(http_requests_total[5m]))
      objectives:
        - displayName: 99.9% availability
          target: 0.999
      timeWindow:
        - rolling:
            count: 28
            unit: Day

Scenario: Latency SLI (p99)

    # Prometheus
    histogram_quantile(
      0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) < 0.2  # 200 ms

Scenario: Burn-rate multi-window alert

    # Fast burn: a 14.4x rate exhausts the 30-day budget in ~2 days
    error_rate > 14.4 * (1 - 0.999)  for 1h
    # AND
    error_rate > 14.4 * (1 - 0.999)  for 5m
    # → page the on-call SRE

Scenario: Grafana SLO panel

    # Using the Grafana SLO plugin
    # Panel type: SLO
    # Time window: 28d rolling
    # Good events:  rate(http_requests{code!~"5.."}[1h])
    # Total events: rate(http_requests[1h])
    # Shows: current SLI, remaining error budget, burn rate
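The 14.4 multiplier used above is not arbitrary: it is the burn rate that consumes 2% of a 30-day error budget within one hour, the pairing popularized by the Google SRE workbook. A quick sketch of the arithmetic (the function name is illustrative):

```python
def burn_rate_factor(budget_fraction: float, window_hours: float,
                     slo_period_days: int = 30) -> float:
    """Burn-rate multiplier that consumes `budget_fraction` of the
    error budget within `window_hours` of an SLO period."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / window_hours

# Common pairings: (budget share, long window) -> multiplier
print(burn_rate_factor(0.02, 1))    # 14.4 -> page immediately
print(burn_rate_factor(0.05, 6))    # 6.0  -> page
print(burn_rate_factor(0.10, 72))   # 1.0  -> open a ticket
```

Pick the multiplier, multiply by (1 - SLO), and you get the error-rate threshold for that window, as in the Prometheus rule above.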

Common mistakes

  • Too tight an SLO (99.999% ≈ 26 s/month) is impossible for a small team. Start at 99% and tighten
  • An SLO without user impact is meaningless. "CPU < 80%" is not an SLO; "checkout success > 99.9%" is
  • Threshold-based alerts (error > 1%) are noisy. Burn-rate alerts work better
  • Ignoring the error budget in planning: deploying huge changes when the budget is exhausted = outage
  • Monitoring only uptime (UP/DOWN ping) misses latency and partial degradation


Frequently asked questions

SLO vs SLA?

SLO: an internal target. SLA: a legal contract with penalties. The SLO should always be stricter than the SLA (99.9% SLO → 99% SLA). SLO breach → postmortem. SLA breach → refund.

Real-world numbers?

AWS S3: 99.9%. Gmail: 99.97%. Stripe API: 99.99%. For a startup: 99% is enough for an MVP; 99.9% for B2B; 99.95%+ for critical infra.

Error budget ratio?

99.9% = 0.1% ≈ 43 min/month. 99.99% ≈ 4.3 min/month. Each extra nine multiplies the cost ~10x. Be realistic.
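The downtime figures above follow from a single multiplication over the SLO window; a quick sketch (the function name is illustrative):

```python
def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows per period."""
    return period_days * 24 * 60 * (1 - slo)  # minutes in period * budget

for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(slo, round(allowed_downtime_minutes(slo), 1))
# 99%     -> 432.0 min (~7.2 h)
# 99.9%   ->  43.2 min
# 99.99%  ->   4.3 min
# 99.999% ->   0.4 min (~26 s)
```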

Enterno monitoring?

<a href="/monitors">Enterno Uptime Monitoring</a> tracks availability + latency + SSL. Alerts + SLA reports. Integrates with PagerDuty, Slack, Telegram.