SLI (Service Level Indicator): what you measure, for example p99 latency, error rate, or availability. SLO (Service Level Objective): the target set on an SLI, such as p99 latency < 200 ms, error rate < 0.1%, or availability > 99.9%. Error budget = 1 - SLO: a 99.9% target leaves 0.1%, or about 43 minutes of allowed downtime per 30-day month. When the budget is exhausted, freeze feature work and focus on reliability. Tools: Nobl9, Grafana SLO, the OpenSLO spec.
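To make the SLI, SLO, and error-budget relationship concrete, here is a minimal Python sketch for a request-based availability SLI. The function names and the traffic numbers are hypothetical, chosen only for illustration; the 99.9% target comes from the text above.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is breached)."""
    budget = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    spent = 1.0 - sli           # observed failure ratio
    return 1.0 - spent / budget


# Hypothetical traffic: 1,000,000 requests, 400 of them failed.
sli = availability_sli(good_events=999_600, total_events=1_000_000)
remaining = budget_remaining(sli, slo_target=0.999)
print(f"SLI = {sli:.4%}, budget remaining = {remaining:.0%}")
# SLI = 99.9600%, budget remaining = 60%
```

With a 99.9% target, 400 failures out of a million requests spend 40% of the budget; the remaining share goes negative as soon as the failure ratio exceeds 0.1%.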
Below: step-by-step instructions, working examples, common mistakes, and an FAQ.
| Scenario | Config |
|---|---|
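| Prometheus SLO rules | groups:
  - name: api_slo
    rules:
      # Share of non-5xx requests over the last 5 minutes
      - record: api_availability_sli
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Fires when the error ratio exceeds 14.4x the budgeted 0.1%
      - alert: SLOBurnRate6h
        expr: (1 - api_availability_sli) > (14.4 * (1 - 0.999))
        for: 5m # at 14.4x burn, a 30-day budget is gone in about 2 days |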
| OpenSLO spec | apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
spec:
  description: 99.9% success rate for API
  service: api-gateway
  indicator:
    metadata:
      name: api-availability-sli # illustrative name
    spec:
      ratioMetric:
        counter: false # the queries already apply rate()
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{code!~"5.."}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total[5m]))
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: 99.9% availability
      target: 0.999 |
| Latency SLI (p99) | # Prometheus: p99 latency aggregated across all series
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
) < 0.2 # 200 ms |
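| Burn-rate multi-window alert | # Fast burn: a 14.4x burn rate exhausts a 30-day budget in about 2 days
# Long window (1h) confirms the burn is sustained
(1 - sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > 14.4 * (1 - 0.999)
and
# Short window (5m) confirms it is still happening
(1 - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 14.4 * (1 - 0.999)
# → page SRE |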
| Grafana SLO panel | # Using Grafana SLO plugin
# Panel type: SLO
# Time window: 28d rolling
# Good events: rate(http_requests_total{code!~"5.."}[1h])
# Total events: rate(http_requests_total[1h])
# Shows: current SLI, error budget remaining, burn rate |
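The burn-rate rows in the table can be checked with a few lines of arithmetic. The sketch below is an illustration rather than a production alerting rule: it assumes a 30-day SLO window, which is the window the 14.4x factor refers to, and the error rates passed to `should_page` are made-up inputs.

```python
SLO_TARGET = 0.999
WINDOW_DAYS = 30     # assumption: the window the 14.4x factor refers to
FAST_BURN = 14.4     # fast-burn threshold used in the table


def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def should_page(err_1h: float, err_5m: float) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return burn_rate(err_1h) > FAST_BURN and burn_rate(err_5m) > FAST_BURN


print(hours_to_exhaustion(FAST_BURN))          # about 50 hours, roughly 2 days
print(should_page(err_1h=0.02, err_5m=0.03))   # True: both windows burn fast
print(should_page(err_1h=0.02, err_5m=0.001))  # False: the spike is already over
```

Requiring both windows is what keeps the alert from paging on an incident that has already recovered: the 1h window proves the burn was significant, the 5m window proves it is still ongoing.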
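SLO: an internal target. SLA: a legal contract with penalties. The SLO should be stricter than the SLA (for example, a 99.9% SLO behind a 99% SLA), so that internal alarms fire before the contract is at risk. Breaching the SLO triggers a postmortem; breaching the SLA triggers a refund.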
AWS S3: 99.9%. Gmail: 99.97%. Stripe API: 99.99%. For a startup, 99% is enough for an MVP, 99.9% for B2B, and 99.95%+ for critical infrastructure.
99.9% leaves a 0.1% budget, about 43 min/month. 99.99% leaves about 4 min/month. As a rule of thumb, each additional nine multiplies cost roughly 10x, so be realistic.
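The downtime figures above follow directly from the length of the month. A short sketch, assuming a 30-day month (43,200 minutes); the helper name is illustrative:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60   # 43,200; assumes a 30-day month


def allowed_downtime_minutes(slo_percent: float) -> float:
    """Downtime per 30-day month that still satisfies the given SLO."""
    return MINUTES_PER_30_DAYS * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/month")
# 99.0% -> 432.0 min/month
# 99.9% -> 43.2 min/month
# 99.95% -> 21.6 min/month
# 99.99% -> 4.3 min/month
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the engineering cost of meeting it climbs so steeply.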
<a href="/monitors">Enterno Uptime Monitoring</a> tracks availability, latency, and SSL, and provides alerts and SLA reports. It integrates with PagerDuty, Slack, and Telegram.