
How to set up Prometheus alerting

In short:

Prometheus alerting works in four steps: (1) define alert rules as PromQL expressions in a Prometheus rules.yaml, (2) Prometheus sends firing alerts to Alertmanager, (3) Alertmanager deduplicates them and routes them to receivers (PagerDuty/Slack/email), (4) inhibition rules suppress noisy child alerts. In 2026, prefer burn-rate alerts over threshold-based ones. Integrate with PagerDuty or Opsgenie for on-call rotation.

Below: a step-by-step guide, working examples, common mistakes, and an FAQ.


Step-by-step setup

  1. Prometheus rules file: a PromQL expression plus a for: 5m duration
  2. Alertmanager config: receivers (PagerDuty/Slack) plus routing rules
  3. Start Alertmanager: docker run -p 9093:9093 prom/alertmanager
  4. Prometheus config: alerting.alertmanagers: [{ static_configs: [{ targets: [alertmanager:9093] }] }]
  5. Test: trigger an alert manually and verify it arrives in Slack/PagerDuty
  6. Inhibition: suppress child alerts when a parent alert fires
  7. Silences: mute alerts during planned maintenance
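Steps 1 and 4 can be wired together in prometheus.yml like this (a minimal sketch; the rules file path and the alertmanager hostname are examples, not fixed values):

```yaml
# prometheus.yml — minimal alerting wiring (path and hostname are examples)
rule_files:
  - /etc/prometheus/rules.yaml        # alert rules from step 1

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # step 4: where firing alerts go
```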

Working examples

Alert rule (PromQL):

```yaml
# rules.yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Error rate > 5% on {{ $labels.service }}'
          runbook: https://wiki.internal/runbooks/high-errors
```

Alertmanager config:

```yaml
# alertmanager.yml
route:
  receiver: slack-default
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { team: payments }
      receiver: slack-payments
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: slack-default
    slack_configs:
      - api_url: ${SLACK_URL}
        channel: '#alerts'
```

Burn-rate alert (SRE style):

```yaml
- alert: SLOBurnRateFast
  # Fast burn: 14.4x the 99.9% error budget, over a 5m window
  expr: (1 - availability_sli) > (14.4 * 0.001)
  for: 2m
- alert: SLOBurnRateSlow
  # Slow burn: 3x the 99.9% error budget, over a 6h window
  expr: (1 - availability_sli) > (3 * 0.001)
  for: 1h
```

Inhibition:

```yaml
# If the cluster is down, suppress per-pod alerts
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      alertname: PodCrashLooping
    equal: [cluster]
```

Silence during a deploy:

```shell
$ amtool silence add \
    --alertmanager.url http://localhost:9093 \
    --duration=30m \
    --comment='Deploy v2.3' \
    service=api
```
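The burn-rate multipliers (14.4x fast, 3x slow) come from plain error-budget arithmetic over a 30-day SLO window: a burn rate r consumes r × lookback / window of the budget. A quick sanity check:

```shell
# Burn-rate multipliers for a 30-day SLO window (Google SRE Workbook values)
awk 'BEGIN {
  window_h = 30 * 24                # 720h SLO window
  fast = 0.02 * window_h / 1        # rate that burns 2% of budget in 1h -> 14.4
  slow = 0.10 * window_h / 24       # rate that burns 10% of budget in 24h -> 3.0
  printf "fast multiplier: %.1f\n", fast
  printf "slow multiplier: %.1f\n", slow
  printf "fast threshold at 99.9%% SLO: %.4f\n", fast * 0.001
}'
```

With a 99.9% SLO (error budget 0.001), the fast-burn alert therefore fires when the error ratio exceeds 14.4 × 0.001 = 0.0144, matching the expr above.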

Common mistakes

  • Alert fatigue: 100+ alerts/day and the SRE ignores them all. Consolidate, use inhibition, switch to burn-rate alerts
  • No runbook URL in the annotations: the responder wastes time searching. Always link to the wiki/Notion runbook
  • for: duration too short causes flapping. Use 5-10 minutes for transient issues
  • Email-only routing: the SRE misses alerts while sleeping. Use PagerDuty for critical alerts
  • Silences not tested before a deploy: alerts fire during planned work. Rehearse the silence procedure
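Route-level grouping also cuts alert volume: related alerts are batched into one notification instead of paging separately. A sketch using Alertmanager's standard route fields (the timings are illustrative, not recommendations):

```yaml
# alertmanager.yml — grouping to reduce noise (example timings)
route:
  receiver: slack-default
  group_by: [alertname, cluster]   # batch related alerts into one notification
  group_wait: 30s                  # wait briefly for more alerts before the first notify
  group_interval: 5m               # minimum delay between updates for the same group
  repeat_interval: 4h              # re-notify an unresolved group at most every 4h
```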


Frequently asked questions

PagerDuty vs Opsgenie?

PagerDuty: the market leader, polished UX, $21+/user. Opsgenie (Atlassian): cheaper, tight Jira integration. For small teams, PagerDuty's free tier covers 5 users.

Alertmanager HA?

Clustered mode: 3+ instances gossip state to each other. Without HA, an Alertmanager outage means missed alerts. Run 3 replicas.
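Clustering is enabled with the --cluster.* flags; a sketch for one of three replicas (am1/am2 are example peer hostnames):

```shell
# Each replica lists the other peers; gossip handles deduplication
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am1:9094 \
  --cluster.peer=am2:9094
```

Point Prometheus at all replicas directly rather than through a load balancer, so every replica sees every alert and deduplication via gossip works.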

Can Grafana alerting replace this stack?

Grafana 9+ has built-in alerting (Unified Alerting). For Grafana Cloud users it is the simpler option. Prometheus + Alertmanager remains the standard for self-hosted setups.

Enterno integration?

<a href="/monitors">Enterno uptime monitoring</a> sends alerts to PagerDuty, Slack, and Telegram. For OpenTelemetry-based alerts, Grafana Alerting is a better fit.