Prometheus alerting: (1) Define alert rules in Prometheus rules.yaml (PromQL expressions), (2) Prometheus sends firing alerts → Alertmanager, (3) Alertmanager deduplicates + routes to receivers (PagerDuty/Slack/Email), (4) Inhibition rules suppress noisy children. 2026: move to burn-rate alerts instead of threshold-based. Integration with PagerDuty / Opsgenie for on-call rotation.
Below: step-by-step, working examples, common pitfalls, FAQ.
Quick start: run Alertmanager, then point Prometheus at it:

```shell
docker run -p 9093:9093 prom/alertmanager
```

```yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager:9093]
```

Scenario configs:
**Alert rule (PromQL)**

```yaml
# rules.yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        # sum by (service) keeps the label used in the summary template below
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Error rate > 5% on {{ $labels.service }}'
          runbook: https://wiki.internal/runbooks/high-errors
```
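The expression above is a plain ratio of 5xx rate to total rate. With hypothetical per-second rates (the numbers are illustrative, not from the original), the firing condition reduces to:

```python
def error_ratio(rate_5xx: float, rate_total: float) -> float:
    """Fraction of requests returning 5xx (mirrors the PromQL ratio)."""
    return rate_5xx / rate_total

# Hypothetical rates averaged over a 5m window: 12 of 150 req/s are 5xx.
fires = error_ratio(12.0, 150.0) > 0.05  # 8% error rate, above the 5% threshold
print(fires)  # True
```

The rule only fires once this condition has held continuously for the `for: 10m` window.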
**Alertmanager config**

```yaml
# alertmanager.yml
route:
  receiver: slack-default
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { team: payments }
      receiver: slack-payments
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: slack-default
    slack_configs:
      - api_url: ${SLACK_URL}
        channel: '#alerts'
  - name: slack-payments   # every receiver named in a route must be defined
    slack_configs:
      - api_url: ${SLACK_URL}
        channel: '#payments'
```
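Deduplication (step 3 of the pipeline) is tuned with grouping timers on the route. The values below match Alertmanager's documented defaults; adjust per team:

```yaml
# alertmanager.yml — grouping controls on the route
route:
  group_by: [alertname, cluster]  # alerts sharing these labels collapse into one notification
  group_wait: 30s        # wait for more alerts before sending the first notification
  group_interval: 5m     # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h    # re-notify for alerts that are still firing
```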
**Burn-rate alert (SRE style)**

```yaml
- alert: SLOBurnRateFast
  # Fast burn: error rate above 14.4x the 0.1% budget of a 99.9% SLO
  expr: (1 - availability_sli) > (14.4 * 0.001)
  for: 2m
- alert: SLOBurnRateSlow
  # Slow burn: 3x the budget, sustained over hours
  expr: (1 - availability_sli) > (3 * 0.001)
  for: 1h
```
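The 14.4 multiplier is budget arithmetic, not magic: spending 2% of a 30-day error budget in one hour means burning at 0.02 × 720h / 1h = 14.4× the sustainable rate. A sketch of the math (function names are mine, not from any library):

```python
def burn_rate(budget_fraction: float, window_hours: float,
              slo_period_hours: float = 30 * 24) -> float:
    """Burn-rate multiplier that spends `budget_fraction` of the error
    budget within `window_hours` of a `slo_period_hours` SLO period."""
    return budget_fraction * slo_period_hours / window_hours

def threshold(slo: float, rate: float) -> float:
    """Error-rate threshold to alert on: burn rate times the error budget."""
    return rate * (1 - slo)

fast = burn_rate(0.02, 1)   # 2% of a 30-day budget in 1h
print(round(fast, 1))                     # 14.4
print(round(threshold(0.999, fast), 4))   # 0.0144, i.e. alert above 1.44% errors
# The slow rule's 3x corresponds to spending 2.5% of the budget in 6h:
print(round(burn_rate(0.025, 6), 1))      # 3.0
```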
**Inhibition**

```yaml
# If the cluster is down, suppress per-pod alerts
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      alertname: PodCrashLooping
    equal: [cluster]
```
**Silence during deploy**

```shell
$ amtool silence add \
    --alertmanager.url http://localhost:9093 \
    --duration=30m \
    --comment='Deploy v2.3' \
    service=api
```
Pitfall: a `for:` duration that is too short causes flapping alerts; 5-10 minutes absorbs transient spikes.

On-call integration: PagerDuty is the market leader with polished UX at $21+/user/month. Opsgenie (Atlassian) is cheaper, with tight Jira integration. For small teams, PagerDuty's free tier covers 5 users.
Clustered mode: 3+ instances gossip silences and notification state. Without HA, a single Alertmanager going down means missed alerts. Run 3 replicas.
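A minimal clustered invocation as a sketch (the `am-0`..`am-2` hostnames are placeholders); configure Prometheus to send to all replicas, since deduplication happens between the peers:

```shell
alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-0:9094 \
  --cluster.peer=am-1:9094 \
  --cluster.peer=am-2:9094
```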
Grafana 9+ ships built-in alerting (Unified Alerting), which is simpler for Grafana Cloud users. Prometheus + Alertmanager remains the standard for self-hosting.
<a href="/en/monitors">Enterno uptime monitoring</a> sends alerts to PagerDuty, Slack, and Telegram. For OpenTelemetry-based alerting, Grafana Alerting is the better fit.