
How to Set Up Prometheus Alerting

Key idea:

Prometheus alerting works in four stages: (1) define alert rules (PromQL expressions) in a Prometheus rules file, (2) Prometheus sends firing alerts to Alertmanager, (3) Alertmanager deduplicates, groups, and routes them to receivers (PagerDuty/Slack/email), (4) inhibition rules suppress noisy child alerts. The 2026 trend: prefer burn-rate alerts over static thresholds. Integrate with PagerDuty or Opsgenie for on-call rotation.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Write a Prometheus rules file: a PromQL expression plus a for: 5m duration, and list it under rule_files in prometheus.yml
  2. Write the Alertmanager config: receivers (PagerDuty/Slack) plus routing rules
  3. Start Alertmanager: docker run -p 9093:9093 prom/alertmanager
  4. Point Prometheus at it: alerting.alertmanagers: [{ static_configs: [{ targets: [alertmanager:9093] }] }]
  5. Test: trigger an alert manually and verify it arrives in Slack/PagerDuty
  6. Inhibition: suppress child alerts when a parent alert fires
  7. Silences: mute alerts during planned maintenance
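Steps 1 and 4 together amount to a small prometheus.yml fragment. A minimal sketch, assuming the rules file lives at /etc/prometheus/rules.yaml and Alertmanager is reachable as alertmanager:9093 (both are assumptions, not required paths):

```yaml
# prometheus.yml (file path and hostname are illustrative)
rule_files:
  - /etc/prometheus/rules.yaml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

After reloading Prometheus, the rules appear on its /rules page and firing alerts show up in Alertmanager's UI on port 9093.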

Working Examples

Alert rule (PromQL)

```yaml
# rules.yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Error rate > 5% on {{ $labels.service }}'
          runbook: https://wiki.internal/runbooks/high-errors
```

Alertmanager config

```yaml
# alertmanager.yml
route:
  receiver: slack-default
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { team: payments }
      receiver: slack-payments
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: slack-default
    slack_configs:
      - api_url: ${SLACK_URL}
        channel: '#alerts'
```

Burn-rate alerts (SRE style)

```yaml
# Fast burn: 14.4x the 99.9% error budget, over a short window
- alert: SLOBurnRateFast
  expr: (1 - availability_sli) > (14.4 * 0.001)
  for: 2m
# Slow burn: 3x the budget, over a long window
- alert: SLOBurnRateSlow
  expr: (1 - availability_sli) > (3 * 0.001)
  for: 1h
```

Inhibition

```yaml
# If the cluster is down, suppress per-pod alerts
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      alertname: PodCrashLooping
    equal: [cluster]
```

Silence during a deploy

```shell
amtool silence add \
  --alertmanager.url http://localhost:9093 \
  --duration=30m \
  --comment='Deploy v2.3' \
  service=api
```
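The burn-rate thresholds in the SRE-style example come from simple error-budget arithmetic. A sketch in Python; the 14.4x/3x factors and 30-day window follow the multiwindow burn-rate pattern from the Google SRE Workbook, assuming a 99.9% availability SLO:

```python
# Error-budget arithmetic behind the two burn-rate rules above.
# Assumes a 99.9% availability SLO measured over a 30-day window.
SLO = 0.999
error_budget = 1 - SLO  # 0.001: the error ratio the SLO allows over the window

def burn_rate_threshold(factor: float) -> float:
    """Error-ratio threshold meaning the budget burns `factor` times faster than allowed."""
    return factor * error_budget

fast = burn_rate_threshold(14.4)  # SLOBurnRateFast fires above ~1.44% errors
slow = burn_rate_threshold(3.0)   # SLOBurnRateSlow fires above ~0.3% errors

# Why 14.4x: sustained for 1 hour, it consumes 2% of a 30-day budget,
# which the multiwindow pattern treats as page-worthy.
hours_in_window = 30 * 24
budget_consumed_1h = 14.4 * (1 / hours_in_window)  # 2% of the monthly budget
```

The same arithmetic scales to other SLOs: for 99.5%, replace 0.001 with 0.005 and the thresholds shift accordingly.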

Common Pitfalls

  • Alert fatigue: 100+ alerts/day and on-call starts ignoring all of them. Consolidate, use inhibition, switch to burn-rate alerts
  • No runbook URL in annotations: the responder wastes time hunting for context. Always link to a wiki/Notion runbook
  • for: duration too short: flapping alerts. Use 5-10 minutes to ride out transient issues
  • Email-only routing: on-call misses alerts while asleep. Route critical severity to PagerDuty
  • No silence before a deploy: alerts fire during planned work. Make silencing part of the deploy procedure
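Grouping and repeat intervals are Alertmanager's first line of defense against alert fatigue. A hedged sketch of route-level tuning; the values are illustrative starting points, not prescriptions:

```yaml
# alertmanager.yml route tuning (illustrative values)
route:
  receiver: slack-default
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # wait before the first notification, to batch related alerts
  group_interval: 5m     # minimum time between batches for the same group
  repeat_interval: 4h    # re-notify an unresolved alert at most every 4h
```

One page for "50 pods down in cluster X" beats 50 separate pages; group_by is what collapses them.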


Frequently Asked Questions

PagerDuty vs Opsgenie?

PagerDuty: the market leader, polished UX, from $21+/user. Opsgenie (Atlassian): cheaper, with tight Jira integration. For small teams, PagerDuty's free tier covers up to 5 users.

Alertmanager HA?

Clustered mode: 3+ instances gossip notification state to each other. Without HA, a single Alertmanager outage means missed alerts. Run 3 replicas.
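A minimal sketch of a 3-replica cluster using Alertmanager's gossip flags; service names are assumptions, and 9094 is the conventional cluster port:

```yaml
# docker-compose.yml fragment (illustrative)
services:
  am-1:
    image: prom/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --cluster.listen-address=0.0.0.0:9094
      - --cluster.peer=am-2:9094
      - --cluster.peer=am-3:9094
  # am-2 and am-3 mirror am-1 with the peer list adjusted
```

Point Prometheus at all replicas under alerting.alertmanagers: every peer receives each firing alert, and the cluster deduplicates the resulting notifications.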

Grafana alerting as replacement?

Grafana 9+ ships built-in alerting (Unified Alerting). It is the simpler choice for Grafana Cloud users; Prometheus + Alertmanager remains the standard for self-hosted stacks.

Enterno integration?

<a href="/en/monitors">Enterno uptime monitoring</a> can send alerts to PagerDuty, Slack, and Telegram. For OpenTelemetry-based alerting, Grafana Alerting is the better fit.