On-Call Best Practices

Short answer. Healthy on-call rests on three pillars: a reasonable load (no more than one meaningful incident per shift), clear escalation (who is next if the first responder does not answer) and a war on alert noise. The responder should react only to things that genuinely need immediate human action — otherwise burnout sets in and people start ignoring the signals.

What makes on-call healthy

On-call is not a punishment but part of engineering culture. A good process respects a person's time and sleep.

If a responder is woken for an alert they can do nothing about right now, that is not an alert — it is noise. Noise destroys trust in monitoring.

Rotation and load

  • At least 6–8 people in rotation so a shift lands no more than once every 1.5–2 months.
  • Seven-day shifts — weekly rotation handles context handover better than daily.
  • Compensation — time off or extra pay for night pages.
  • No more than 2 pages per shift as a health benchmark.
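The rotation arithmetic above can be checked with a short calculation. This is a minimal Python sketch, assuming shifts are handed over back to back with no gaps; the function names are illustrative and do not come from any tool.

```python
def weeks_between_shifts(people: int, shift_days: int = 7) -> float:
    """Weeks from the start of one shift to the start of the same person's next one."""
    if people < 1 or shift_days < 1:
        raise ValueError("people and shift_days must be positive")
    # Everyone takes one shift per full cycle, so the cycle length is people * shift_days.
    return people * shift_days / 7


def is_healthy_shift(pages: int, max_pages: int = 2) -> bool:
    """Compare a shift's page count with the 2-pages-per-shift benchmark."""
    return pages <= max_pages


if __name__ == "__main__":
    for people in (4, 6, 8):
        print(f"{people} people: on call every {weeks_between_shifts(people):.0f} weeks")
```

With 6 to 8 people on weekly shifts, each person is on call once every 6 to 8 weeks, which is roughly the 1.5–2 months quoted above. A team of 4 would be on call every fourth week.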

Escalation levels

Escalation guarantees an incident does not get stuck on an unavailable person. A baseline template:

Escalation policy "production-api":

Level 1: On-call engineer
  → wait 5 minutes for ack

Level 2 (no ack in 5 min): Secondary on-call
  → wait 5 minutes for ack

Level 3 (no ack in 10 min total, i.e. 5 min at each previous level): Team lead + manager
  → notify leadership, declare a major incident

Channels: push → SMS → phone call (increasing urgency)
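The template can also be read as a timeline: given how long an incident has gone without an ack, which level should be paged? The sketch below models that under the assumption that timeouts are cumulative, so Level 3 is reached 10 minutes after the first page. The `Level` class and `level_at` function are hypothetical names for illustration, not the configuration format of any paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    ack_timeout_min: int  # minutes to wait for an ack before escalating


# Mirrors the "production-api" template above.
POLICY = (
    Level("On-call engineer", 5),
    Level("Secondary on-call", 5),
    Level("Team lead + manager", 0),  # last level: nobody left to escalate to
)


def level_at(minutes_without_ack: float, policy: tuple = POLICY) -> Level:
    """Return the level that should hold the page after this much unacknowledged time."""
    elapsed = 0
    for level in policy[:-1]:
        elapsed += level.ack_timeout_min
        if minutes_without_ack < elapsed:
            return level
    return policy[-1]


if __name__ == "__main__":
    for minutes in (0, 5, 10):
        print(minutes, "min ->", level_at(minutes).name)
```

Keeping the timeouts on the levels rather than hard-coding absolute minutes means a level can be added or a wait shortened without recomputing the rest of the chain.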

Fighting alert fatigue

Alert fatigue is the main enemy of on-call. Cut noise systematically:

  1. Every alert must be actionable — demanding a concrete human action.
  2. Group related alerts into one incident instead of a hundred notifications.
  3. Alerts that "can wait until morning" go to tickets, not to a pager.
  4. Review regularly: remove alerts that never once led to action.
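The four rules above can be sketched as a small routing function. This is an illustration only: the `Alert` fields, the choice to group by service, and the `route` function are assumptions made for the example, and real alerting tools express the same ideas through their own routing and grouping settings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    symptom: str
    urgent: bool             # True if a human must act right now
    actionable: bool = True  # False if nobody can do anything about it


def route(alerts):
    """Split alerts into pager incidents, tickets and candidates for removal."""
    incidents = defaultdict(list)  # one incident per service, not one page per alert
    tickets = []                   # "can wait until morning"
    review = []                    # non-actionable: discuss and likely delete
    for alert in alerts:
        if not alert.actionable:
            review.append(alert)
        elif alert.urgent:
            incidents[alert.service].append(alert)
        else:
            tickets.append(alert)
    return dict(incidents), tickets, review


if __name__ == "__main__":
    incidents, tickets, review = route([
        Alert("api", "5xx rate high", urgent=True),
        Alert("api", "latency high", urgent=True),
        Alert("db", "disk 80% full", urgent=False),
        Alert("api", "brief CPU spike", urgent=False, actionable=False),
    ])
    print(len(incidents), "incident(s),", len(tickets), "ticket(s),", len(review), "to review")
```

In this example two urgent API alerts collapse into a single incident, the disk warning becomes a ticket, and the CPU spike lands on the review list instead of the pager.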

More in our alerting best practices.

What a responder should have on hand

Artifact            | Why
Runbook             | Step-by-step actions for common failures
Dashboards          | Quick view of the golden signals
Escalation contacts | Who to call if you cannot cope
Access              | Logs, prod, kill-switch (granted in advance)

For how to build a runbook, see our runbook guide.

How enterno.io supports on-call

enterno.io delivers alerts across several channels: Telegram, Slack, email, webhook, plus direct integration with PagerDuty and Jira where your escalation policies already live. External (synthetic) HTTP, SSL, Ping and DNS checks run every minute or every 30 seconds, multi-region from Russia, Europe and the US — which reduces false pages caused by local network glitches.
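To see why multi-region checks cut false pages, consider a simple quorum rule: page only when several regions agree that the check failed. The sketch below is a generic illustration, not a description of how enterno.io makes this decision; the `should_page` function, the region keys and the default quorum of 2 are all assumptions.

```python
def should_page(results: dict, quorum: int = 2) -> bool:
    """Page only when at least `quorum` regions report a failed check.

    `results` maps a region name to True if the check passed from that region.
    """
    failures = sum(1 for passed in results.values() if not passed)
    return failures >= quorum


if __name__ == "__main__":
    # One region failing alone looks like a local network glitch.
    print(should_page({"ru": False, "eu": True, "us": True}))
    # Two regions failing together is more likely a real outage.
    print(should_page({"ru": False, "eu": False, "us": True}))
```

A single-region failure is still worth recording, for example as a ticket, since it can point to a routing or DNS problem that affects only some users.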

Spin up monitors, publish a status page for transparency, and enable heartbeat for cron and background jobs.
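A heartbeat check inverts the usual model: the job reports in after each run, and the alert fires when a report is overdue. The sketch below shows only the overdue test, assuming the job is expected once per `interval` with a tolerance of `grace`; the function name and the 5-minute default grace period are illustrative and are not taken from enterno.io.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(
    last_ping: datetime,
    now: datetime,
    interval: timedelta,
    grace: timedelta = timedelta(minutes=5),
) -> bool:
    """True if the job has not reported within its interval plus the grace period."""
    return now - last_ping > interval + grace


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
    hourly = timedelta(hours=1)
    print(heartbeat_missed(last, last + timedelta(minutes=62), hourly))  # within grace
    print(heartbeat_missed(last, last + timedelta(minutes=66), hourly))  # overdue
```

The grace period keeps a job that finishes a little late from paging anyone, in line with the rule that only actionable signals should reach the responder.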

FAQ

What is the minimum number of people for a rotation?

A genuinely sustainable rotation starts at 6 people. Fewer, and shifts land too often, which leads to burnout.

How do I reduce night pages?

Make alerts actionable, group them, and route non-urgent ones to tickets. Multi-region checks reduce false pages.

Do I need a secondary on-call?

Yes, at least as an escalation tier. If the primary responder does not answer within 5 minutes, the incident should automatically move on.

Who owns alert quality?

The team that owns the service. Reviewing noisy alerts at the end-of-shift retrospective works well.

Set up reliable alert delivery. Connect channels and monitors at enterno.io/monitors so responders only get the signals that matter.
