On-Call Best Practices

Anatoly Oshmanovsky

Published: 22.06.2026 · ~3 min read

Short answer. Healthy on-call rests on three pillars: a reasonable load (no more than one meaningful incident per shift), clear escalation (who is paged next if the first responder does not answer), and a constant fight against alert noise. The responder should be paged only for things that genuinely need immediate human action; otherwise burnout sets in and people start ignoring the signals.

What makes on-call healthy

On-call is not a punishment but part of engineering culture. A good process respects a person's time and sleep.

If a responder is woken for an alert they can do nothing about right now, that is not an alert — it is noise. Noise destroys trust in monitoring.

Rotation and load

At least 6–8 people in rotation so a shift lands no more than once every 1.5–2 months.
Seven-day shifts — weekly rotation handles context handover better than daily.
Compensation — time off or extra pay for night pages.
No more than 2 pages per shift as a health benchmark.
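The rotation numbers above follow from simple arithmetic, sketched below. The `Rotation` class and `is_healthy_shift` helper are hypothetical names introduced for illustration; the seven-day shift and the 2-page benchmark come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rotation:
    people: int          # engineers in the rotation
    shift_days: int = 7  # weekly shifts, as recommended above

    def days_between_shifts(self) -> int:
        """Days from the start of one shift to the start of the same person's next one."""
        return self.people * self.shift_days

    def shifts_per_year(self) -> float:
        """Approximate number of shifts each person takes per year."""
        return 365 / self.days_between_shifts()


def is_healthy_shift(pages: int, max_pages: int = 2) -> bool:
    """Compare a shift against the benchmark of at most 2 pages."""
    return pages <= max_pages


if __name__ == "__main__":
    for size in (4, 6, 8):
        rotation = Rotation(people=size)
        print(f"{size} people: a shift every {rotation.days_between_shifts()} days")
```

With six people a shift comes around every 42 days (about 1.5 months) and with eight every 56 days (just under 2 months), which matches the range given above. A four-person rotation puts each engineer on call every 28 days.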

Escalation levels

Escalation guarantees an incident does not get stuck on an unavailable person. A baseline template:

Escalation policy "production-api":

Level 1: On-call engineer
  → wait 5 minutes for ack

Level 2 (no ack in 5 min): Secondary on-call
  → wait 5 minutes for ack

Level 3 (no ack in 10 min): Team lead + manager
  → notify leadership, declare a major incident

Channels: push → SMS → phone call (increasing urgency)
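As a rough illustration, the template above can be expressed as data plus one lookup function. This is a minimal sketch, not the configuration format of any particular paging tool; `Level`, `POLICY` and `level_at` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Level:
    name: str
    ack_timeout_min: Optional[int]  # None marks the final level (no further escalation)


# Mirrors the "production-api" template: 5 minutes per level before moving on.
POLICY = (
    Level("On-call engineer", 5),
    Level("Secondary on-call", 5),
    Level("Team lead + manager", None),
)


def level_at(minutes_without_ack: float, policy: Sequence[Level] = POLICY) -> Level:
    """Return the level that should be paged after this much unacknowledged time."""
    elapsed = 0
    for level in policy:
        if level.ack_timeout_min is None:
            return level  # last level: stays responsible until someone acks
        elapsed += level.ack_timeout_min
        if minutes_without_ack < elapsed:
            return level
    return policy[-1]
```

Here `level_at(3)` returns the on-call engineer, `level_at(5)` the secondary, and `level_at(10)` the team lead and manager, so the third level is reached after 10 minutes in total, as in the template.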

Fighting alert fatigue

Alert fatigue is the main enemy of on-call. Cut noise systematically:

Every alert must be actionable — demanding a concrete human action.
Group related alerts into one incident instead of a hundred notifications.
Alerts that "can wait until morning" go to tickets, not to a pager.
Review regularly: remove alerts that never once led to action.
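The first three rules can be sketched as a small routing function. Everything here is illustrative: the `Alert` fields, the `"page"`/`"ticket"`/`"drop"` labels, and grouping by service are assumptions, since the right grouping key depends on your system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    actionable: bool  # does it demand a concrete human action?
    can_wait: bool    # could it wait until morning?


def route(alert: Alert) -> str:
    """Decide where an alert goes: pager, ticket queue, or nowhere."""
    if not alert.actionable:
        return "drop"    # noise: a candidate for removal at the next review
    if alert.can_wait:
        return "ticket"  # handled during working hours
    return "page"


def group_pages(alerts: Iterable[Alert]) -> Dict[str, List[str]]:
    """Collapse pageable alerts into one incident per service (assumed grouping key)."""
    incidents: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        if route(alert) == "page":
            incidents[alert.service].append(alert.name)
    return dict(incidents)
```

Two urgent alerts from the same service end up in one incident, so the responder gets one page with both names attached instead of two separate notifications.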

What a responder should have on hand

Artifact	Why
Runbook	Step-by-step actions for common failures
Dashboards	Quick view of the golden signals
Escalation contacts	Who to call if you cannot cope
Access	Logs, prod, kill-switch — granted in advance

To learn how to build a runbook, see our runbook guide.

How enterno.io supports on-call

enterno.io delivers alerts across several channels: Telegram, Slack, email, webhook, plus direct integration with PagerDuty and Jira where your escalation policies already live. External (synthetic) HTTP, SSL, Ping and DNS checks run every minute or every 30 seconds, multi-region from Russia, Europe and the US — which reduces false pages caused by local network glitches.
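To see why multi-region checks cut false pages, consider a generic quorum rule: page only when several regions agree that the target is down. This sketch is not a description of how enterno.io confirms failures; the `should_page` function, the region names, and the threshold of two failing regions are all assumptions made for illustration.

```python
from typing import Mapping


def should_page(results: Mapping[str, bool], min_failing: int = 2) -> bool:
    """results maps a region name to True if the check passed from that region.

    Page only when at least `min_failing` regions report a failure, so that
    a network glitch local to one region does not wake anyone up.
    """
    failing = [region for region, ok in results.items() if not ok]
    return len(failing) >= min_failing
```

With results `{"ru": False, "eu": True, "us": True}` the function returns `False`, because only one region sees a failure. If two of the three regions fail, it returns `True`.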

Spin up monitors, publish a status page for transparency, and enable heartbeat monitoring for cron and background jobs.

FAQ

What is the minimum number of people for a rotation?

A genuinely sustainable rotation starts at 6 people. With fewer, shifts come around too often, which leads to burnout.

How do I reduce night pages?

Make alerts actionable, group them, and route non-urgent ones to tickets. Multi-region checks reduce false pages.

Do I need a secondary on-call?

Yes, at least as an escalation tier. If the primary responder does not acknowledge within 5 minutes, the incident should automatically escalate to the secondary.

Who owns alert quality?

The team that owns the service. Reviewing noisy alerts at the end-of-shift retrospective works well.
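One possible input for such a retrospective is a list of alerts that fired during the review period but never led to action. The sketch below assumes you can export the paging history as `(alert_name, led_to_action)` pairs; that export format and the `never_actioned` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def never_actioned(history: Iterable[Tuple[str, bool]]) -> List[str]:
    """Return the names of alerts that fired but never once led to action."""
    fired: Counter = Counter()
    acted: Counter = Counter()
    for name, led_to_action in history:
        fired[name] += 1
        if led_to_action:
            acted[name] += 1
    # Counter returns 0 for missing keys, so alerts never acted on compare equal to 0.
    return sorted(name for name in fired if acted[name] == 0)
```

Alerts on the returned list are candidates for removal or for rerouting to tickets. The decision still belongs to the team that owns the service.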

Set up reliable alert delivery. Connect channels and monitors at enterno.io/monitors so responders only get the signals that matter.
