Short answer. Healthy on-call rests on three pillars: a reasonable load (no more than one meaningful incident per shift), clear escalation (who is next if the first responder does not answer) and a war on alert noise. The responder should react only to things that genuinely need immediate human action — otherwise burnout sets in and people start ignoring the signals.
What makes on-call healthy
On-call is not a punishment but part of engineering culture. A good process respects a person's time and sleep.
If a responder is woken for an alert they can do nothing about right now, that is not an alert — it is noise. Noise destroys trust in monitoring.
Rotation and load
- 6–8 people in rotation, so that with weekly shifts each person is on call no more than once every 1.5–2 months.
- Seven-day shifts — weekly rotation handles context handover better than daily.
- Compensation — time off or extra pay for night pages.
- No more than 2 pages per shift as a health benchmark.
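The link between team size and shift frequency is simple arithmetic. The sketch below assumes a plain round-robin rotation with equal-length shifts; the function name and its parameters are illustrative, not part of any tool.

```python
def weeks_between_shifts(team_size: int, shift_length_days: int = 7) -> float:
    """Weeks from the start of one shift to the start of the same person's next one.

    Assumes a plain round-robin rotation with equal-length shifts.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return team_size * shift_length_days / 7


if __name__ == "__main__":
    for size in (3, 6, 8):
        weeks = weeks_between_shifts(size)
        print(f"{size} people: on call once every {weeks:.0f} weeks")
```

With weekly shifts, six to eight people gives one shift every six to eight weeks, which is where the 1.5–2 month figure comes from.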
Escalation levels
Escalation guarantees an incident does not get stuck on an unavailable person. A baseline template:
Escalation policy "production-api":
Level 1: On-call engineer
→ wait 5 minutes for ack
Level 2 (no ack in 5 min): Secondary on-call
→ wait 5 minutes for ack
Level 3 (no ack in 10 min): Team lead + manager
→ notify leadership, declare a major incident
Channels: push → SMS → phone call (increasing urgency)
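The same template can be written down as data, which makes the timing easy to check. This is a minimal sketch, not the configuration format of any paging tool: `Level`, `POLICY` and `levels_to_notify` are hypothetical names, and the thresholds are taken from the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    starts_after_min: int  # minutes without an ack before this level is paged


# Thresholds mirror the "production-api" template: 0, 5 and 10 minutes.
POLICY = (
    Level("On-call engineer", 0),
    Level("Secondary on-call", 5),
    Level("Team lead + manager", 10),
)


def levels_to_notify(minutes_without_ack: float) -> list[str]:
    """Return every level that should have been paged by now."""
    return [
        level.name
        for level in POLICY
        if minutes_without_ack >= level.starts_after_min
    ]


if __name__ == "__main__":
    for minutes in (0, 5, 12):
        print(minutes, levels_to_notify(minutes))
```

Earlier levels stay in the list on purpose: escalating to the secondary does not release the primary from the page.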
Fighting alert fatigue
Alert fatigue is the main enemy of on-call. Cut noise systematically:
- Every alert must be actionable — demanding a concrete human action.
- Group related alerts into one incident instead of a hundred notifications.
- Alerts that "can wait until morning" go to tickets, not to a pager.
- Review regularly: remove alerts that never once led to action.
More in our alerting best practices.
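The routing rules above can be sketched as a small filter. The alert fields used here (`service`, `actionable`, `urgent`) are assumptions made for illustration; real alert payloads will look different, and grouping by service is only one possible grouping key.

```python
from collections import defaultdict


def route(alert: dict) -> str:
    """Decide where one alert goes: 'page', 'ticket', or 'review'."""
    if not alert.get("actionable", False):
        return "review"  # candidate for removal at the next alert review
    return "page" if alert.get("urgent", False) else "ticket"


def group_pages(alerts: list[dict]) -> dict[str, list[dict]]:
    """Collapse page-worthy alerts into one incident per service."""
    incidents = defaultdict(list)
    for alert in alerts:
        if route(alert) == "page":
            incidents[alert["service"]].append(alert)
    return dict(incidents)


if __name__ == "__main__":
    sample = [
        {"service": "api", "actionable": True, "urgent": True},
        {"service": "api", "actionable": True, "urgent": True},
        {"service": "db", "actionable": True, "urgent": False},
        {"service": "cache", "actionable": False, "urgent": True},
    ]
    for service, grouped in group_pages(sample).items():
        print(f"{service}: 1 incident covering {len(grouped)} alerts")
```

In this sample only the two `api` alerts reach the pager, and they arrive as a single incident.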
What a responder should have on hand
| Artifact | Why |
|---|---|
| Runbook | Step-by-step actions for common failures |
| Dashboards | Quick view of the golden signals |
| Escalation contacts | Who to call when you need help |
| Access | Logs, prod, kill-switch — granted in advance |
For how to build a runbook, see our runbook guide.
How enterno.io supports on-call
enterno.io delivers alerts through several channels: Telegram, Slack, email and webhook, plus direct integrations with PagerDuty and Jira, where your escalation policies already live. External (synthetic) HTTP, SSL, Ping and DNS checks run every minute or every 30 seconds from multiple regions (Russia, Europe and the US), which reduces false pages caused by local network glitches.
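Why multi-region checks cut false pages can be shown with a quorum rule: page only when several regions see the failure at once. This is an illustrative sketch and not a description of how enterno.io evaluates checks; the region keys and the `min_failing` threshold are assumptions.

```python
def should_page(results: dict[str, bool], min_failing: int = 2) -> bool:
    """Page only if at least `min_failing` regions report a failed check.

    `results` maps a region name to True (check passed) or False (check failed).
    """
    failing = [region for region, ok in results.items() if not ok]
    return len(failing) >= min_failing


if __name__ == "__main__":
    # One region failing alone looks like a local network glitch.
    print(should_page({"ru": False, "eu": True, "us": True}))
    # Two regions failing together looks like a real outage.
    print(should_page({"ru": False, "eu": False, "us": True}))
```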
Spin up monitors, publish a status page for transparency, and enable heartbeat for cron and background jobs.
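A heartbeat check inverts the usual logic: the job reports in after each run, and silence is the signal. The sketch below shows only the staleness test, under assumed names and an assumed five-minute grace period; it does not reflect the enterno.io API.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(
    last_ping: datetime,
    now: datetime,
    interval: timedelta,
    grace: timedelta = timedelta(minutes=5),
) -> bool:
    """True if the job has been silent for longer than its interval plus a grace period."""
    return now - last_ping > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_ping = now - timedelta(hours=1, minutes=10)
    # An hourly job that last reported 70 minutes ago has missed its heartbeat.
    print(heartbeat_missed(last_ping, now, interval=timedelta(hours=1)))
```

The grace period keeps a job that merely runs a little long from paging anyone.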
FAQ
What is the minimum number of people for a rotation?
A genuinely sustainable rotation starts at 6 people. Fewer, and shifts land too often, which leads to burnout.
How do I reduce night pages?
Make alerts actionable, group them, and route non-urgent ones to tickets. Multi-region checks reduce false pages.
Do I need a secondary on-call?
Yes, at least as an escalation tier. If the primary responder does not acknowledge within 5 minutes, the incident should escalate automatically to the secondary.
Who owns alert quality?
The team that owns the service. Reviewing noisy alerts at the end-of-shift retrospective works well.
Set up reliable alert delivery. Connect channels and monitors at enterno.io/monitors so responders only get the signals that matter.