Short answer. A runbook is a step-by-step playbook a responder follows to fix a specific failure without learning the system from scratch at three in the morning. A good runbook starts from a symptom ("alert X fired"), gives checks to confirm it, concrete commands to fix it and a condition for when to escalate. The goal is to turn panic into a checklist.
Why a runbook matters
During an incident a responder has no time to read architecture docs. A runbook provides a ready-made algorithm for the most common scenarios.
A runbook is not a substitute for understanding the system but a way to act fast and consistently when there is no time to think. The best runbook reads in a minute and applies without questions.
The structure of a good runbook
- Symptom / trigger — which alert or complaint starts this runbook.
- Diagnosis — how to confirm this is the right problem.
- Remediation — concrete steps and commands.
- Verification — how to confirm it is fixed.
- Escalation — when and to whom to hand it off.
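The five sections above map naturally onto a small data structure. The following is a minimal sketch, not a prescribed format: the `Runbook` class, its field names and the `render` method are hypothetical names chosen for illustration, and the `== SECTION ==` layout simply mirrors the plain-text template used in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One scenario, five sections (hypothetical structure for illustration)."""
    title: str
    symptom: str
    diagnosis: list[str] = field(default_factory=list)
    remediation: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    escalation: str = ""

    def render(self) -> str:
        """Render the runbook as plain text, one '== SECTION ==' block per part."""
        def numbered(steps: list[str]) -> list[str]:
            return [f"{i}. {step}" for i, step in enumerate(steps, start=1)]

        lines = [f"RUNBOOK: {self.title}", "== SYMPTOM ==", self.symptom]
        lines += ["== DIAGNOSIS ==", *numbered(self.diagnosis)]
        lines += ["== REMEDIATION ==", *numbered(self.remediation)]
        lines += ["== VERIFICATION ==", *numbered(self.verification)]
        lines += ["== ESCALATION ==", self.escalation]
        return "\n".join(lines)
```

Keeping runbooks as structured data rather than free text makes it easier to review them for missing sections, at the cost of one more format to maintain.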
Runbook template
RUNBOOK: High 5xx error rate on the API
== SYMPTOM ==
Alert "api-5xx-rate" fired: 5xx rate > 5% over 5 minutes.
== DIAGNOSIS ==
1. Check the golden-signals dashboard (latency, errors).
2. See whether a deploy happened:
kubectl rollout history deployment/api
3. Check pod state:
kubectl get pods -l app=api
== REMEDIATION ==
IF the cause is a recent deploy:
kubectl rollout undo deployment/api
IF the cause is DB overload:
# shed load on the DB: scale the API down until the DB recovers
# (assumes the deployment normally runs more than 2 replicas)
kubectl scale deployment/api --replicas=2
== VERIFICATION ==
1. The 5xx ratio is back under 1% over 5 minutes.
2. p99 latency is normal.
== ESCALATION ==
If not stabilised within 15 minutes: call the team lead (L2),
declare a major incident, and schedule a postmortem for after recovery.
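The numbers in the template (alert above 5%, recovery under 1%, escalation after 15 minutes) can be written down as a small decision function. This is a sketch under stated assumptions: the function and constant names are hypothetical, the request counts are assumed to come from whatever metrics system backs the dashboard, and the p99 latency check from the verification step is left out for brevity.

```python
from datetime import datetime, timedelta

ALERT_THRESHOLD = 0.05                   # alert fires above 5% 5xx
RECOVERY_THRESHOLD = 0.01                # fix is verified under 1% 5xx
ESCALATE_AFTER = timedelta(minutes=15)   # hand off if not stabilised by then


def error_ratio(total_requests: int, errors_5xx: int) -> float:
    """Share of 5xx responses in a window; 0.0 when there was no traffic."""
    if total_requests <= 0:
        return 0.0
    return errors_5xx / total_requests


def alert_fires(ratio: float) -> bool:
    """True when the 5-minute 5xx ratio is strictly above the alert threshold."""
    return ratio > ALERT_THRESHOLD


def next_action(ratio: float, started_at: datetime, now: datetime) -> str:
    """Decide what the responder does next, given the current 5-minute ratio."""
    if ratio < RECOVERY_THRESHOLD:
        return "verified"
    if now - started_at >= ESCALATE_AFTER:
        return "escalate"
    return "keep-remediating"
```

For example, a responder 20 minutes into the incident with a 3% error ratio gets `"escalate"`, while the same ratio at minute 5 gets `"keep-remediating"`.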
A checklist for every runbook
- Does it start from a concrete symptom or alert?
- Can the steps be run without knowing the whole system?
- Are all commands current and tested?
- Is there an explicit escalation condition?
- Does it say how to verify the fix?
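Part of this checklist can be checked mechanically. The sketch below assumes runbooks are stored as plain text with `== SECTION ==` headers like the template in this article; `split_sections` and `lint_runbook` are hypothetical helper names. It only covers the structural questions (sections present, escalation has a time limit); whether the commands are current and tested still needs a human or a real dry run.

```python
import re

REQUIRED_SECTIONS = ("SYMPTOM", "DIAGNOSIS", "REMEDIATION", "VERIFICATION", "ESCALATION")


def split_sections(text: str) -> dict[str, str]:
    """Map each '== NAME ==' header to the text that follows it."""
    parts = re.split(r"^== (\w+) ==[ \t]*$", text, flags=re.MULTILINE)
    # parts is [preamble, name1, body1, name2, body2, ...]
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}


def lint_runbook(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    sections = split_sections(text)
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if not sections.get(name)
    ]
    escalation = sections.get("ESCALATION", "")
    # An explicit escalation condition should mention a time limit.
    has_time_limit = re.search(
        r"\b\d+\s*(?:min|minutes?|hours?)\b", escalation, re.IGNORECASE
    )
    if escalation and not has_time_limit:
        problems.append("escalation has no explicit time limit")
    return problems
```

A check like this could run in CI on the runbook repository, so that an incomplete runbook is caught at review time rather than during an incident.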
Common mistakes
| Mistake | How to fix |
|---|---|
| Stale commands | Review after every infrastructure change |
| Steps too vague | Concrete commands instead of "restart the service" |
| No escalation condition | Explicit timeout and L2 contact |
| A 10-page runbook | One scenario — one short runbook |
How enterno.io fits into a runbook
Many runbooks start with "confirm the service is really down from the outside". As external (synthetic) monitoring, enterno.io answers that: HTTP, SSL, Ping and DNS checks from Russia, Europe and the US show whether the problem is visible to users or just a local glitch. Alerts arrive via Telegram, Slack, email, webhook, PagerDuty and Jira — right at the start of the runbook.
Connect monitors to confirm the outage, publish a status page, and enable heartbeat monitoring for cron jobs and queues. For organising on-call shifts, see our on-call article.
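The "visible to users or just a local glitch" decision can be sketched as a pure function over per-location check results. This is an illustration only, not the enterno.io API: the `classify_outage` function, the location keys and the result labels are all hypothetical, and the input is assumed to be whatever pass/fail results your external monitor reports.

```python
def classify_outage(results: dict[str, bool]) -> str:
    """Classify an incident from external probe results.

    `results` maps a probe location (hypothetical keys such as "ru", "eu",
    "us") to True if the check passed from there and False if it failed.
    """
    if not results:
        return "unknown"
    failed = [location for location, ok in results.items() if not ok]
    if not failed:
        # Every external probe passes: likely a local or internal-only problem.
        return "not-visible-externally"
    if len(failed) == len(results):
        return "global-outage"
    return "partial-outage"
```

In a runbook this would sit in the diagnosis step: a `"not-visible-externally"` result points the responder at internal monitoring, while the other two confirm user impact.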
FAQ
How is a runbook different from a postmortem?
A runbook is the instruction before and during an incident — how to fix it. A postmortem is the review after — why it happened and how to avoid a repeat.
How many runbooks does a team need?
One per common actionable alert. Many short runbooks beat a single huge document.
How do I keep runbooks current?
Review after every incident and every infrastructure change. A stale runbook is more dangerous than none.
Can runbooks be automated?
Yes, partly: repetitive steps become scripts, while the runbook leaves context-dependent decisions to the human.
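A minimal sketch of that split, assuming the `kubectl` commands from the template above and a cluster context that is already configured: the diagnostic commands run automatically, while the rollback waits for an explicit human confirmation. The names `assisted_rollback`, `run` and `ask` are hypothetical, and the `runner` and `confirm` parameters exist so the flow can be exercised without a real cluster.

```python
import subprocess
from typing import Callable

# Commands taken from the runbook template; read-only diagnostics first.
DIAGNOSTICS = [
    ["kubectl", "rollout", "history", "deployment/api"],
    ["kubectl", "get", "pods", "-l", "app=api"],
]
ROLLBACK = ["kubectl", "rollout", "undo", "deployment/api"]


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def ask(question: str) -> bool:
    """Ask the responder on the terminal; anything but 'y' means no."""
    return input(f"{question} [y/N] ").strip().lower() == "y"


def assisted_rollback(
    runner: Callable[[list[str]], str] = run,
    confirm: Callable[[str], bool] = ask,
) -> bool:
    """Run diagnostics automatically, then roll back only if a human agrees."""
    for cmd in DIAGNOSTICS:
        print(f"$ {' '.join(cmd)}")
        print(runner(cmd))
    if not confirm("Roll back deployment/api?"):
        return False
    print(runner(ROLLBACK))
    return True
```

The script gathers the evidence, and the responder still decides whether a recent deploy is actually the cause before anything is changed.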
Add an external check to your runbooks. Create monitors at enterno.io/monitors so the first diagnostic step takes seconds.