Incident Runbook Guide

Short answer. A runbook is a step-by-step playbook a responder follows to fix a specific failure without learning the system from scratch at three in the morning. A good runbook starts from a symptom ("alert X fired"), gives checks to confirm it, concrete commands to fix it and a condition for when to escalate. The goal is to turn panic into a checklist.

Why a runbook matters

During an incident a responder has no time to read architecture docs. A runbook provides a ready-made algorithm for the most common scenarios.

A runbook does not replace understanding the system; it is a way to act fast and consistently when there is no time to think. The best runbooks can be read in a minute and followed without questions.

The structure of a good runbook

  • Symptom / trigger — which alert or complaint starts this runbook.
  • Diagnosis — how to confirm this is the right problem.
  • Remediation — concrete steps and commands.
  • Verification — how to confirm it is fixed.
  • Escalation — when and to whom to hand it off.

Runbook template

RUNBOOK: High 5xx error rate on the API

== SYMPTOM ==
Alert "api-5xx-rate" > 5% over 5 minutes.

== DIAGNOSIS ==
1. Check the golden-signals dashboard (latency, traffic, errors, saturation).
2. See whether a deploy happened:
     kubectl rollout history deployment/api
3. Check pod state:
     kubectl get pods -l app=api

== REMEDIATION ==
IF the cause is a recent deploy:
     kubectl rollout undo deployment/api
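IF the cause is DB overload:
     # --replicas sets an absolute count, not an increment: check the
     # current count first (kubectl get deployment api) and raise it
     # from there. The 2 below is only an example value. Extra API pods
     # also open more DB connections, so watch DB load after scaling.
     kubectl scale deployment/api --replicas=2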

== VERIFICATION ==
1. The 5xx ratio is back under 1% over 5 minutes.
2. p99 latency is normal.

== ESCALATION ==
If not stabilised within 15 minutes: call the team lead (L2) and
declare a major incident. Schedule the postmortem once it is resolved.
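
The two thresholds in the template (above 5% to trigger, under 1% to verify) can be turned into a small check. The following is a minimal sketch, not part of the original runbook: the function name `classify_5xx` and the labels `degraded` and `no-traffic` are hypothetical, and it assumes the responder already has the 5xx count and the total request count for one 5-minute window from whatever metrics source is in use.

```shell
#!/usr/bin/env bash
# Classify a 5xx ratio against the thresholds in the runbook template.
# Arguments: <5xx count> <total request count> for one 5-minute window.
# Where the counts come from (metrics API, access logs) is left open.
classify_5xx() {
  local errors="$1" total="$2"
  # Hypothetical convention: with zero traffic no ratio can be computed,
  # so report that separately instead of dividing by zero.
  if [ "$total" -eq 0 ]; then
    echo "no-traffic"
    return
  fi
  awk -v e="$errors" -v t="$total" 'BEGIN {
    pct = 100 * e / t
    if (pct > 5)      print "alerting"    # SYMPTOM: above 5%
    else if (pct < 1) print "recovered"   # VERIFICATION: under 1%
    else              print "degraded"    # in between: keep watching
  }'
}
```

For example, `classify_5xx 60 1000` reports `alerting`, while `classify_5xx 5 1000` reports `recovered`. The band between 1% and 5% is deliberate: the alert has stopped firing, but the runbook's own verification condition is not met yet.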

A checklist for every runbook

  1. Does it start from a concrete symptom or alert?
  2. Can the steps be run without knowing the whole system?
  3. Are all commands current and tested?
  4. Is there an explicit escalation condition?
  5. Does it say how to verify the fix?
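
Part of this checklist can be checked mechanically. The sketch below assumes runbooks are stored as text files that use the `== NAME ==` section headings from the template; the function name `missing_sections` is hypothetical. It only confirms that the five sections exist. Whether the commands are current and tested still needs a human review.

```shell
#!/usr/bin/env bash
# Print the name of every required section that a runbook file lacks.
# Assumes the "== NAME ==" heading convention from the template.
missing_sections() {
  local file="$1" section
  for section in SYMPTOM DIAGNOSIS REMEDIATION VERIFICATION ESCALATION; do
    # -x matches the whole line, -F treats the heading as a literal string
    grep -qxF "== ${section} ==" "$file" || echo "$section"
  done
}
```

Running `missing_sections api-5xx.txt` (a hypothetical file name) prints nothing for a complete runbook and one section name per line otherwise, which makes it easy to wire into a review step.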

Common mistakes

Mistake | How to fix
Stale commands | Review after every infrastructure change
Steps too vague | Concrete commands instead of "restart the service"
No escalation condition | Explicit timeout and L2 contact
A 10-page runbook | One scenario, one short runbook

How enterno.io fits into a runbook

Many runbooks start with "confirm the service is really down from the outside". As external (synthetic) monitoring, enterno.io answers that: HTTP, SSL, Ping and DNS checks from Russia, Europe and the US show whether the problem is visible to users or just a local glitch. Alerts arrive via Telegram, Slack, email, webhook, PagerDuty and Jira — right at the start of the runbook.

Connect monitors to confirm the outage, publish a status page, and enable heartbeat for cron and queues. For shifts, see our on-call article.
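
For comparison, a responder can run a rough manual version of the "is it down from the outside" step with plain `curl`. This is a generic sketch and not enterno.io's API: the function names `probe` and `interpret_status`, the verdict labels and the example URL are all hypothetical. It checks from a single machine, so unlike multi-region monitoring it cannot tell a local network glitch from a user-visible outage.

```shell
#!/usr/bin/env bash
# Fetch only the HTTP status code of a URL.
# curl prints 000 when no HTTP response arrives (timeout, DNS failure, refused).
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Map a status code to a coarse verdict for the first diagnostic step.
interpret_status() {
  case "$1" in
    000)     echo "unreachable" ;;
    5??)     echo "server-error" ;;
    2??|3??) echo "up" ;;
    *)       echo "check-manually" ;;   # 4xx and anything unexpected
  esac
}

# Usage (hypothetical URL):
#   interpret_status "$(probe https://api.example.com/health)"
```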

FAQ

How is a runbook different from a postmortem?

A runbook is the instruction before and during an incident — how to fix it. A postmortem is the review after — why it happened and how to avoid a repeat.

How many runbooks does a team need?

One per common actionable alert. Many short runbooks beat a single huge document.

How do I keep runbooks current?

Review after every incident and every infrastructure change. A stale runbook is more dangerous than none.

Can runbooks be automated?

Yes, partly: repetitive steps become scripts, while the runbook leaves context-dependent decisions to the human.
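
One way to split the work between script and human is a confirmation gate: the script prepares the exact command, and the responder decides whether to run it. The sketch below is an illustration under that assumption; the helper name `run_with_confirmation` is hypothetical.

```shell
#!/usr/bin/env bash
# Show the command, then run it only if the responder answers "y".
# Any other answer, or no input at all, skips the command.
run_with_confirmation() {
  local answer
  printf 'Run: %s ? [y/N] ' "$*" >&2
  read -r answer || answer=""
  if [ "$answer" = "y" ]; then
    "$@"
  else
    echo "skipped: $*"
  fi
}

# Usage (the rollback command from the runbook template):
#   run_with_confirmation kubectl rollout undo deployment/api
```

The prompt goes to stderr so that the command's own output stays clean if it is captured or logged.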

Add an external check to your runbooks. Create monitors at enterno.io/monitors so the first diagnostic step takes seconds.
