Incident Runbook Guide

Anatoly Oshmanovsky

Published: 22.06.2026 · ~3 min read

Short answer. A runbook is a step-by-step playbook a responder follows to fix a specific failure without learning the system from scratch at three in the morning. A good runbook starts from a symptom ("alert X fired"), gives checks to confirm it, concrete commands to fix it and a condition for when to escalate. The goal is to turn panic into a checklist.

Why a runbook matters

During an incident a responder has no time to read architecture docs. A runbook provides a ready-made algorithm for the most common scenarios.

A runbook is not a substitute for understanding the system; it is a way to act fast and consistently when there is no time to think. The best runbooks can be read in a minute and followed without questions.

The structure of a good runbook

Symptom / trigger — which alert or complaint starts this runbook.
Diagnosis — how to confirm this is the right problem.
Remediation — concrete steps and commands.
Verification — how to confirm it is fixed.
Escalation — when and to whom to hand it off.
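
To make the five sections concrete, here is a minimal Python sketch that models a runbook as a record and reports which sections are still empty. The class name `Runbook`, its field names, and the `missing_sections` and `render` helpers are illustrative assumptions, not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """One runbook for one scenario (hypothetical model of the five sections)."""
    title: str
    symptom: str        # which alert or complaint starts this runbook
    diagnosis: str      # how to confirm this is the right problem
    remediation: str    # concrete steps and commands
    verification: str   # how to confirm it is fixed
    escalation: str     # when and to whom to hand it off

    def missing_sections(self) -> list[str]:
        """Return the names of fields that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render as plain text with '== SECTION ==' headers."""
        parts = [f"RUNBOOK: {self.title}"]
        for f in fields(self):
            if f.name == "title":
                continue
            parts.append(f"== {f.name.upper()} ==\n{getattr(self, f.name).strip()}")
        return "\n\n".join(parts) + "\n"


draft = Runbook(
    title="High 5xx error rate on the API",
    symptom='Alert "api-5xx-rate" > 5% over 5 minutes.',
    diagnosis="kubectl get pods -l app=api",
    remediation="kubectl rollout undo deployment/api",
    verification="The 5xx ratio is back under 1% over 5 minutes.",
    escalation="",  # deliberately left empty to show the check
)
print(draft.missing_sections())  # ['escalation']
```

Keeping the sections as required fields means an incomplete runbook is visible before it is published, rather than during an incident.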

Runbook template

RUNBOOK: High 5xx error rate on the API

== SYMPTOM ==
Alert "api-5xx-rate" > 5% over 5 minutes.

== DIAGNOSIS ==
1. Check the golden-signals dashboard (latency, errors).
2. See whether a deploy happened:
     kubectl rollout history deployment/api
3. Check pod state:
     kubectl get pods -l app=api

== REMEDIATION ==
IF the cause is a recent deploy:
     kubectl rollout undo deployment/api
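IF the cause is DB overload:
     # adjust the API replica count step by step until recovery;
     # 2 is an example value, so check the current count first:
     #   kubectl get deployment/api
     kubectl scale deployment/api --replicas=2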

== VERIFICATION ==
1. The 5xx ratio is back under 1% over 5 minutes.
2. p99 latency is normal.

== ESCALATION ==
If not stabilised within 15 minutes: call the team lead (L2) and
declare a major incident. Schedule the postmortem once it is resolved.
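
The thresholds in the template (fire above 5%, recover under 1%, escalate after 15 minutes) can be written down as a small decision function. This is a rough sketch: the `Window` shape, the function names, and the returned labels are assumptions for illustration. The p99 latency check from the verification step is left out because the template does not give a number for it.

```python
from dataclasses import dataclass

FIRE_THRESHOLD = 0.05      # symptom: 5xx ratio above 5% over the window
RECOVER_THRESHOLD = 0.01   # verification: 5xx ratio back under 1%
ESCALATE_AFTER_MIN = 15    # escalation: not stabilised within 15 minutes


@dataclass
class Window:
    """Request counts for one 5-minute window (hypothetical shape)."""
    total: int
    errors_5xx: int

    @property
    def ratio(self) -> float:
        # An empty window counts as healthy to avoid dividing by zero.
        return self.errors_5xx / self.total if self.total else 0.0


def alert_fires(window: Window) -> bool:
    """True when the window matches the SYMPTOM section."""
    return window.ratio > FIRE_THRESHOLD


def next_step(window: Window, minutes_since_alert: float) -> str:
    """Map the current window and elapsed time to a runbook decision."""
    if window.ratio < RECOVER_THRESHOLD:
        return "verified"
    if minutes_since_alert >= ESCALATE_AFTER_MIN:
        return "escalate"
    return "keep remediating"
```

Note the gap between the two thresholds: an incident that fired above 5% is not considered fixed at 4%, only once the ratio drops under 1%.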

A checklist for every runbook

Does it start from a concrete symptom or alert?
Can the steps be run without knowing the whole system?
Are all commands current and tested?
Is there an explicit escalation condition?
Does it say how to verify the fix?
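
Part of this checklist can be checked mechanically. The sketch below assumes runbooks are stored as plain text with the `== SECTION ==` headers used in the template above; the function name `lint_runbook` and the vague-phrase pattern are illustrative assumptions. It cannot tell whether commands are current and tested, which still needs a human review.

```python
import re

REQUIRED = ["SYMPTOM", "DIAGNOSIS", "REMEDIATION", "VERIFICATION", "ESCALATION"]
HEADER = re.compile(r"^== (\w+) ==$", re.MULTILINE)
# Illustrative only: extend with phrases your team considers too vague.
VAGUE = re.compile(r"\brestart the service\b", re.IGNORECASE)


def lint_runbook(text: str) -> list[str]:
    """Return a list of checklist problems found in a plain-text runbook."""
    problems = []
    found = HEADER.findall(text)
    for name in REQUIRED:
        if name not in found:
            problems.append(f"missing section: {name}")
    if found and found[0] != "SYMPTOM":
        problems.append("runbook does not start from a symptom")
    for match in VAGUE.finditer(text):
        problems.append(f"vague step: {match.group(0)!r}")
    return problems
```

Running such a check in review or CI catches a missing escalation or verification section before the runbook is needed at night.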

Common mistakes

Mistake	How to fix
Stale commands	Review after every infrastructure change
Steps too vague	Write concrete commands instead of "restart the service"
No escalation condition	State an explicit timeout and an L2 contact
A 10-page runbook	Keep one short runbook per scenario

How enterno.io fits into a runbook

Many runbooks start with "confirm the service is really down from the outside". As external (synthetic) monitoring, enterno.io answers that: HTTP, SSL, Ping and DNS checks from Russia, Europe and the US show whether the problem is visible to users or just a local glitch. Alerts arrive via Telegram, Slack, email, webhook, PagerDuty and Jira — right at the start of the runbook.
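
For illustration, the "is it down from the outside" question reduces to a check like the one below. This is a generic sketch using only the Python standard library, not the enterno.io API. It runs from a single vantage point, so it does not replace checks from several regions. The function name `probe`, the three labels, and the injectable `opener` parameter are assumptions.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Classify one HTTP check as 'up', 'degraded' or 'down'."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, but with a 4xx/5xx status
    except OSError:
        return "down"       # DNS failure, refused connection, timeout, etc.
    if status >= 500:
        return "degraded"   # reachable, but failing on the server side
    return "up"
```

A call such as `probe("https://example.com/health")` (placeholder URL) separates "no answer at all" from "the server answers with errors", which usually point to different remediation branches.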

Connect monitors to confirm the outage, publish a status page, and enable heartbeat checks for cron jobs and queues. For shift planning, see our on-call article.

FAQ

How is a runbook different from a postmortem?

A runbook is the instruction before and during an incident — how to fix it. A postmortem is the review after — why it happened and how to avoid a repeat.

How many runbooks does a team need?

One per common actionable alert. Many short runbooks beat a single huge document.

How do I keep runbooks current?

Review after every incident and every infrastructure change. A stale runbook is more dangerous than none.
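
One way to enforce this rule is to record a last-reviewed date per runbook and compare it with the dates of incidents and infrastructure changes. The sketch below assumes that metadata exists; the function names and the example dates are hypothetical.

```python
from datetime import date


def is_stale(last_reviewed: date, change_dates: list[date]) -> bool:
    """True if any incident or infrastructure change is newer than the review."""
    return any(changed > last_reviewed for changed in change_dates)


def stale_runbooks(reviews: dict[str, date], change_dates: list[date]) -> list[str]:
    """Return the names of runbooks that need a review, sorted alphabetically."""
    return sorted(
        name for name, reviewed in reviews.items() if is_stale(reviewed, change_dates)
    )


# Hypothetical data: two runbooks and one infrastructure change.
reviews = {"api-5xx-rate": date(2026, 1, 10), "db-disk-full": date(2026, 3, 1)}
changes = [date(2026, 2, 1)]
print(stale_runbooks(reviews, changes))  # ['api-5xx-rate']
```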

Can runbooks be automated?

Yes, partly: repetitive steps become scripts, while the runbook leaves context-dependent decisions to the human.
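
A minimal sketch of that split, reusing the kubectl commands from the template above: the diagnosis commands run automatically, while the rollback runs only after a human approves. The function names and the `confirm` and `runner` parameters are assumptions made for illustration.

```python
import subprocess
from typing import Callable

DIAGNOSIS = [
    ["kubectl", "rollout", "history", "deployment/api"],
    ["kubectl", "get", "pods", "-l", "app=api"],
]
ROLLBACK = ["kubectl", "rollout", "undo", "deployment/api"]


def run(cmd: list[str]) -> str:
    """Run one command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def assisted_rollback(
    confirm: Callable[[str], bool],
    runner: Callable[[list[str]], str] = run,
) -> bool:
    """Gather diagnosis output automatically; roll back only on human approval."""
    report = "\n".join(runner(cmd) for cmd in DIAGNOSIS)
    if not confirm(report):
        return False        # the human decided a rollback is not the fix
    runner(ROLLBACK)
    return True
```

In an interactive shell, `confirm` could be as simple as `lambda report: input(report + "\nRoll back? [y/N] ").lower() == "y"`. Passing a fake `runner` makes it possible to rehearse the flow without touching a cluster.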

Add an external check to your runbooks. Create monitors at enterno.io/monitors so the first diagnostic step takes seconds.
