Incident Postmortem Guide

Anatoly Oshmanovsky

Published: 22.06.2026 · ~4 min read

Short answer. A postmortem is a review document written after an incident whose goal is not to find a culprit but to understand how the system allowed the failure and avoid repeating it. "Blameless" means focusing on processes and systems, not on people. A good postmortem contains a timeline, user impact, root causes and concrete actions with owners and deadlines.

Why blameless

If people are punished for incidents, they start hiding mistakes — and the organisation loses its ability to learn. The blameless approach does the opposite: it rewards honesty.

The person who pressed "the wrong button" is a symptom, not a cause. The cause is that the system let one button trigger an outage with no safeguard.

When to write a postmortem

Any incident that affected users or breached an SLO.
Near-misses that almost became outages.
Pages that required manual responder intervention.
Recurring small failures — even if each one alone is minor.

Postmortem structure

A standard template that is easy to adapt:

POSTMORTEM: 
Date: 2026-06-22
Authors: 
Status: draft / final

== SUMMARY ==
1–2 sentences: what broke and the impact.

== IMPACT ==
- Duration: 14:02–14:47 (45 min)
- Affected: ~30% of API requests returned 503
- SLO: 45 min of downtime against a 43.2 min error budget (budget exceeded by 1.8 min)

== TIMELINE (UTC) ==
14:02  Deploy v2.4.1
14:05  503 errors rise, alert fires
14:09  On-call acknowledges the incident
14:23  Root cause found: DB connection pool exhausted
14:31  Rolled back to v2.4.0
14:47  Metrics normal, incident closed

== ROOT CAUSES ==
1. New code opened a connection per request without returning it to the pool.
2. Load testing did not cover peak traffic.

== WHAT WENT WELL ==
- The alert arrived within 3 minutes.
- The rollback took under 5 minutes.

== ACTION ITEMS ==
[ ] Return connections to the pool  — @ivan  — by Jun 25 (P1)
[ ] Add a peak load test            — @olga  — by Jun 30 (P2)
[ ] Alert on DB pool saturation     — @ivan  — by Jun 27 (P1)
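
The SLO line in the IMPACT section is simple arithmetic, and it is worth checking before the document is finalised. The sketch below assumes that the 43.2 minutes in the template is the budget of a 99.9% time-based availability SLO over a 30-day window (the template does not state the target or the window), and it counts the whole outage as downtime even though only about 30% of requests failed. The function names are illustrative.

```python
from datetime import datetime


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def outage_minutes(start: str, end: str) -> float:
    """Length of an outage given HH:MM start and end times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Assumed SLO: 99.9% over 30 days (not stated in the template).
budget = error_budget_minutes(0.999)
# Outage window taken from the IMPACT section of the template.
spent = outage_minutes("14:02", "14:47")
remaining = budget - spent

print(f"budget={budget:.1f} min, spent={spent:.0f} min, remaining={remaining:.1f} min")
# budget=43.2 min, spent=45 min, remaining=-1.8 min
```

A negative remainder means this single incident used up the whole budget for the window, which is a fact worth stating explicitly in the summary.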

How to find the root cause

Build the timeline from facts and logs, not from memory.
Ask "why?" several times in a row (the 5 Whys method) until you reach a systemic cause.
Distinguish the trigger (the deploy) from the real vulnerability (no pool safeguard).
Look for a chain of conditions, not a single cause — complex outages usually have several.
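
As a rough illustration of building the timeline from recorded facts, the Python sketch below sorts timestamped events and derives a few intervals from them. The events are the ones from the template above. The record format, the function names and the interval labels (time to detect, acknowledge, mitigate) are assumptions made for the example, not a prescribed method.

```python
from datetime import datetime

# Events gathered from deploy logs, alerting and chat, in arbitrary order.
# The (ISO timestamp, description) format is a hypothetical one.
raw_events = [
    ("2026-06-22T14:31:00", "Rolled back to v2.4.0"),
    ("2026-06-22T14:02:00", "Deploy v2.4.1"),
    ("2026-06-22T14:09:00", "On-call acknowledges the incident"),
    ("2026-06-22T14:05:00", "503 errors rise, alert fires"),
    ("2026-06-22T14:47:00", "Metrics normal, incident closed"),
    ("2026-06-22T14:23:00", "Root cause found: DB connection pool exhausted"),
]


def build_timeline(events):
    """Parse timestamps and return the events in chronological order."""
    parsed = [(datetime.fromisoformat(ts), text) for ts, text in events]
    return sorted(parsed, key=lambda event: event[0])


def minutes_between(timeline, start_text, end_text):
    """Minutes elapsed between two events identified by their description."""
    times = {text: ts for ts, text in timeline}
    return (times[end_text] - times[start_text]).total_seconds() / 60


timeline = build_timeline(raw_events)
for ts, text in timeline:
    print(ts.strftime("%H:%M"), text)

time_to_detect = minutes_between(
    timeline, "Deploy v2.4.1", "503 errors rise, alert fires")
time_to_ack = minutes_between(
    timeline, "503 errors rise, alert fires", "On-call acknowledges the incident")
time_to_mitigate = minutes_between(
    timeline, "Deploy v2.4.1", "Rolled back to v2.4.0")

print(time_to_detect, time_to_ack, time_to_mitigate)  # 3.0 4.0 29.0
```

Intervals like these make it easier to see where the time went: here most of it passed between acknowledgement and rollback, which points the "why?" questions at diagnosis rather than at detection.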

What makes good action items

Bad action item	Good action item
"Be more careful"	"Add an alert on DB pool saturation"
No owner	Owner and deadline assigned
No priority	P1/P2 with a clear due date
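
The criteria in the table can be checked mechanically before a postmortem is marked final. The following is a minimal Python sketch: the `ActionItem` fields, the list of vague phrases and the `problems` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Illustrative phrases that signal an intention rather than a concrete change.
VAGUE_PHRASES = ("be more careful", "pay more attention", "try harder")


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item falls in the 'bad' column."""
    issues = []
    if any(phrase in item.title.lower() for phrase in VAGUE_PHRASES):
        issues.append("title is not a concrete change")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no deadline")
    if item.priority not in ("P1", "P2"):
        issues.append("no priority")
    return issues


good = ActionItem("Add an alert on DB pool saturation", "@ivan", date(2026, 6, 27), "P1")
bad = ActionItem("Be more careful")

print(problems(good))  # []
print(problems(bad))
# ['title is not a concrete change', 'no owner', 'no deadline', 'no priority']
```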

How enterno.io helps with postmortems

An accurate timeline starts with accurate data. As an external (synthetic) monitoring service, enterno.io records when an outage began and ended, along with response codes and response times, which together form the backbone of the timeline section. HTTP, SSL, Ping and DNS checks run every minute or every 30 seconds from multiple regions (Russia, Europe and the US), and incidents are opened and closed automatically.

Monitor history and a public status page provide objective timestamps. For background jobs, use heartbeat monitoring. For the response side, see the incident response plan.
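
To show how periodic check results turn into outage timestamps, here is a small Python sketch. It assumes a hypothetical list of `(timestamp, HTTP status)` samples and treats any 5xx status as a failure. This is not enterno.io's export format or API, only an illustration of the idea.

```python
from datetime import datetime


def outage_windows(checks):
    """Group consecutive failing checks into (start, end) windows.

    checks: iterable of (ISO timestamp, HTTP status) pairs in any order.
    A window starts at the first failing check and ends at the first
    passing check after it; end is None if the last check still failed.
    """
    windows = []
    start = None
    for ts, status in sorted((datetime.fromisoformat(t), s) for t, s in checks):
        failed = status >= 500
        if failed and start is None:
            start = ts
        elif not failed and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Hypothetical samples from a synthetic HTTP check.
samples = [
    ("2026-06-22T14:03:00", 200),
    ("2026-06-22T14:04:00", 200),
    ("2026-06-22T14:05:00", 503),
    ("2026-06-22T14:06:00", 503),
    ("2026-06-22T14:46:00", 503),
    ("2026-06-22T14:47:00", 200),
    ("2026-06-22T14:48:00", 200),
]

for start, end in outage_windows(samples):
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M") if end else "ongoing")
# 14:05 to 14:47
```

The precision of the window is bounded by the check interval, so a shorter interval gives tighter start and end times for the timeline.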

FAQ

Who should write the postmortem?

Usually the incident manager or on-call responder, but everyone involved contributes. The document belongs to the team, not one person.

How soon after an incident should I write it?

Within 1–3 days, while details are fresh. Delay leads to lost facts and a less accurate timeline.

Should postmortems be published externally?

For major public incidents, yes — a short customer-facing version. The full internal review stays with the team.

What should happen to action items?

File them as tasks with an owner, priority and deadline, and track completion. A postmortem without completed actions is useless.

Capture an accurate incident timeline. Connect enterno.io monitors so every failure carries objective timestamps for the review.
