Short answer. A postmortem is a review document written after an incident whose goal is not to find a culprit but to understand how the system allowed the failure and avoid repeating it. "Blameless" means focusing on processes and systems, not on people. A good postmortem contains a timeline, user impact, root causes and concrete actions with owners and deadlines.
Why blameless
If people are punished for incidents, they start hiding mistakes — and the organisation loses its ability to learn. The blameless approach does the opposite: it rewards honesty.
The person who pressed "the wrong button" is a symptom, not a cause. The cause is that the system let one button trigger an outage with no safeguard.
When to write a postmortem
- Any incident that affected users or breached an SLO.
- Near-misses that almost became outages.
- Pages that required manual responder intervention.
- Recurring small failures — even if each one alone is minor.
Postmortem structure
A standard template that is easy to adapt:
POSTMORTEM:
Date: 2026-06-22
Authors:
Status: draft / final
== SUMMARY ==
1–2 sentences: what broke and the impact.
== IMPACT ==
- Duration: 14:02–14:47 (45 min)
- Affected: ~30% of API requests returned 503
- SLO: 45 min of downtime against a 43.2 min monthly error budget (99.9% over 30 days); budget exhausted
== TIMELINE (UTC) ==
14:02 Deploy v2.4.1
14:05 503 errors rise, alert fires
14:09 On-call acknowledges the incident
14:23 Root cause found: DB connection pool exhausted
14:31 Rolled back to v2.4.0
14:47 Metrics normal, incident closed
== ROOT CAUSES ==
1. New code opened a connection per request without returning it to the pool.
2. Load testing did not cover peak traffic.
== WHAT WENT WELL ==
- The alert arrived within 3 minutes.
- The rollback completed 8 minutes after the root cause was found.
== ACTION ITEMS ==
[ ] Return connections to the pool — @ivan — by Jun 25 (P1)
[ ] Add a peak load test — @olga — by Jun 30 (P2)
[ ] Alert on DB pool saturation — @ivan — by Jun 27 (P1)
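The error-budget line in the IMPACT section is simple arithmetic, and it is worth recomputing rather than copying. The sketch below assumes a time-based availability SLO of 99.9% over a 30-day window, which is what a 43.2-minute budget corresponds to; the function name and the target are illustrative assumptions, not part of the template.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


# Assumed target: 99.9% availability over 30 days.
budget = error_budget_minutes(0.999)
spent = 45  # incident duration in minutes, from the IMPACT section
remaining = budget - spent

print(f"budget={budget:.1f} min, spent={spent} min, remaining={remaining:.1f} min")
```

A negative remainder means the incident alone used up more than the whole window's budget. If the SLO is request-based rather than time-based, the budget would instead be counted in failed requests, and a partial outage would consume less of it.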
How to find the root cause
- Build the timeline from facts and logs, not from memory.
- Ask "why?" several times in a row (the 5 Whys method) until you reach a systemic cause.
- Distinguish the trigger (the deploy) from the real vulnerability (no pool safeguard).
- Look for a chain of conditions, not a single cause — complex outages usually have several.
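Once the timeline is built from logged timestamps, the intervals between events can be computed instead of estimated. The following is a minimal sketch using the example timeline from the template; the event keys and interval names (such as `time_to_detect`) are labels chosen here for illustration.

```python
from datetime import datetime

DAY = "2026-06-22"

# Timestamps (UTC) taken from the example timeline in the template.
EVENTS = {
    "deploy": "14:02",
    "alert": "14:05",
    "acknowledged": "14:09",
    "root_cause_found": "14:23",
    "rolled_back": "14:31",
    "closed": "14:47",
}


def parse(hhmm: str) -> datetime:
    """Turn an HH:MM string into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = parse(EVENTS[end]) - parse(EVENTS[start])
    return int(delta.total_seconds() // 60)


intervals = {
    "time_to_detect": minutes_between("deploy", "alert"),
    "time_to_acknowledge": minutes_between("alert", "acknowledged"),
    "time_to_diagnose": minutes_between("acknowledged", "root_cause_found"),
    "time_to_mitigate": minutes_between("root_cause_found", "rolled_back"),
    "total_duration": minutes_between("deploy", "closed"),
}

for name, minutes in intervals.items():
    print(f"{name}: {minutes} min")
```

Seeing the intervals side by side shows where the time went. In this example, diagnosis (14 minutes) is the longest step before mitigation, which points toward better diagnostics such as a pool-saturation alert.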
What makes good action items
| Bad action item | Good action item |
|---|---|
| "Be more careful" | "Add an alert on DB pool saturation" |
| No owner | Owner and deadline assigned |
| No priority | P1/P2 with a clear due date |
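The criteria in the table can be checked mechanically before a postmortem is marked final. This is a rough sketch, not a standard tool: the `ActionItem` class, the `problems` function and the list of vague phrases are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None


# Illustrative list of phrases that signal a wish rather than a change.
VAGUE_PHRASES = ("be more careful", "pay more attention", "try harder")


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item fails the 'good action item' test."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no deadline")
    if item.priority not in ("P1", "P2"):
        issues.append("no priority")
    if any(phrase in item.title.lower() for phrase in VAGUE_PHRASES):
        issues.append("not a concrete, verifiable change")
    return issues


good = ActionItem("Add an alert on DB pool saturation", "@ivan", date(2026, 6, 27), "P1")
bad = ActionItem("Be more careful")

print(problems(good))
print(problems(bad))
```

An empty list means the item has an owner, a deadline, a priority and a concrete title. A check like this could run when the postmortem is filed, so that incomplete items are caught before the review closes.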
How enterno.io helps with postmortems
An accurate timeline starts with accurate data. As an external (synthetic) monitoring service, enterno.io records when an outage began and ended, along with response codes and response times, which together form the backbone of the timeline section. HTTP, SSL, Ping and DNS checks run every minute or every 30 seconds from multiple regions (Russia, Europe and the US), and incidents are opened and closed automatically.
The history on your monitors and a public status page provide objective timestamps. For background jobs, use heartbeat monitoring. For the response side, see the incident response plan.
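To show how periodic check results turn into incident start and end times, here is a simplified sketch. It is not enterno.io's API or its actual algorithm: the `incident_windows` function, the rule that any 5xx status counts as a failure, and the sample data are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Check = Tuple[str, int]  # (timestamp "HH:MM", HTTP status code)


def incident_windows(checks: Iterable[Check]) -> List[Tuple[str, Optional[str]]]:
    """Group consecutive failing checks into (start, end) windows.

    start is the first failing check; end is the first healthy check after
    it, or None if the incident is still open at the end of the data.
    """
    windows: List[Tuple[str, Optional[str]]] = []
    start: Optional[str] = None
    for timestamp, status in checks:
        failing = status >= 500  # assumption: 5xx means the check failed
        if failing and start is None:
            start = timestamp
        elif not failing and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Hypothetical one-minute check results.
sample = [("14:04", 200), ("14:05", 503), ("14:06", 503), ("14:07", 200), ("14:08", 200)]
print(incident_windows(sample))
```

Real monitoring systems typically add confirmation rules, such as requiring several consecutive failures or failures from more than one region, so that a single bad check does not open an incident.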
FAQ
Who should write the postmortem?
Usually the incident manager or on-call responder, but everyone involved contributes. The document belongs to the team, not one person.
How soon after an incident should I write it?
Within 1–3 days, while details are fresh. Delay leads to lost facts and a less accurate timeline.
Should postmortems be published externally?
For major public incidents, yes — a short customer-facing version. The full internal review stays with the team.
What should happen to action items?
File them as tasks with an owner, priority and deadline, and track completion. A postmortem without completed actions is useless.
Capture an accurate incident timeline. Connect enterno.io monitors so every failure carries objective timestamps for the review.