Running Blameless Postmortems
Fix the system that allowed the failure, not the person who tripped over it.
A postmortem is not a trial. Its job is to make the system more reliable, and it can only do that if the people closest to the failure tell the truth about what happened. The fastest way to lose that truth is to look for someone to blame. This play describes how to run postmortems that surface honest information and turn it into real improvements.
When to use this play
Run a postmortem after any incident for which your severity policy requires one, and whenever something failed in a way the team did not understand. The trigger is "we were surprised," not "someone is in trouble." If the only reason you are gathering is to assign fault, you do not need a postmortem; you need a different conversation, and probably a different culture.
Why blameless produces better reliability
When people believe a postmortem might be used against them, they manage their exposure: they soften the timeline, omit the awkward decision, and leave out the workaround everyone secretly relies on. You end up with a tidy document and a fragile system. When people trust that the postmortem is about the system, they report honestly, and honest reporting is the raw material reliability is built from. The root cause is never "person X deployed bad code." It is "the system or decision that allowed bad code to reach production without being caught." Phrase it that way and you start finding fixes instead of scapegoats.
How to run it
1. Reconstruct the timeline from the record. Pull the timestamped notes from the incident channel and lay out what happened in UTC, marking detection, acknowledgment, mitigation, and resolution. Use the record, not memory.
2. Establish impact honestly. State who was affected and for how long, and put numbers on both. Vague impact leads to vague priorities.
3. Find the root cause in the system. Ask what system or decision made the failure possible. Keep going until the answer is something you can actually change, not a person you can blame.
4. Capture contributing factors and what went well. Most incidents have several contributing factors, not one clean cause. Note what went well too, because those are the practices worth protecting.
5. Write action items with owners and due dates. An action item without both an owner and a due date is a wish, not a commitment. Every item names exactly one owner and a date.
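Step 1 is mechanical enough to sketch in code. The following is a minimal Python sketch, assuming the incident channel can export ISO 8601 timestamps that carry a timezone offset. The `RECORD` data, the milestone labels, and the function names are hypothetical and exist only to illustrate the idea.

```python
from datetime import datetime, timezone

# Hypothetical export from an incident channel: (timestamp, milestone).
# Entries arrive out of order and one uses a local offset, as real notes often do.
RECORD = [
    ("2024-01-15T14:40:00+00:00", "mitigation"),
    ("2024-01-15T14:09:00+00:00", "detection"),
    ("2024-01-15T15:12:00+01:00", "acknowledgment"),  # 14:12 in UTC
    ("2024-01-15T15:25:00+00:00", "resolution"),
]


def build_timeline(record):
    """Parse each timestamp, convert it to UTC, and sort chronologically."""
    events = []
    for stamp, milestone in record:
        when = datetime.fromisoformat(stamp)
        if when.tzinfo is None:
            # Refuse to guess: a timestamp without an offset cannot be placed in UTC.
            raise ValueError(f"timestamp has no timezone offset: {stamp}")
        events.append((when.astimezone(timezone.utc), milestone))
    return sorted(events)


def minutes_since_detection(timeline):
    """Whole minutes from detection to every other milestone."""
    marks = {milestone: when for when, milestone in timeline}
    detected = marks["detection"]
    return {
        milestone: int((when - detected).total_seconds() // 60)
        for when, milestone in timeline
        if milestone != "detection"
    }


if __name__ == "__main__":
    timeline = build_timeline(RECORD)
    for when, milestone in timeline:
        print(when.strftime("%Y-%m-%d %H:%M UTC"), milestone)
    print(minutes_since_detection(timeline))
```

Rejecting timestamps that lack an offset is a deliberate choice in this sketch: it forces the timeline to come from the record rather than from someone's recollection of which timezone they were in.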
The postmortem template
- Summary — one or two sentences on what happened.
- Timeline — in UTC, with detection, acknowledgment, mitigation, and resolution marked.
- Impact — who, how long, quantified.
- Root cause — the system or decision that allowed it.
- Contributing factors — everything else that made it worse or more likely.
- What went well — the practices and tooling that helped.
- What went poorly — where the response or system fell short.
- Action items — each with a single owner and a due date.
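If your team keeps postmortems as structured data rather than free-form documents, the template maps onto a small record type. The sketch below is one possible shape in Python, not a prescribed schema: the class names, field names, and the `wishes` helper are all hypothetical. It treats an action item as a commitment only when it has both an owner and a due date.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # a single named owner, by convention
    due: Optional[date] = None

    def is_commitment(self) -> bool:
        """True only when the item has a non-blank owner and a due date."""
        has_owner = bool(self.owner and self.owner.strip())
        return has_owner and self.due is not None


@dataclass
class Postmortem:
    summary: str
    impact: str
    root_cause: str               # the system or decision, not a person
    timeline: List[str] = field(default_factory=list)  # UTC entries
    contributing_factors: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    went_poorly: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)


def wishes(postmortem: Postmortem) -> List[ActionItem]:
    """Action items missing an owner, a due date, or both."""
    return [a for a in postmortem.action_items if not a.is_commitment()]


# Illustrative data only.
example = Postmortem(
    summary="Checkout requests failed after a configuration change.",
    impact="Checkout unavailable for some users for 76 minutes.",
    root_cause="Configuration changes reach production without a staged rollout.",
    action_items=[
        ActionItem("Add a staged rollout for config changes", "dana", date(2024, 2, 12)),
        ActionItem("Alert on checkout error rate", owner="lee"),  # no due date
        ActionItem("Document the rollback procedure"),            # no owner, no date
    ],
)

if __name__ == "__main__":
    for item in wishes(example):
        print("Not yet a commitment:", item.description)
```

A check like `wishes` could run before a postmortem is marked complete, so that items without an owner or date are caught while the people involved are still in the room.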
Track the action items until they close
Action items die quietly when no one watches them. Track them in one place per product and review them weekly until every one is closed. The error budget, the amount of unreliability a service's reliability target allows, is your input for deciding how hard to push: when a service is burning through its budget, its postmortem action items jump the queue; when it is comfortably within budget, you can weigh those items against new work instead of treating every one as urgent. This is what keeps the backlog honest and the prioritization grounded in reliability targets rather than the loudest recent incident.
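The budget arithmetic is simple enough to show. The following Python sketch assumes a request-based reliability target and treats a service as "burning through its budget" when it has spent a larger fraction of the budget than the fraction of the measurement window that has elapsed. That rule, the function names, and the numbers are assumptions for illustration; your own policy may draw the line elsewhere.

```python
def budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget spent so far (1.0 means fully spent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: check the target and request count")
    return failed_requests / allowed_failures


def action_item_priority(consumed: float, window_elapsed: float) -> str:
    """
    Assumed rule: items jump the queue when the budget is exhausted or is
    being spent faster than the window is passing. Both arguments are
    fractions between 0 and 1 (consumed may exceed 1).
    """
    if consumed >= 1.0 or consumed > window_elapsed:
        return "jump the queue"
    return "weigh against new work"


if __name__ == "__main__":
    # Illustrative numbers: a 99.9% target over 1,000,000 requests allows
    # roughly 1,000 failures; 400 failures is about 40% of the budget.
    consumed = budget_consumed(0.999, 1_000_000, 400)
    print(f"{consumed:.0%} of budget spent")
    print(action_item_priority(consumed, window_elapsed=0.5))
```

Comparing spend against elapsed time, rather than against a fixed threshold, means a service that spends most of its budget in the first week gets attention before the budget actually runs out.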
Keep it from becoming theater
A postmortem becomes theater when it produces a polished document that changes nothing. The warning signs are familiar: action items with no owner, items that roll over week after week, root causes that conveniently point at an individual, and meetings where everyone already knows the conclusion before they sit down. Guard against it by tying action items to error budgets so they have real priority, reviewing them weekly so they cannot quietly rot, and protecting the blameless framing so people keep telling you the truth.
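Some of these warning signs can be checked mechanically during the weekly review. The sketch below is a hypothetical example in Python: it assumes your tracker can export open items with an owner, a due date, and a count of how often the due date has been pushed back. The field names, the sample rows, and the `max_reschedules` threshold are all illustrative assumptions.

```python
from datetime import date

# Hypothetical tracker export; field names and rows are illustrative.
ITEMS = [
    {"title": "Add canary stage to deploy pipeline", "owner": "dana",
     "due": date(2024, 2, 12), "closed": False, "times_rescheduled": 0},
    {"title": "Alert on queue depth", "owner": None,
     "due": date(2024, 2, 8), "closed": False, "times_rescheduled": 0},
    {"title": "Document rollback steps", "owner": "lee",
     "due": date(2024, 1, 20), "closed": False, "times_rescheduled": 3},
    {"title": "Raise connection pool limit", "owner": "sam",
     "due": date(2024, 1, 10), "closed": True, "times_rescheduled": 0},
]


def weekly_review(items, today, max_reschedules=2):
    """Map each open item's title to the warning signs it shows, if any."""
    flagged = {}
    for item in items:
        if item["closed"]:
            continue  # closed items need no further review
        reasons = []
        if not item["owner"]:
            reasons.append("no owner")
        if item["due"] < today:
            reasons.append("overdue")
        if item["times_rescheduled"] > max_reschedules:
            reasons.append("keeps rolling over")
        if reasons:
            flagged[item["title"]] = reasons
    return flagged


if __name__ == "__main__":
    for title, reasons in weekly_review(ITEMS, today=date(2024, 2, 5)).items():
        print(f"{title}: {', '.join(reasons)}")
```

A report like this does not replace the review meeting. It only makes sure the conversation starts with the items most likely to rot.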
Common traps
- Naming a person as the root cause. "Person X did Y" is never a root cause; it is the end of learning. Ask what allowed Y to matter.
- Action items without owners or dates. These are wishes. They will not happen.
- Letting the document be the deliverable. The deliverable is a more reliable system. The document is just the record.
- Skipping the weekly review. Unwatched action items do not get done.
- Ignoring the error budget. Without it, you cannot tell which fixes are worth the investment and which are gold-plating.
Signals it's working
- People volunteer uncomfortable details because they trust the process.
- Action items close on or near their due dates.
- Repeat incidents from the same root cause become rare.
- The error budget, not the most recent scare, drives what gets fixed next.