Incident Response and Severity Classification

Classify by impact, mitigate before you investigate, and write the timeline as you go.

When a system breaks, the worst time to invent your process is in the middle of the outage. This play gives you a shared severity language, a response sequence that holds up under pressure, and the templates that turn a chaotic event into something your team can run calmly and learn from afterward.

When to use this play

Use it the moment something looks wrong in production and you are not certain whether it is a minor blip or a real incident. The classification step exists precisely for that ambiguity. If you are debating whether something "counts," declare it and let the severity matrix tell you how much to do.

Classify by impact, not by adrenaline

Severity is set by the impact on customers, never by how stressful the situation feels. A scary-looking alert with no user impact is not a high-severity incident; a quiet failure that is silently corrupting data is. The first responder declares an initial severity so the response can start immediately, and a tech lead can adjust it later as the picture sharpens. When you are genuinely unsure, choose the higher severity rather than the lower one.

The severity matrix:

SEV1 — total customer-facing outage, data loss, or an active security incident. Page immediately. Leadership notified within 15 minutes. Status page posted within 15 minutes. Postmortem required.
SEV2 — major degradation or a critical feature broken for a meaningful subset of users. Page the primary. Status page within 30 minutes. Postmortem required.
SEV3 — minor degradation or single-user impact with a workaround available. Ticket only, no page, triaged the next business day.
SEV4 — cosmetic. Goes to the backlog, no page.
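One way to keep the matrix from drifting between documents and tooling is to hold it as data. The sketch below is a minimal illustration, not an existing tool: the `Policy` fields and the impact flags passed to `classify` are hypothetical names, and only the numbers come from the matrix above. Checks run from the highest severity down, so an ambiguous case lands on the higher level.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Policy:
    page: bool
    status_page_minutes: Optional[int]  # None: the matrix states no window
    leadership_minutes: Optional[int]   # None: the matrix states no window
    postmortem_required: bool


# Values copied from the severity matrix; field names are illustrative.
POLICIES = {
    Severity.SEV1: Policy(page=True, status_page_minutes=15,
                          leadership_minutes=15, postmortem_required=True),
    Severity.SEV2: Policy(page=True, status_page_minutes=30,
                          leadership_minutes=None, postmortem_required=True),
    Severity.SEV3: Policy(page=False, status_page_minutes=None,
                          leadership_minutes=None, postmortem_required=False),
    Severity.SEV4: Policy(page=False, status_page_minutes=None,
                          leadership_minutes=None, postmortem_required=False),
}


def classify(*, total_outage: bool = False, data_loss: bool = False,
             security_incident: bool = False, major_degradation: bool = False,
             critical_feature_broken: bool = False,
             minor_degradation: bool = False) -> Severity:
    """Map observed customer impact to a severity, worst case first."""
    if total_outage or data_loss or security_incident:
        return Severity.SEV1
    if major_degradation or critical_feature_broken:
        return Severity.SEV2
    if minor_degradation:
        return Severity.SEV3
    return Severity.SEV4  # cosmetic issues only
```

A quiet data-corruption bug would be declared with `classify(data_loss=True)`, which returns `Severity.SEV1` however calm the dashboards look.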

How to run it

Move through the sequence in order, but never let process delay mitigation.

1. Detect. An alert fires or someone reports a problem. Every alert links to a runbook so the responder is not starting from a blank page.

2. Acknowledge. For SEV1 and SEV2, acknowledge the page within 5 minutes so everyone knows a human is on it.

3. Triage and assign an incident commander. Confirm the severity and name a commander to coordinate. The commander runs the response; they do not have to be the one with their hands on the keyboard.

4. Communicate. For customer-facing incidents, post to the status page within the severity's time window and keep it current.

5. Mitigate before you investigate. A 5-minute outage you stopped beats a 60-minute outage you fully understood. Roll back, throttle, or fail over to stop the bleeding first. Root cause can wait until users are no longer affected.

6. Resolve. Keep monitoring for at least 30 minutes after mitigation before declaring resolution. Recurrence right after a premature "all clear" erodes trust faster than the original incident.

7. Postmortem. For SEV1 and SEV2, complete a postmortem within five business days.
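The time windows in the sequence above can be turned into concrete deadlines. The following is a rough sketch under stated assumptions: the function names are hypothetical, the acknowledgement clock is assumed to start when the alert fires, the postmortem clock is assumed to start at resolution, and "business day" is taken to mean Monday through Friday with no holiday calendar. None of those starting points is specified by the play itself.

```python
from datetime import datetime, timedelta

ACK_WINDOW = timedelta(minutes=5)          # SEV1 and SEV2 acknowledgement
MONITORING_WINDOW = timedelta(minutes=30)  # watch period after mitigation
POSTMORTEM_BUSINESS_DAYS = 5               # SEV1 and SEV2 postmortem


def ack_deadline(alert_fired_at: datetime) -> datetime:
    """Latest acknowledgement time (assumes the clock starts at the alert)."""
    return alert_fired_at + ACK_WINDOW


def earliest_resolution(mitigated_at: datetime) -> datetime:
    """Earliest moment the incident may be declared resolved."""
    return mitigated_at + MONITORING_WINDOW


def postmortem_due(resolved_at: datetime) -> datetime:
    """Count forward five business days, skipping Saturday and Sunday.

    Assumption: the clock starts at resolution and holidays are ignored.
    """
    due = resolved_at
    remaining = POSTMORTEM_BUSINESS_DAYS
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return due
```

For an incident mitigated at 10:12, `earliest_resolution` returns 10:42, which is the soonest an "all clear" should go out.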

Keep the channel disciplined

Every response action gets a short, timestamped note in the incident channel as it happens. That running log is the source timeline for the postmortem; reconstructing it from memory afterward never works. Channel discipline matters: if you are not part of the response, you read, you do not post. A response channel flooded with bystander questions is one where the actual responders cannot find each other.
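If your chat tool does not stamp messages reliably, a small helper can keep the running log consistent. This is a minimal sketch, not a chat integration: `Timeline` and `note` are hypothetical names, and timestamps are assumed to be timezone-aware and are normalized to UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Timeline:
    """Append-only log of response actions, one line per action."""
    entries: List[str] = field(default_factory=list)

    def note(self, author: str, action: str,
             at: Optional[datetime] = None) -> str:
        # Default to "now"; callers may pass an aware datetime instead.
        moment = at if at is not None else datetime.now(timezone.utc)
        stamp = moment.astimezone(timezone.utc).isoformat(timespec="seconds")
        line = f"{stamp} {author}: {action}"
        self.entries.append(line)
        return line

    def render(self) -> str:
        """Return the log in the order it was written, ready for a postmortem."""
        return "\n".join(self.entries)
```

Calling `timeline.note("alice", "rolled back deploy")` at the moment of the rollback means the postmortem timeline can be copied out with `render()` rather than reconstructed.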

The runbook template

Every alert points to a runbook with these sections:

Symptom — what the problem looks like, in plain human language.
Likely causes — ranked by how often they are the culprit.
Investigation steps — concrete commands to run and dashboard links to open.
Mitigation steps — the actions that stop the bleeding.
Escalation — exactly when to wake the lead.
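A simple check can catch runbooks that are missing a section before they are linked from an alert. The sketch below assumes runbooks are plain text in which each section starts on its own line with the section name; that layout and the function name `missing_sections` are assumptions for illustration.

```python
from typing import List

# Section names taken from the template above.
REQUIRED_SECTIONS = (
    "Symptom",
    "Likely causes",
    "Investigation steps",
    "Mitigation steps",
    "Escalation",
)


def missing_sections(runbook_text: str) -> List[str]:
    """Return the required sections that no line of the runbook starts with."""
    lines = [line.strip().lower() for line in runbook_text.splitlines()]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section.lower()) for line in lines)
    ]
```

Run against a draft, an empty result means all five sections are present; anything else is the list of sections still to write.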

The customer status-page rules

Post within the severity's window after declaring the incident: 15 minutes for SEV1, 30 minutes for SEV2.
Write in plain language, not internal jargon.
Update at least every 30 minutes, even if the update is "still investigating."
Close with a one- or two-sentence "resolved" note when it is over.
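The 30-minute update rule is easy to audit after the fact. As a rough sketch, the function below takes the times at which updates were posted and the time the incident closed, and returns every stretch longer than 30 minutes with no update. The name `cadence_gaps` and its inputs are hypothetical; it covers only the update cadence, not the deadline for the first post.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

UPDATE_INTERVAL = timedelta(minutes=30)


def cadence_gaps(posted_at: List[datetime],
                 until: datetime) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs where more than 30 minutes passed silently.

    The final interval runs from the last post to `until`. With no posts
    at all there are no intervals to compare, so the result is empty.
    """
    points = sorted(posted_at) + [until]
    return [
        (earlier, later)
        for earlier, later in zip(points, points[1:])
        if later - earlier > UPDATE_INTERVAL
    ]
```

An empty result means every update landed inside its window; each returned pair is a period when customers were left wondering.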

Common traps

Classifying by stress. A calm-looking data-corruption bug deserves a higher severity than a loud but harmless alert. Anchor on impact.
Investigating before mitigating. Curiosity is valuable, but not while customers are down. Stop the bleeding, then understand it.
Declaring victory too early. Skipping the 30-minute monitoring window turns one incident into two.
A noisy incident channel. Bystanders posting questions drown out the responders. Read-only unless you are responding.
Reconstructing the timeline later. If you did not write it down as it happened, the postmortem timeline will be wrong.

Signals it's working

The team reaches for the same severity labels without arguing about them.
Mitigation consistently happens before deep investigation.
Postmortems start with a timeline that was already written, not pieced together from memory.
Status-page updates land inside their windows and customers stop having to ask whether you know.