Operating Production Is a Discipline, Not a Toolset

Response discipline is permanent; the tooling is interchangeable

Every few years the operations tooling market reshuffles, and teams convince themselves the new platform will finally make incidents manageable. It won't, at least not by itself. The thing that actually keeps production healthy is discipline, and discipline doesn't ship in a box.

Separate the durable from the disposable

Draw a clear line between what's permanent and what's interchangeable. Your response discipline (how you detect, communicate, mitigate, and learn) is durable. It will outlast every tool you adopt. The specific paging app, dashboard vendor, or incident tracker is disposable; you could swap any of them next year without changing how well you operate.

When you invest, invest in the durable side. A team with strong response discipline and mediocre tools will run circles around a team with the best tools and no discipline.

Distribute ownership across roles

Concentrating operational ownership in one heroic individual is a failure mode, not a strength. Spread it out so the system survives any one person going on vacation:

Customer communications during incidents need an owner so users aren't left guessing.
On-call sustainability needs an owner who watches rotation health and alert load, not just coverage.
SLO ownership needs someone accountable for the reliability targets themselves.
Runbook authorship needs to be someone's explicit responsibility, not an afterthought.

When these roles are distributed, no single departure cripples your operations, and no single person quietly burns out carrying all of it.

Mitigate before you investigate

In the middle of an incident, the instinct to fully understand the root cause is strong and usually wrong. Stopping the impact beats understanding it. Roll back, fail over, shed load, flip the flag, whatever halts the bleeding. You can and should investigate thoroughly afterward, with the pressure off and the customer impact already contained. Mitigate first, investigate second. The postmortem is where understanding belongs.

Postmortems build trust

Treat postmortems as a trust-building asset, not a liability to hide. A well-run, blameless postmortem signals to the whole organization that the team faces its failures honestly and gets better because of them. Hiding incidents or sanitizing the writeup does the opposite; it tells everyone that failure is shameful and therefore must be concealed, which is exactly how small problems grow into large ones.

There's a leading indicator hiding in your postmortem record, too: incident frequency trend. Whether incidents are becoming more or less common over time tells you more about system health than any single outage does. A rising trend is a warning even if each individual incident was minor.
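
As a rough illustration of reading that trend, the sketch below counts incidents per calendar month and fits a least-squares slope to the counts. The function names (`monthly_counts`, `trend_slope`) and the sample dates are hypothetical, and a straight-line fit is only one simple way to summarize a trend; treat it as a starting point, not a prescribed method.

```python
from collections import Counter
from datetime import date


def monthly_counts(incident_dates):
    """Count incidents per calendar month, filling quiet months with zero."""
    if not incident_dates:
        return []
    counts = Counter((d.year, d.month) for d in incident_dates)
    first, last = min(counts), max(counts)
    series = []
    year, month = first
    while (year, month) <= last:
        series.append(counts.get((year, month), 0))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series


def trend_slope(series):
    """Least-squares slope of monthly counts against the month index."""
    n = len(series)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical postmortem dates, for illustration only.
incidents = [
    date(2024, 1, 9),
    date(2024, 2, 3), date(2024, 2, 20),
    date(2024, 3, 5), date(2024, 3, 14), date(2024, 3, 28),
]
series = monthly_counts(incidents)
slope = trend_slope(series)
print(series, slope)  # [1, 2, 3] 1.0
```

A positive slope means incidents are becoming more frequent, even when every individual incident was small. This only works if the minor incidents are recorded in the first place.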

Write the runbook while the knowledge is fresh

The moment you finish debugging a new alert for the first time, write the runbook. Right then, while the context is still loaded in your head, before you forget which dead ends you ruled out and which check actually mattered. A runbook written a week later is a guess; one written in the moment is a record. This single habit compounds enormously, because the next person to get paged for that alert inherits everything you just learned instead of rediscovering it at 3 a.m.

Common operations traps

Most operational dysfunction comes from a handful of recurring traps. Name them so you can catch them:

Hero culture. If resolving an incident took a hero, the system failed. Celebrating the rescue hides the design flaw that made the rescue necessary.
Postmortem shame. Punishing people for incidents destroys the entire value of the postmortem. Once it's punitive, people hide problems, and you lose your best learning mechanism.
Action items without an owner or due date. An action item with no owner and no due date is a wish, not a commitment. It will not happen.
Skipping postmortems for "minor" incidents. The individual incident may be small, but the pattern across many small incidents is exactly what you need to see. Skip them and you go blind to the trend.
Tolerating the same alert firing repeatedly. A recurring alert is the system asking for attention. Acknowledging and dismissing it over and over trains the team to ignore signals that matter.
Assigning on-call to whoever pushed back least. On-call should be distributed fairly and sustainably, not dumped on the person with the weakest boundaries. That path leads straight to burnout and attrition.
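
Two of these traps are easy to check mechanically. The sketch below flags action items that lack an owner or due date and lists alerts that keep firing. The `ActionItem` shape, the function names, the sample data, and the threshold of three firings are all assumptions made for illustration; adapt them to whatever your tracker and alerting system actually export.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def unowned_or_undated(items):
    """Return action items missing an owner or a due date."""
    return [item for item in items if not item.owner or item.due is None]


def repeat_offenders(alert_names, threshold=3):
    """Return names of alerts that fired at least `threshold` times."""
    counts = Counter(alert_names)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical data, for illustration only.
items = [
    ActionItem("Add disk-usage alert", owner="dana", due=date(2024, 6, 1)),
    ActionItem("Document failover steps", owner="lee"),
    ActionItem("Tune retry budget"),
]
fired = ["disk_full", "high_latency", "disk_full", "disk_full", "cert_expiry"]

wishes = [item.title for item in unowned_or_undated(items)]
noisy = repeat_offenders(fired)
print(wishes)  # ['Document failover steps', 'Tune retry budget']
print(noisy)   # ['disk_full']
```

Running a check like this in the weekly operations review keeps wishes and noisy alerts visible instead of relying on someone to remember them.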

The takeaway

Buy whatever tools serve you, and switch them freely when better ones come along. But don't mistake the tools for the capability. The capability lives in your habits: mitigate before investigating, write runbooks while the knowledge is hot, run blameless postmortems, distribute ownership, and watch the frequency trend. Those practices are the part worth protecting, because they're the part that actually keeps production running.