SLOs and Error Budgets, Explained

Turning reliability debates into decisions you can argue with data

Without a service level objective, "reliability work" is just opinion. Someone thinks the system feels slow, someone else thinks it's fine, and the argument goes nowhere because nobody is holding a number. SLOs fix that. They make reliability arguable with data instead of vibes, and that single shift changes how a team prioritizes.

SLI, SLO, and error budget

These three terms get used interchangeably, but they're distinct and the difference matters.

  • An SLI (service level indicator) is a measurement. It's the raw signal: the percentage of requests that returned successfully, the p95 latency on your checkout endpoint, the fraction of jobs that completed without error. An SLI is a number you observe.
  • An SLO (service level objective) is the target you hold that SLI to, for example "99.9 percent of requests succeed over a rolling 30-day window." The SLO is a promise the team makes to itself and its users.
  • An error budget is the complement of the SLO: 100 percent minus the target. If your objective is 99.9 percent availability, then 0.1 percent of requests are allowed to fail. That allowance is your budget. It converts an abstract reliability goal into a concrete, spendable quantity of acceptable failure.

That last point is the one that unlocks everything else. A 99.9 percent target over a 30-day month works out to roughly 43 minutes of allowable downtime (0.1 percent of 43,200 minutes). Once you frame failure as a budget, reliability stops being a binary "is it up or down" question and becomes a question of how much of your allowance you have left.
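To make the arithmetic concrete, here is a minimal sketch that converts an availability target into minutes of allowable downtime. The function name and the 30-day default window are assumptions for illustration, and the calculation treats the budget as time fully down rather than as a count of failed requests.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the error budget permits over the window."""
    window_minutes = window_days * 24 * 60            # 43,200 for 30 days
    budget_fraction = (100.0 - slo_percent) / 100.0   # e.g. 0.001 for 99.9
    return window_minutes * budget_fraction


for target in (99.0, 99.5, 99.9):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" shrinks the budget tenfold, which is why the choice of target deserves more thought than it usually gets.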

Tie targets to business criticality

The most common mistake I see is picking one blanket reliability number and applying it everywhere. That wastes effort on systems that don't need it and underserves the ones that do. Targets should track how much the business actually cares if a given system degrades.

A reasonable set of default tiers, purely as a starting point to adjust:

  • Internal tooling: around 99.0 percent monthly availability. If the internal dashboard hiccups for a few minutes, nobody loses revenue.
  • Non-revenue-critical product surfaces: around 99.5 percent.
  • Revenue-critical product: around 99.9 percent. This is the path where downtime directly costs money or trust.
  • Web latency default: p95 under 1 second for your top endpoints. Most users feel anything slower.

These are illustrative. The point isn't the exact decimals, it's that each system gets a target proportional to what's at stake, and you can defend every number.
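One way to keep tiers defensible is to write them down as data rather than tribal knowledge. The sketch below encodes the illustrative tiers above. The service names and the service-to-tier mapping are hypothetical, and nothing here is tied to a particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    availability_slo: float  # percent, monthly


# Targets mirror the illustrative tiers in the text; adjust to your business.
TIERS = {
    "internal-tooling": Tier("Internal tooling", 99.0),
    "non-revenue": Tier("Non-revenue-critical product", 99.5),
    "revenue-critical": Tier("Revenue-critical product", 99.9),
}

# Default web latency target from the text: p95 under 1 second.
P95_LATENCY_TARGET_SECONDS = 1.0

# Hypothetical services, assigned to tiers by business criticality.
SERVICE_TIERS = {
    "admin-dashboard": "internal-tooling",
    "help-center": "non-revenue",
    "checkout": "revenue-critical",
}


def slo_for(service: str) -> float:
    """Look up the availability target (percent) for a service."""
    return TIERS[SERVICE_TIERS[service]].availability_slo


for service in SERVICE_TIERS:
    print(f"{service}: {slo_for(service)}% monthly availability")
```

Keeping the mapping in one reviewed file means a change to any target is a visible decision instead of a quiet dashboard edit.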

The error budget freeze rule

Here's where budgets earn their keep. Decide in advance what happens when the budget runs out, before you're in the middle of a bad week and tempted to negotiate with yourself.

The rule I recommend is simple: when the error budget is exhausted, feature work stops. The team shifts its focus to reliability, hardening the system and paying down whatever caused the burn. Feature work resumes only once the burn rate drops back to a sustainable level.

This sounds harsh, but it's the whole reason error budgets work. It removes the perpetual tension between "ship faster" and "be more reliable" by giving you a pre-agreed switch. As long as you're spending your budget at a healthy pace, you ship freely. Burn it too fast, and the system itself tells you to slow down. No one has to win an argument in the moment.
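The pre-agreed switch can be as small as one function. This is a sketch under stated assumptions: the SLI is request-based, `good` and `total` are request counts for the current SLO window, the target is below 100 percent, and the function names are made up for illustration.

```python
def budget_remaining(good: int, total: int, slo_percent: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0 or below is exhausted."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = total * (100.0 - slo_percent) / 100.0
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def should_freeze_features(good: int, total: int, slo_percent: float) -> bool:
    """Apply the freeze rule: stop feature work once the budget is gone."""
    return budget_remaining(good, total, slo_percent) <= 0.0


# 400 failures against an allowance of about 1,000: budget left, keep shipping.
print(should_freeze_features(good=999_600, total=1_000_000, slo_percent=99.9))
# 1,200 failures against the same allowance: budget exhausted, freeze.
print(should_freeze_features(good=998_800, total=1_000_000, slo_percent=99.9))
```

The value of writing it down is less the code than the agreement it represents: the threshold is fixed before the bad week starts.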

From debate to decision

Think about how reliability conversations usually go without this machinery. Someone wants to add monitoring, someone wants to refactor a flaky service, someone wants to ship the next feature, and the loudest or most senior voice tends to win. There's no shared way to say whether the investment is justified.

With SLOs and error budgets, the conversation changes shape entirely:

  • Are we meeting our SLO with budget to spare? Then keep shipping features, and don't gold-plate reliability you don't need.
  • Are we burning budget faster than expected? Now you have a data-backed reason to pause and invest, and nobody can hand-wave it away.
  • Is a proposed feature likely to threaten the budget? You can weigh that explicitly instead of discovering it in production.

The budget turns subjective debates into decisions with a clear trigger. That's the real value. You're no longer arguing about whether the system is reliable enough; you're looking at how much budget remains and acting accordingly.
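"Burning budget faster than expected" can be put into a number too. A common way to express it is burn rate: the observed error rate divided by the error rate the SLO allows, so a sustained value of 1.0 spends exactly the whole budget by the end of the window. The sketch below assumes a request-based SLI; the function names and the simple 1.0 threshold are illustrative choices, not a recommendation for alerting.

```python
def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """Observed error rate relative to the error rate the SLO permits."""
    if slo_percent >= 100.0:
        raise ValueError("a 100 percent target leaves no budget to burn")
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    budgeted_error_rate = (100.0 - slo_percent) / 100.0
    return observed_error_rate / budgeted_error_rate


def recommend(rate: float) -> str:
    """Map a burn rate to the decision described in the text."""
    if rate <= 1.0:
        return "keep shipping"
    return "pause and invest in reliability"


# 50 failures in 100,000 requests against 99.9 percent: about half the allowed rate.
print(recommend(burn_rate(failed=50, total=100_000, slo_percent=99.9)))
# 300 failures in 100,000 requests: about three times the allowed rate.
print(recommend(burn_rate(failed=300, total=100_000, slo_percent=99.9)))
```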

Getting started

You don't need a perfect observability stack to begin. Pick your two or three most important user-facing flows. Define one or two SLIs for each, something you can actually measure today, like request success rate and p95 latency. Set an honest SLO based on the criticality tier the flow belongs to. Then watch the budget for a few weeks before you enforce any freeze rule, so your targets reflect reality rather than wishful thinking.
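As a starting point, both SLIs mentioned above can be computed from plain request logs. This sketch makes two assumptions worth stating: a request counts as successful if it did not return a 5xx status, and p95 uses the nearest-rank method. Your definitions may reasonably differ, and the function names are illustrative.

```python
def success_rate(status_codes: list[int]) -> float:
    """Percent of requests that did not return a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests to measure")
    good = sum(1 for code in status_codes if code < 500)
    return 100.0 * good / len(status_codes)


def p95(latencies_seconds: list[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_seconds:
        raise ValueError("no latencies to measure")
    ordered = sorted(latencies_seconds)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


statuses = [200] * 997 + [404, 500, 503]
latencies = [0.12, 0.20, 0.25, 0.31, 0.18, 0.95, 0.22, 0.27, 1.40, 0.19]

print(f"success rate: {success_rate(statuses):.2f}%")
print(f"p95 latency: {p95(latencies):.2f}s")
```

Note that the 404 counts as a success here because the service answered correctly; whether client errors should count against you is a decision to make per flow.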

Once the numbers are in front of the team, you'll find the reliability conversations get noticeably calmer. That's the sign it's working.