Designing a Humane On-Call Rotation

Carry the pager without burning out the people who carry it.

On-call is where good engineering organizations quietly fail their people. The system works, the dashboards stay green, and one or two responders slowly grind themselves down covering for a rotation that was never designed to be survivable. This play describes how to build a rotation that protects reliability and the humans behind it at the same time.

When to use this play

Reach for this when you have a production system that can break outside business hours and a team expected to respond. It applies whether you are formalizing an ad-hoc arrangement that grew up by accident or standing up a rotation for the first time. The minimum bar is three people willing and able to respond; below that, the rotation cannot be humane and you should solve the staffing problem before you solve the scheduling one.

How to run it

1. Define distinct roles, not one overloaded hero. A workable rotation separates responsibilities so no single person owns everything:

  • Primary responder leads the response and is the first to be paged.
  • Secondary is the backup when the primary is unreachable or underwater.
  • Escalation point is the person the responders can reach when they need authority or deeper expertise.
  • Incident commander is named per major incident to coordinate the response, communication, and decisions. This is a role you step into, not a permanent title.

2. Use week-long shifts with offset coverage. A full week gives a responder enough continuity to hold context without the constant churn of daily swaps. Offset the secondary from the primary so the same two people are not exhausted in lockstep, and so a fresh backup is always available.

3. Compensate on-call explicitly. Carrying a pager is work, and unpaid or unacknowledged on-call is how resentment and attrition build. Pick a model and honor it: time off in lieu after a shift that involved real paging, or a stipend for carrying the rotation. The specific mechanism matters less than the principle that being on-call has a recognized cost the organization pays back.

4. Keep the rotation at least three people deep. Two-person rotations have no slack. Someone gets sick, takes vacation, or simply needs a week off, and the whole thing collapses onto one person. Three is the floor; deeper is better.

5. Protect recovery time after a page. Anyone paged after hours can take the following day off, no questions asked and no negotiation required. Sleep lost to an incident is a real cost, and pretending otherwise just degrades the quality of the next day's work and the next night's response.

6. Let vacation always win. Time off overrides on-call, full stop. If a shift lands on someone's vacation, swap the shift rather than interrupting the vacation. The whole point of protected time is that it stays protected.

7. Run a real handoff every week. The outgoing responder briefs the incoming one against a checklist so context does not evaporate at shift change:

  • Open incidents and their current status.
  • Recent changes that could generate noise: deploys, infrastructure changes, third-party events.
  • Outstanding action items from last week's postmortems.
  • Known fragile areas worth watching this week.
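
The scheduling rules in steps 2, 4, and 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a replacement for a real paging tool: the function name `build_rotation`, the input shapes, and the choice to pick the secondary as the next available person after the primary in the roster order are all hypothetical. Vacations are assumed to be recorded as the start dates of the weeks a person is away.

```python
from datetime import date, timedelta

MIN_DEPTH = 3  # step 4: three people is the floor


def build_rotation(people, start, weeks, vacations=None):
    """Return a list of (week_start, primary, secondary) tuples.

    people    -- ordered list of unique responder names (hypothetical input)
    start     -- date the first week-long shift begins
    weeks     -- number of shifts to schedule
    vacations -- dict of name -> set of week-start dates the person is away
    """
    if len(set(people)) != len(people) or len(people) < MIN_DEPTH:
        raise ValueError("rotation needs at least three distinct people")
    vacations = vacations or {}
    n = len(people)
    week_starts = [start + timedelta(weeks=i) for i in range(weeks)]

    def away(person, week):
        return week in vacations.get(person, set())

    # Step 2: week-long primary shifts in plain round-robin order.
    primaries = [people[i % n] for i in range(weeks)]

    # Step 6: vacation wins. Swap the shift with a later week rather than
    # interrupting the time off.
    for i, week in enumerate(week_starts):
        if not away(primaries[i], week):
            continue
        for j in range(i + 1, weeks):
            if not away(primaries[j], week) and not away(primaries[i], week_starts[j]):
                primaries[i], primaries[j] = primaries[j], primaries[i]
                break
        else:
            raise ValueError(f"no swap available for the week of {week}")

    # Step 2 again: offset the secondary. Assumption: take the next available
    # person after the primary in roster order.
    schedule = []
    for week, primary in zip(week_starts, primaries):
        k = people.index(primary)
        candidates = [people[(k + d) % n] for d in range(1, n)]
        secondary = next((p for p in candidates if not away(p, week)), None)
        if secondary is None:
            raise ValueError(f"no secondary available for the week of {week}")
        schedule.append((week, primary, secondary))
    return schedule


# Example: ben is on vacation for the week of 8 January, so his shift is swapped.
schedule = build_rotation(
    ["ana", "ben", "chi"],
    start=date(2024, 1, 1),
    weeks=4,
    vacations={"ben": {date(2024, 1, 8)}},
)
for week, primary, secondary in schedule:
    print(week, primary, secondary)
```

The swap only looks forward, so a vacation in the final scheduled week raises an error instead of silently assigning the shift. That is a deliberate choice in this sketch: a schedule that cannot honor time off should fail loudly so a human can fix it.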

Common traps

  • The three-person rotation that is really a one-person rotation. If the same responder always picks up because the others are unreliable, you have a staffing problem dressed up as a schedule. Fix the underlying coverage rather than papering over it.
  • Unpaid on-call. Treating availability as free is the fastest route to burnout and turnover. Acknowledge the cost explicitly or watch your best responders leave.
  • Skipping the day-after recovery. A responder who was up half the night and shows up anyway is doing worse work and is more likely to miss the next incident. Recovery time is an investment in reliability, not a perk.
  • Letting vacations get eaten. The first time someone gets paged on their time off and it is treated as normal, you have told the whole team that protected time is not real.
  • Handoffs by osmosis. Without a checklist, context lives only in the outgoing responder's head and walks out the door with them. The incoming responder inherits the pager and none of the situational awareness.

Signals it's working

  • Responders take their full vacations and their day-after-page recovery without having to fight for it.
  • The handoff checklist gets filled in every week, and incoming responders start their shift already knowing where the risk is.
  • No single name is attached to the majority of incidents.
  • People volunteer to join the rotation rather than dodging it, because being on-call is a manageable, compensated responsibility instead of a punishment.
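
The "no single name" signal is easy to check from an incident log. The sketch below assumes you can export one responder name per incident; the function names and the 0.5 threshold (a strict majority) are illustrative choices, not part of any particular incident tool.

```python
from collections import Counter


def top_responder_share(incident_responders):
    """Return (name, share) for the person attached to the most incidents.

    incident_responders -- list with one responder name per incident
    """
    if not incident_responders:
        return None, 0.0
    counts = Counter(incident_responders)
    name, count = counts.most_common(1)[0]
    return name, count / len(incident_responders)


def is_concentrated(incident_responders, threshold=0.5):
    """True when one person handled more than `threshold` of all incidents."""
    _, share = top_responder_share(incident_responders)
    return share > threshold


# Hypothetical log: ana handled three of five incidents.
log = ["ana", "ana", "ben", "ana", "chi"]
print(top_responder_share(log), is_concentrated(log))
```

A result above the threshold is a prompt to look for the one-person rotation described under common traps, not proof of it. A single noisy week can skew a short log.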