Rebuild from Scratch: The Last Resort

Build a parallel system from scratch only when every precondition is met.

A ground-up rebuild constructs a parallel system from scratch and cuts over once it is proven. It is the sub-play people reach for first and should reach for last. Full rebuilds routinely run longer than estimated, discard knowledge that was never written down, and replace a working-but-ugly system with a not-yet-working clean one. This page is as much about why to avoid a rebuild as how to do one.

When to use this play

Choose a ground-up rebuild only when every one of these is true:

  • The codebase is small, roughly under 20k lines of code.
  • The domain is well understood by people who will be on the new build.
  • The old and new systems can run in parallel during cutover.
  • Leadership accepts, going in, that it will take longer than estimated.

If any of these fails, pick a different sub-play. In particular, a large codebase or a domain that lives mostly in people's heads should steer you to Strangler Fig or Branch by Abstraction instead.
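The all-or-nothing nature of this checklist can be made explicit as a small gate. The following is a minimal sketch, not a prescribed tool: the `RebuildPreconditions` type, its field names, and the `failed_preconditions` helper are all hypothetical, and the 20,000-line threshold simply encodes the rough guideline above.

```python
from dataclasses import dataclass

# Hypothetical threshold encoding the "roughly under 20k lines" guideline.
MAX_LINES_OF_CODE = 20_000


@dataclass(frozen=True)
class RebuildPreconditions:
    """Hypothetical record of the four preconditions for a ground-up rebuild."""
    lines_of_code: int
    domain_experts_on_new_build: bool
    can_run_in_parallel: bool
    leadership_accepts_overrun: bool


def failed_preconditions(p: RebuildPreconditions) -> list[str]:
    """Return every precondition that is not met; an empty list means go."""
    failures = []
    if p.lines_of_code >= MAX_LINES_OF_CODE:
        failures.append("codebase is too large")
    if not p.domain_experts_on_new_build:
        failures.append("domain knowledge is not on the new build")
    if not p.can_run_in_parallel:
        failures.append("old and new systems cannot run in parallel")
    if not p.leadership_accepts_overrun:
        failures.append("leadership has not accepted a longer timeline")
    return failures


# Example: a large codebase whose domain lives in people's heads.
candidate = RebuildPreconditions(80_000, False, True, True)
for reason in failed_preconditions(candidate):
    print("Pick a different sub-play:", reason)
```

Returning the full list of failures, rather than stopping at the first, keeps the conversation about which sub-play to choose grounded in everything that is missing.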

How to run it

1. Understand the existing system thoroughly. Pay particular attention to the edge cases. The unglamorous conditionals are usually where the institutional knowledge hides, and they are exactly what a rebuild tends to drop.

2. Build in parallel. Construct the new system alongside the old one. The old system keeps serving production the entire time.

3. Run both in production, comparing outputs. Send real traffic through both and compare what they produce. Discrepancies are how you discover the behavior the old code encoded that nobody remembered.

4. Cut over only after confidence is built. Move production to the new system when the comparison has earned your trust, not when the calendar says to.

5. Keep a rollback path until the old system is fully decommissioned. Do not kill the old system the moment you cut over. Retire it only once the new one has proven itself in production.
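Steps 2 through 4 can be sketched as a shadow comparison: the old system answers every request, the new system runs alongside it, and any difference is recorded for investigation. This is an illustrative sketch under assumptions, not a production harness. The `shadow_compare` function and the `old_price` and `new_price` handlers are hypothetical, and it assumes the handlers are free of side effects so that calling both for the same request is safe.

```python
from typing import Any, Callable


def shadow_compare(
    request: Any,
    old_handler: Callable[[Any], Any],
    new_handler: Callable[[Any], Any],
    record_discrepancy: Callable[[Any, Any, Any], None],
) -> Any:
    """Serve from the old system; run the new one in the shadow and compare."""
    old_result = old_handler(request)
    try:
        new_result = new_handler(request)
    except Exception as exc:
        # A crash in the new system must never affect production traffic.
        record_discrepancy(request, old_result, exc)
        return old_result
    if new_result != old_result:
        record_discrepancy(request, old_result, new_result)
    # Until cutover, the old system's answer is always the one returned.
    return old_result


# Hypothetical handlers: the old code hides a bulk discount nobody documented.
def old_price(quantity: int) -> int:
    return quantity * 9 if quantity >= 100 else quantity * 10


def new_price(quantity: int) -> int:
    return quantity * 10


discrepancies = []
for quantity in (5, 100):
    shadow_compare(quantity, old_price, new_price,
                   lambda *args: discrepancies.append(args))
print(discrepancies)  # [(100, 900, 1000)]
```

The recorded tuple is the interesting artifact: it points at the input where the old code encodes behavior the new code has not yet reproduced.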

Common traps

  • Underestimating the institutional knowledge in the old code. The messy parts often encode hard-won lessons. Treat surprising code as a question, not a mistake.
  • "Done in six months." It almost never is. If the estimate feels comfortable, it is probably wrong.
  • Killing the old system before the new one is proven. This removes your only safety net at the riskiest moment.
  • Letting the original team disengage. They are the source of the undocumented behavior you are trying to reproduce. Keep them involved.
  • Adding "while we're at it" features. A ground-up rebuild is already the high-risk option; piling new scope on top makes the comparison impossible and the timeline fictional.

Signals it's working

  • The new system's outputs match the old system's across real traffic, including the edge cases.
  • Discrepancies are shrinking as you reproduce undocumented behavior.
  • The rollback path remains available and untouched until the very end.
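One way to watch the first two signals is to track the mismatch rate per comparison window and check that it is not growing. The sketch below is an assumption about how such tracking might look; the `mismatch_rates` and `is_shrinking` helpers and the sample counts are hypothetical, not figures from a real migration.

```python
def mismatch_rates(windows: list[tuple[int, int]]) -> list[float]:
    """Convert (requests_compared, requests_mismatched) pairs into rates."""
    return [mismatched / compared if compared else 0.0
            for compared, mismatched in windows]


def is_shrinking(rates: list[float]) -> bool:
    """True when no window has a higher mismatch rate than the one before it."""
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))


# Hypothetical weekly counts: (requests compared, requests that differed).
weekly = [(1000, 50), (1000, 20), (2000, 10)]
rates = mismatch_rates(weekly)
print(rates, is_shrinking(rates))
```

A rate that stalls above zero is itself information: it usually means some undocumented behavior has not yet been identified, which argues against cutting over.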

How it ends

It ends in a fixed order: the new system matches the old one in production long enough to earn trust, you cut over, and only then do you decommission the old system and retire the rollback path. If you find yourself here often, that is a signal to revisit the earlier sub-plays, because a ground-up rebuild is rarely the right call.