AI Coding Tools Work Best Small and Supervised

Autonomy degrades as scope grows — slice the work and stay in the loop

If you have actually pushed AI coding tools hard, you have probably noticed the same thing I have: the quality of what you get back is almost inversely proportional to the size of what you asked for. Hand a tool a whole application and you get a skeleton. Hand it a single screen with clear context and you get a genuinely useful starting point. That is not a quirk of one tool — it is a pattern worth designing around.

The scope effect

Ask a general-purpose assistant to build everything at once — every application, full test coverage, infrastructure, documentation, a downloadable bundle — and the output tends to collapse into boilerplate, with the most ambitious deliverables broken or missing entirely. Narrow the same request to a meaningful subset and the results improve. Narrow it again to one small, well-defined piece and the tool often produces something you would happily build on.

The capability did not change between those three asks. The scope did. (If you want to measure this for a specific tool rather than take my word for it, there is a repeatable method for laddering the scope down and finding the point where a tool earns its keep.)

Why this happens

A few forces stack up as scope grows:

Ambiguity compounds. A big, open request has a thousand unstated decisions; each one is a chance to drift from what you wanted.
Errors cascade. In a large generated artifact, an early wrong assumption poisons everything downstream, and there is no test or human checkpoint to catch it.
It fights the model's strengths. These tools are strong at pattern completion over a bounded context and weak at long-horizon planning and integration. Small slices play to the strength; whole systems expose the weakness.
Verification gets harder. You can actually review a single function. Nobody can meaningfully review an entire generated codebase in one sitting, so problems hide.
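
A back-of-the-envelope sketch of the compounding point, under two loud assumptions: every unstated decision is independent, and each has the same chance of landing the way you intended. Neither holds in practice, the 0.98 figure is made up for illustration rather than measured, and `chance_all_decisions_match` is a hypothetical helper, not something from any tool.

```python
def chance_all_decisions_match(per_decision_match: float, decisions: int) -> float:
    """Chance that every unstated decision lands as intended.

    Assumes independent decisions with an identical match probability,
    which is a simplification for illustration only.
    """
    if not 0.0 <= per_decision_match <= 1.0:
        raise ValueError("per_decision_match must be between 0 and 1")
    if decisions < 0:
        raise ValueError("decisions must not be negative")
    return per_decision_match ** decisions


if __name__ == "__main__":
    # Illustrative sizes: a single function, a screen, a whole application.
    for decisions in (5, 50, 500):
        chance = chance_all_decisions_match(0.98, decisions)
        print(f"{decisions:>3} unstated decisions: {chance:.4f}")
```

With those made-up inputs, five decisions come out right about 90% of the time and fifty come out right about 36% of the time. The exact numbers do not matter; the shape does.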

How to work with it

Slice the work small. Break the job into the smallest well-specified units you can. The same discipline that keeps stories small pays off here.
Give concrete context. Examples, types, and a hand-built baseline beat a long prose description every time.
Keep a human directing. Treat the tool as a fast, tireless pair-programming partner, not an autonomous builder. The engineer owns architecture, integration, and the final call.
Treat output as a draft. It is a starting point to verify and refine, not a deliverable to ship. Hold it to the same review and testing bar as any other code. That discipline is also the most environmentally and economically sensible way to spend the tool's effort.
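
To make the four habits above concrete, here is a minimal sketch of what one small slice might look like: a typed signature, two worked examples in the docstring as context, a draft body, and a human-owned check it has to pass. The function `slugify_title`, its behavior, and the `check_draft` helper are all hypothetical, chosen only to illustrate the shape of a well-specified ask.

```python
import re
import unicodedata


def slugify_title(title: str, max_length: int = 60) -> str:
    """Hypothetical slice: turn a page title into a URL slug.

    Concrete examples handed to the tool as context:
        "Hello, World!"  ->  "hello-world"
        "  Déjà vu  "    ->  "deja-vu"
    """
    # Draft body: the part a tool might fill in and a human then reviews.
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    # Truncate, then trim any hyphen left dangling by the cut.
    return slug[:max_length].rstrip("-")


def check_draft() -> bool:
    """Human-owned acceptance check: the draft meets the same bar as any code."""
    cases = {
        "Hello, World!": "hello-world",
        "  Déjà vu  ": "deja-vu",
        "": "",
    }
    return all(slugify_title(given) == expected for given, expected in cases.items())


if __name__ == "__main__":
    print("draft accepted" if check_draft() else "draft needs another pass")
```

The point is less the slug logic than the packaging: the signature and examples remove most of the unstated decisions, and the check gives the engineer a fast way to accept or reject the draft.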

A snapshot, not a verdict

Everything here is a point-in-time observation, and the point in time keeps moving — fast. The specific limits will loosen as models improve, and the threshold where a tool becomes useful will creep upward. What is unlikely to change soon is the shape of the curve: smaller and more supervised will keep beating bigger and more autonomous for a while yet. Re-test periodically, and let the evidence, not the hype, tell you where the line currently sits.