Evaluating Whether an AI Tool Can Actually Build It

A scope-laddering method for testing what an AI tool can really do


Vendor demos and benchmark numbers will not tell you whether an AI tool can do your work. The only honest answer comes from running it against a realistic task and watching where it succeeds and where it falls apart. This play is a repeatable way to do exactly that — and because AI capability moves fast, it is designed to be re-run, not done once.

When to use this play

Reach for this whenever you are deciding if an AI tool is good enough for a real job — for example, "can this assistant generate a usable starting point for an application from a prompt and some design images?" It complements the tool evaluation play, which covers security and licensing; this one measures raw capability.

How to run it

The core idea is to start by asking for everything, then ladder the scope down step by step, so you find the exact point where the tool becomes genuinely useful.

Build a small reference by hand. Implement a slice of the target yourself first. Without a baseline you have nothing to compare the AI's output against.
Prepare realistic inputs. Gather the same materials you would hand a person — for instance, images of the screens, and a clear statement of the stack and tests you expect.
Write a maximal prompt. Ask for the whole thing: every application, full test coverage, infrastructure, a single repository, documentation, and a downloadable artifact. This is the ambitious end of the ladder.
Run it across several tools. Give the same maximal prompt to multiple tools and models. Evaluate each result against your baseline.
Reduce to a mid-scope prompt. Cut the ask down to a meaningful subset and run it across the same tools again.
Reduce to the minimum viable ask. Now request just one small, well-defined piece — a single screen or function. Run it across the tools one more time.
Compare and eliminate. Score the tools on flexibility, input limits (such as how many images they accept), and how usable the output actually is. Drop the weak ones early and keep testing the promising ones.
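
The steps above can be sketched as a small run matrix plus a scoring rubric. This is a minimal illustration, not a prescribed harness: the tool names, the rung labels, the 0-5 scale, and the helper functions `run_plan` and `rank_tools` are all assumptions made up for the example. Scores are entered by a human reviewer after comparing each output with the hand-built baseline.

```python
from dataclasses import dataclass

# All names below are hypothetical placeholders for illustration.
RUNGS = ["maximal", "mid", "minimal"]   # widest scope first
TOOLS = ["tool_a", "tool_b", "tool_c"]


@dataclass(frozen=True)
class Score:
    """One reviewer's rating of one output; each criterion uses an assumed 0-5 scale."""
    flexibility: int
    input_limits: int   # higher means fewer constraints, e.g. more images accepted
    usability: int      # judged against the hand-built baseline

    def total(self) -> int:
        return self.flexibility + self.input_limits + self.usability


def run_plan(tools, rungs):
    """List every (rung, tool) pair so each tool sees the same prompt at each rung."""
    return [(rung, tool) for rung in rungs for tool in tools]


def rank_tools(scores):
    """Rank tools by summed score across all rungs, best first.

    `scores` maps (rung, tool) -> Score. Ties are broken alphabetically.
    """
    totals = {}
    for (_rung, tool), score in scores.items():
        totals[tool] = totals.get(tool, 0) + score.total()
    return sorted(totals, key=lambda t: (-totals[t], t))
```

Ranking on the sum across rungs, rather than on the maximal rung alone, keeps a tool in play when it fails the big request but does well on a narrow one.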

What the results tend to reveal

Patterns recur across this kind of testing, even as the specific tools change:

The maximal ask tends to disappoint. Asking a general-purpose assistant to produce a complete, usable application from a prompt and images alone usually yields skeleton or boilerplate output — and the "downloadable artifact" is often broken, empty, or never produced at all.
Usefulness climbs sharply as scope shrinks. The same tool that flails on a whole app frequently produces a genuinely good starting point when asked for a single screen with clear context.
Practical limits bite. Caps on how many images you can upload, and unreliable packaging of downloadable outputs, are common friction points worth noting in your evaluation.
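
One way to turn "usefulness climbs as scope shrinks" into a recorded result is to note, per tool, the widest rung at which its output clears a usefulness bar. The sketch below assumes a 0-5 usability score and a threshold of 3; both are illustrative choices, and `widest_useful_rung` is a hypothetical helper, not part of any tool.

```python
# Rungs ordered from widest scope to narrowest (labels are assumptions).
RUNGS = ["maximal", "mid", "minimal"]


def widest_useful_rung(usability_by_rung, threshold=3):
    """Return the widest rung whose usability score meets the threshold.

    `usability_by_rung` maps a rung label to a reviewer-assigned score.
    Rungs that were never run are skipped. Returns None if no rung qualifies.
    """
    for rung in RUNGS:
        score = usability_by_rung.get(rung)
        if score is not None and score >= threshold:
            return rung
    return None
```

For example, a tool scored 1 at "maximal", 2 at "mid", and 4 at "minimal" would be recorded as useful at "minimal" only, which is the scope to hand it in practice.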

The durable conclusion is that these tools are most effective on small, well-scoped requests in collaboration with an experienced engineer — not as autonomous end-to-end builders. That principle has its own note.

Common traps

No baseline. Without something you built by hand, "good output" is just a vibe.
Testing only the maximal ask. If you stop after the tool fails the big request, you miss the scope where it actually shines.
Treating the result as permanent. This is a snapshot. Re-run the ladder periodically, because capability advances quickly.
Naming-and-shaming a vendor on stale results. A verdict from one quarter can be wrong the next; record the date and the version you tested.
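
The last two traps suggest storing each verdict with its test date and tool version, then flagging it for a re-run once it ages. A minimal sketch follows, assuming a 90-day shelf life; that figure, the `Verdict` record, and `is_stale` are illustrative assumptions rather than anything the method prescribes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """A dated snapshot of one tool's result (hypothetical record layout)."""
    tool: str
    version: str                        # the exact version or model you tested
    tested_on: date
    widest_useful_rung: Optional[str]   # None if the tool was not useful at any rung


def is_stale(verdict: Verdict, today: date, max_age_days: int = 90) -> bool:
    """True when the verdict is older than the assumed shelf life and should be re-run."""
    return today - verdict.tested_on > timedelta(days=max_age_days)
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.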

Signals it's working

You walk away knowing the precise scope at which a tool earns its keep, you can justify adopting or rejecting it with evidence rather than marketing, and you have a method you can re-run the next time a new model lands.