Actually Measuring Whether AI Improves Productivity

Beware self-reported gains, lean on delivery and flow metrics, and expect the individual-vs-team gap

Everyone wants a single number that proves AI made the team faster. There isn't one, and chasing it produces a figure that is either flattering and wrong or rigorous and useless. The honest goal is converging signals: when several independent measurements point the same way, that's real. When they disagree, you've learned something more valuable than any one number could tell you. This note is about how to set that up. It assumes you already accept the framing in DORA Metrics and AI as an Amplifier, that AI magnifies whatever your delivery system already is, and goes one level deeper into measurement.

Start with a clean comparison window

Adoption is almost never a flag day. People picked up the tools at different times, so pick a two-to-three-month transition window and call everything before it "pre-AI" and everything after "post-AI." Give yourself at least twelve months on each side. Anything shorter and you are measuring quarterly variance, not the effect of AI.
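As a minimal sketch of that labeling step, the function below tags any date as pre-AI, transition, or post-AI. The window dates are hypothetical placeholders, not values from this note; substitute whatever window fits your own adoption history.

```python
from datetime import date

# Hypothetical transition window; replace with your team's own dates.
TRANSITION_START = date(2024, 1, 1)
TRANSITION_END = date(2024, 3, 31)


def label_period(day: date) -> str:
    """Classify a date relative to the adoption window."""
    if day < TRANSITION_START:
        return "pre-AI"
    if day <= TRANSITION_END:
        return "transition"
    return "post-AI"
```

Records that fall in the transition window are labeled separately so they can be excluded from both sides of the comparison.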

Then resist the urge to look only at aggregates. Aggregate metrics drown out the signal because work isn't homogeneous. Pick a few comparable cohorts of work that recur across the whole window:

Bug-fix tickets, the cleanest comparable unit, since a bug is roughly a bug.
Small, bounded feature tickets, scoped by estimate or label.
One long-running project that spans the transition end to end.

Segment engineers the same way. Some adopted hard, some barely. Do not average them. Comparing heavy adopters to light adopters gives you a quasi-control group inside your own team.
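One way to use that quasi-control group is a simple difference-in-differences: take the change in median cycle time for heavy adopters and subtract the change for light adopters. This is a swapped-in illustration rather than a method this note prescribes, and the records, labels, and function names below are made up for the example.

```python
from statistics import median

# Hypothetical records: (cohort, period, adopter_group, cycle_time_days).
tickets = [
    ("bug-fix", "pre-AI", "heavy", 4.0),
    ("bug-fix", "pre-AI", "heavy", 6.0),
    ("bug-fix", "post-AI", "heavy", 3.0),
    ("bug-fix", "post-AI", "heavy", 3.0),
    ("bug-fix", "pre-AI", "light", 5.0),
    ("bug-fix", "pre-AI", "light", 7.0),
    ("bug-fix", "post-AI", "light", 5.0),
    ("bug-fix", "post-AI", "light", 6.0),
]


def median_cycle_time(records, cohort, period, group):
    """Median cycle time for one cohort, period, and adopter group."""
    values = [d for c, p, g, d in records if (c, p, g) == (cohort, period, group)]
    return median(values)


def adoption_effect(records, cohort):
    """Heavy adopters' change minus light adopters' change, in days."""
    change = {}
    for group in ("heavy", "light"):
        before = median_cycle_time(records, cohort, "pre-AI", group)
        after = median_cycle_time(records, cohort, "post-AI", group)
        change[group] = after - before
    return change["heavy"] - change["light"]


effect = adoption_effect(tickets, "bug-fix")
```

A negative result means heavy adopters' cycle time fell by more than light adopters' did. It does not remove selection effects: if the heavy adopters were already on a different trajectory, the difference will reflect that too.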

Lean on flow and delivery metrics, not vibes

The quantitative backbone comes from two well-established frameworks. The four DORA delivery metrics (lead time for changes, deployment frequency, change failure rate, and time to restore service) are validated across a decade of research. The SPACE framework exists precisely because productivity can't be captured by a single metric; it spans satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow, and its authors recommend tracking at least three of those dimensions at once.

From version control, pull per-repo, per-month distributions (not just averages): pull-request cycle time, PR size, time-to-first-review, and review iterations per PR. Add code churn, the percentage of lines rewritten shortly after they were committed. Churn is the canary. Industry analysis of hundreds of millions of lines found churn climbing while refactored "moved" code fell, suggesting speed bought with rework. If your churn rises in lockstep with adoption, you are paying for velocity with cleanup.
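To make "distributions, not just averages" concrete, here is a rough sketch that groups pull requests by merge month and reports the median and 90th percentile of cycle time. It assumes you have already exported each PR as an (opened, merged) pair of datetimes; the function names and the nearest-rank percentile definition are choices made for this example.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def monthly_cycle_times(prs):
    """Median and P90 cycle time in hours, keyed by merge month (YYYY-MM)."""
    by_month = defaultdict(list)
    for opened, merged in prs:
        hours = (merged - opened).total_seconds() / 3600
        by_month[merged.strftime("%Y-%m")].append(hours)
    return {
        month: {"median_h": median(hours), "p90_h": percentile(hours, 90)}
        for month, hours in sorted(by_month.items())
    }


# Hypothetical sample: three PRs merged in the same month.
sample_prs = [
    (datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 13)),
    (datetime(2024, 5, 2, 9), datetime(2024, 5, 3, 9)),
    (datetime(2024, 5, 6, 9), datetime(2024, 5, 6, 11)),
]
stats = monthly_cycle_times(sample_prs)
```

Watching the P90 alongside the median shows whether a faster typical PR is hiding a growing tail of slow ones.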

From your ticketing system, pull cycle time as median and P90, estimate-versus-actual, and throughput per cohort. From deployment logs, pull deployment frequency, rollbacks or hotfixes within 24 hours as a change-failure proxy, and bugs filed within 30 days of a release.
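The change-failure proxy can be sketched as the share of deployments followed by a rollback or hotfix within 24 hours. The sketch assumes two lists of timestamps taken from your deployment logs; the names are illustrative. As a simplification, one remediation can count against more than one deployment if deployments are less than 24 hours apart.

```python
from datetime import datetime, timedelta


def change_failure_proxy(deploys, remediations, window=timedelta(hours=24)):
    """Fraction of deployments followed by a rollback or hotfix within `window`."""
    if not deploys:
        return 0.0
    failed = sum(
        1
        for deployed_at in deploys
        if any(deployed_at < fixed_at <= deployed_at + window for fixed_at in remediations)
    )
    return failed / len(deploys)


# Hypothetical timestamps for illustration only.
deploys = [
    datetime(2024, 5, 1, 10),
    datetime(2024, 5, 3, 10),
    datetime(2024, 5, 5, 10),
    datetime(2024, 5, 8, 10),
]
remediations = [
    datetime(2024, 5, 1, 15),  # 5 hours after the first deploy: counts
    datetime(2024, 5, 6, 12),  # 26 hours after the third deploy: does not count
]
rate = change_failure_proxy(deploys, remediations)
```

Compute this per month or per period rather than once over the whole history, so it can be set next to the throughput figures for the same span.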

Self-reported data is a pillar, not a weakness

A common mistake is dismissing surveys as soft. They are an explicit pillar of SPACE. The trick is honesty about confidentiality: on a small team, don't pretend a survey is anonymous when it can't be. Make it confidential to a named one or two people and say so. Ask six to eight questions and repeat them quarterly, because the trend matters far more than any single round. Useful prompts:

What share of your coding now involves AI?
Where does AI most clearly save you time?
Where does it create rework or pull you off course?
Are you shipping more, less, or about the same value per week versus a year and a half ago?
Has your confidence in the code you merge gone up, down, or flat?
Where do you not trust AI suggestions and have to slow down to compensate?

If you have business outcomes, use them

A consultancy or product team with clean financial and delivery data has access most organizations lack. Track estimate variance (actual over estimated hours by engagement type); if AI is genuinely helping, this should trend below 1.0 on similar work over time. Track margin by engagement type, because AI gains either show up there or are being eaten somewhere. And track warranty or post-launch defect hours as a share of project hours, which is exactly where a "faster but less stable" pattern would surface.
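Estimate variance is simple arithmetic, sketched below under the assumption that each engagement is a (type, estimated hours, actual hours) record. This version pools hours within each type before dividing, so large engagements weigh more. Averaging per-engagement ratios is an equally defensible choice that gives different numbers.

```python
from collections import defaultdict


def estimate_variance(engagements):
    """Total actual hours over total estimated hours, per engagement type."""
    totals = defaultdict(lambda: [0.0, 0.0])  # type -> [estimated, actual]
    for kind, estimated, actual in engagements:
        totals[kind][0] += estimated
        totals[kind][1] += actual
    return {kind: actual / estimated for kind, (estimated, actual) in totals.items()}


# Hypothetical engagements for illustration only.
engagements = [
    ("website", 100, 120),
    ("website", 100, 80),
    ("integration", 50, 40),
]
variance = estimate_variance(engagements)
```

A value of 1.0 means work landed on estimate, and below 1.0 means it came in under. Compare the same engagement type across periods rather than across types, and check that estimates themselves have not been quietly padded or tightened since adoption.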

What to plan around

Five findings worth designing against:

The vacuum hypothesis. Per-task speedups can be real while reclaimed time gets absorbed by lower-value work, so per-quarter throughput barely moves. Both numbers can be true at once.
A throughput-versus-stability tradeoff. Research has associated rising AI adoption with small decreases in delivery throughput and stability. Never measure speed without measuring quality, or you'll find what you went looking for and miss the cost.
The individual-versus-team gap. Telemetry across thousands of developers shows individual output rising sharply while organizational delivery stays flat. Your survey may say "yes, faster" while your DORA metrics say "no change." That's not a contradiction; it's the most likely result.
Goodhart's Law. The moment people know you track PR throughput, throughput goes up by being gamed. Frame the whole effort as understanding, not evaluating, and never put these metrics into individual performance reviews.
Selection effects. Heavy adopters may simply be your strongest engineers, AI or not. Keep that in mind when comparing groups.

A note on individual metrics

A useful discipline carries over from general performance measurement: use quantitative metrics for teams and projects, but keep individual assessment qualitative. Team-level signals like deployment frequency, change failure rate, cycle time, defect rate, and satisfaction scores are fair game for tracking over time. Individual contribution is better evaluated through qualitative dimensions and feedback loops than through a productivity counter, both because the numbers are easy to game and because the amplifier dynamics live in the system, not the person.

If you only do one thing

Pull bug-fix cycle time and PR time-to-merge for at least the last two years, or however far back you need to cover both sides of your comparison window. Plot them as monthly rolling averages, mark your adoption window on the chart, and look at the slope on each side. It's cheap, it's honest, and it tells you whether the rest of this is worth the investment.
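If you would rather compute the slopes than eyeball them, a rough sketch follows. It assumes one value per month (for example, median bug-fix cycle time) in chronological order, with the adoption window given as a pair of list indices. The function names, the three-month smoothing window, and the sample series are all illustrative.

```python
def rolling_mean(values, window=3):
    """Trailing rolling mean; early points average whatever history exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def slope(values):
    """Least-squares slope of values against their index (units per month)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def slopes_around_window(monthly, start, end):
    """Slopes before index `start` and after index `end` (window is inclusive)."""
    return slope(monthly[:start]), slope(monthly[end + 1:])


# Hypothetical monthly cycle times in days; months 4 and 5 are the adoption window.
monthly = [10, 10, 10, 10, 9, 8, 7, 6, 5, 4]
before, after = slopes_around_window(monthly, start=4, end=5)
smoothed = rolling_mean(monthly)
```

Each side needs at least two months of data for the slope to be defined. The slopes here are taken on the raw monthly values, with the rolling mean kept for plotting, because smoothing before fitting a line makes neighboring points depend on each other.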