Experienced developers in a controlled trial felt roughly 20% faster with AI tools, but the stopwatch showed them running 19% slower—a nearly 40-point inversion between feeling and fact.
The study that broke the productivity gauge
METR ran a randomized controlled trial on 16 experienced open-source developers working in codebases they knew well, using current frontier AI tools. Before the tasks, developers expected a speedup. Afterward, they reported feeling about 20% faster. The clock said they were 19% slower. The people most confident the tool was helping were the ones it was measurably slowing down.
The authors are careful: 16 developers over 246 tasks doesn't prove AI slows everyone. The effect flips positive for juniors and greenfield work. But the self-report-versus-stopwatch inversion is the part that should keep every engineering leader up at night.
What the larger datasets show
Faros AI looked across more than 10,000 developers: pull requests merged up 98%, PR size up over 150%, review time up 91%, for roughly zero net change in delivery. 31% of PRs merged with no review at all. DORA found higher AI adoption associated with a measurable drop in delivery stability, with damage persisting into this year. GitClear, reading 200 million changed lines, found copy-pasted code rising, code churn rising, and refactoring collapsing to under 10% of changes—2024 being the first year developers pasted more code than they reorganized.
The pattern is identical across every dataset: more generated, more merged, more churned. Same amount delivered, shakier when it lands.
The real bottleneck is verification, not generation
Generation got cheap. Verification got expensive. We removed the old bottleneck and shipped the work straight into a new one: review. The volume exploded at the exact stage we didn't re-staff, and the dashboards we trust can't see the cost because it lands downstream in incidents and churn and reviewer burnout, on a different page from the velocity chart everyone is cheering.
The tooling market already conceded this point. Windsurf—the agent-first IDE—was acquired by the maker of Devin after Google stripped its founders and core researchers into DeepMind. The most aggressive bet in the tooling space is a bet that the job is now verification: sitting at a dashboard and reviewing what agents produced.
This is likely the dip in a J-curve, not the destination. New tools cost before they pay. Juniors and new code (growing share of output) show positive effects. DORA's throughput has started recovering even as stability lags. But the discipline for anyone running a team is simple: stop steering by how fast it feels. The feeling is the one number we now know reads backward. Measure what reaches production and stays standing. Re-staff the stage where the work piles up. Treat any productivity claim that lives in a feeling as unproven until the stopwatch agrees.
Source: The gauge broke: devs felt 20% faster with AI, measured 19% slower
Domain: intrepidkarthi.com
Comments load interactively on the live page.