All 64 runs passed hidden acceptance checks. Zero scope violations. Explicit contracts for AI coding agents didn't make the code better - they made the code easier to review.
The paper's authors built a dependency-free TypeScript API task environment with seeded defects and documentation gaps. Ten tasks across five families, 64 agent executions across two model tiers, under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric - 192 reviews in total.
The Punchline: Correctness Washed Out, Reviewability Won
Explicit contracts did not improve objective task outcomes. Zero failures across the board. But they did something more interesting: they made the outputs reviewable. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66). Reviewer ambiguity dropped (p = 0.035). Changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when the contract demanded them.
I'd call that a signal, not a noise. The contract doesn't teach the agent to code better - it forces the agent to structure its output for human consumption. That's a different problem entirely, and arguably the harder one when you're shipping into a real repo.
The Cost of Clarity: 13% More Tokens, 38% More Wall Time
The gains came at a price. Contracts cost +13% agent tokens and +38% wall-clock time. The weaker model tier bore the larger overhead. On these small tasks, delegation contracts bought reviewability rather than correctness. The trade-off is explicit: you pay in latency and compute to get outputs that a reviewer can actually evaluate without re-running the entire thing.
The authors note the study is small (64 runs, 192 reviews) and the tasks are deliberately scoped. But the effect sizes are real, and the direction is clear. If you're building a pipeline where AI agents push code for human review, the contract structure may matter more than the agent's raw coding ability. The next step is scaling this to larger, messier codebases and seeing whether the reviewability gains hold when the contract covers half a sprint's work, not a single bug fix.
Source: Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work
Domain: arxiv.org
Comments load interactively on the live page.