AI Agents Flunk Enterprise Java Migration with Under 10% Success Rate

Fewer than 10% of AI-generated enterprise Java framework migrations pass behavioral validation, and the agents themselves think they're doing much better. IBM Research just released ScarfBench, a benchmark that measures whether coding agents can actually ship working software across three major Java ecosystems: Spring, Jakarta EE, and Quarkus.

Why Framework Migration Is Harder Than Bug Fixing

Bug-fixing benchmarks like SWE-bench have shown impressive progress. Framework migration is a different beast. It's not just replacing annotations; it requires translating dependency injection, persistence configuration, queries, and framework descriptors. A single mistake in any of those pieces can break the build, the deployment, or the runtime behavior. ScarfBench forces agents to produce applications that build, deploy, and pass behavioral validation, not just generate code that looks right.
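To make the "not just replacing annotations" point concrete, here is a minimal sketch of a naive textual rewrite from Spring to Jakarta EE. It is not part of ScarfBench. The class name, the two-entry mapping, and the sample source are assumptions chosen for illustration, and the correspondences are only rough.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a naive textual annotation rewrite (not a ScarfBench tool).
class NaiveAnnotationRewrite {

    // Rough Spring -> Jakarta EE (CDI) correspondences; an assumption for illustration.
    static final Map<String, String> SPRING_TO_JAKARTA = new LinkedHashMap<>();
    static {
        SPRING_TO_JAKARTA.put("@Autowired", "@Inject");
        SPRING_TO_JAKARTA.put("@Service", "@ApplicationScoped");
    }

    // Replaces each known Spring annotation with its rough Jakarta counterpart.
    static String rewrite(String source) {
        String result = source;
        for (Map.Entry<String, String> entry : SPRING_TO_JAKARTA.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String spring =
            "import org.springframework.stereotype.Service;\n"
          + "@Service\n"
          + "class OrderService {\n"
          + "    @Autowired OrderRepository repo; // hypothetical Spring Data repository\n"
          + "}\n";
        // The annotations change, but the Spring import, the repository type, and
        // every configuration file are untouched, so the result would not compile.
        System.out.println(rewrite(spring));
    }
}
```

The output still imports a Spring class and still depends on a Spring-style repository, which is the kind of leftover that only surfaces once the application is actually built and deployed.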

ScarfBench: Build, Deploy, Validate

The benchmark contains 34 real-world applications, with 102 framework implementations across Spring, Jakarta EE, and Quarkus, yielding 204 migration tasks. That's ~151,000 lines of code, ~2,000 source and test files, and 1,331 expert-written behavioral tests. Unlike traditional benchmarks that compare generated code against a reference, ScarfBench runs the full pipeline: compile, deploy, test. If the application doesn't boot, it's a failure.
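The task counts above fit a simple reading: each application exists in all three frameworks, and each implementation is migrated to each of the other two. That reading is an assumption inferred from the numbers, not a statement from the benchmark authors; the sketch below just checks the arithmetic.

```java
// Sketch of the task-count arithmetic; the pairing rule is an assumed reading.
class MigrationTaskCount {

    // One implementation per (application, framework) pair.
    static int implementations(int applications, int frameworks) {
        return applications * frameworks;
    }

    // Assumption: every implementation is migrated to each other framework.
    static int migrationTasks(int applications, int frameworks) {
        return implementations(applications, frameworks) * (frameworks - 1);
    }

    public static void main(String[] args) {
        System.out.println(implementations(34, 3)); // prints 102
        System.out.println(migrationTasks(34, 3));  // prints 204
    }
}
```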

The Numbers: 2,000 Files, 1,331 Tests, Under 10% Success

Even the strongest frontier agents struggle. Compile success rates are decent: agents can usually produce code that compiles. But the gap between compile and deploy is wide, and the gap between deploy and behavioral success is a canyon. Across all agents, behavioral success rates sit below 10%. Jakarta EE migrations prove especially punishing. Build success alone dramatically overestimates migration quality.
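The compile, deploy, and behavioral stages act as successive gates, which is why the counts shrink at each step. A minimal sketch of that gating follows. The class, the `Attempt` type, and the ten attempts in `main` are invented for illustration and are not ScarfBench data or its actual harness.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of staged validation: an attempt only counts for a stage
// if it also cleared every earlier stage.
class StagedValidation {

    enum Result { COMPILE_FAILED, DEPLOY_FAILED, TESTS_FAILED, PASSED }

    // Hypothetical record of one migration attempt.
    static final class Attempt {
        final boolean compiled;
        final boolean deployed;
        final boolean testsPassed;

        Attempt(boolean compiled, boolean deployed, boolean testsPassed) {
            this.compiled = compiled;
            this.deployed = deployed;
            this.testsPassed = testsPassed;
        }
    }

    // Reports the first stage at which the attempt failed, or PASSED.
    static Result classify(Attempt a) {
        if (!a.compiled) return Result.COMPILE_FAILED;
        if (!a.deployed) return Result.DEPLOY_FAILED;
        if (!a.testsPassed) return Result.TESTS_FAILED;
        return Result.PASSED;
    }

    // Counts attempts that got past the given failure stage.
    static long countBeyond(List<Attempt> attempts, Result failedStage) {
        long count = 0;
        for (Attempt a : attempts) {
            if (classify(a).ordinal() > failedStage.ordinal()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up outcomes: 8 of 10 compile, 4 deploy, 1 passes its tests.
        List<Attempt> attempts = Arrays.asList(
            new Attempt(true, true, true),
            new Attempt(true, true, false), new Attempt(true, true, false),
            new Attempt(true, true, false),
            new Attempt(true, false, false), new Attempt(true, false, false),
            new Attempt(true, false, false), new Attempt(true, false, false),
            new Attempt(false, false, false), new Attempt(false, false, false));
        System.out.println("compiled: " + countBeyond(attempts, Result.COMPILE_FAILED));
        System.out.println("deployed: " + countBeyond(attempts, Result.DEPLOY_FAILED));
        System.out.println("passed:   " + countBeyond(attempts, Result.TESTS_FAILED));
    }
}
```

Reporting only the first number would make this invented run look 80% successful, while the number that matters is the last one.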

Agents Are Overconfident—Especially Claude Code

ScarfBench compared agent-reported outcomes against independent build verification. The finding: agents misjudge their own work. Claude Code reported successful builds for 29 out of 30 whole-application migrations, but only 22 actually built. The single application Claude Code flagged as failed? It built correctly. Agent self-assessment is not a reliable signal. Independent build and test validation remains essential.
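One way to quantify that mismatch is to treat each self-report as a prediction and score it against the verified build. The sketch below does this under an assumed reading of the figures: 22 of the 29 claimed successes built, 7 did not, and the one claimed failure did build. If the 22 is instead a total across all 30 migrations, the inputs change, but the method stays the same. The class and method names are hypothetical.

```java
// Sketch: scoring agent self-reports against independently verified builds.
class SelfReportCheck {

    // Share of "build succeeded" claims that were actually true.
    static double claimPrecision(int claimedAndBuilt, int claimedButBroken) {
        return (double) claimedAndBuilt / (claimedAndBuilt + claimedButBroken);
    }

    // Share of all migrations where the self-report matched the verified result.
    static double agreement(int claimedAndBuilt, int claimedButBroken,
                            int deniedButBuilt, int deniedAndBroken) {
        int total = claimedAndBuilt + claimedButBroken + deniedButBuilt + deniedAndBroken;
        return (double) (claimedAndBuilt + deniedAndBroken) / total;
    }

    public static void main(String[] args) {
        // Assumed split of the 30 whole-application migrations (see lead-in).
        int claimedAndBuilt = 22;
        int claimedButBroken = 7;
        int deniedButBuilt = 1;
        int deniedAndBroken = 0;

        System.out.println("precision of success claims: "
            + claimPrecision(claimedAndBuilt, claimedButBroken));
        System.out.println("overall agreement: "
            + agreement(claimedAndBuilt, claimedButBroken, deniedButBuilt, deniedAndBroken));
    }
}
```

Under that reading, about three in four success claims hold up, and the agent's only failure report was itself wrong.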

Configuration Dominates Migration Effort

Agents don't migrate linearly. The most frequently visited layers are configuration, web, database, and service. Configuration files get revisited again and again as agents resolve framework differences and dependency issues. Environment issues—Docker cache inconsistencies, missing runtime dependencies—also trip them up. The migration is an iterative dependency-resolution process, not a simple source-to-source transform.

ScarfBench is openly available on GitHub. It could become a standard for measuring whether AI can actually ship working enterprise software, not just generate plausible-looking diffs.

Source: ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
Domain: huggingface.co