
PAIR-Bench Scores LLMs on Code Refinement, Not Just Pass/Fail

PAIR-Bench uses progressive hinting with two controls to measure repair trajectories, catching regressions and generalization that binary pass/fail misses.


Binary pass/fail scoring on code generation benchmarks tells you nothing about whether a model actually learns from feedback or just guesses again. That is the core problem PAIR-Bench, introduced in a new arXiv paper, sets out to solve.

PAIR-Bench (Progressive, Adaptive, and Interactive Benchmark) evaluates code improvement: transforming an incorrect or incomplete program into a more correct one through feedback-guided refinement. The model is no longer treated as a black box that either passes a test suite or doesn't.

Why Binary Pass/Fail Misses the Point

Standard evaluation protocols ignore partial progress, feedback use, regressions, and the refinement trajectory. A model might fix one bug but break another, or only pass because the hints were too generous. PAIR-Bench exposes all of that.

The benchmark introduces two controls under a protocol called progressive hinting. Failure-region control groups hidden failing tests into failure scenarios, which determines what the feedback targets. Hint-depth control determines how much repair-relevant information is revealed, ranging from coarse symptoms down to implementation-level guidance.
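To make the two controls concrete, here is a minimal Python sketch. It is not the paper's implementation: the names `HintDepth`, `FailureScenario`, `group_failures`, and `build_hint` are hypothetical, the middle depth level is an assumption (the article only names the coarse and implementation-level extremes), and the scenario labels are assumed to be supplied alongside each failing test.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class HintDepth(IntEnum):
    """Hypothetical depth levels; only the two extremes are named in the article."""
    SYMPTOM = 1         # coarse: what goes wrong
    LOCATION = 2        # assumed middle level: where it goes wrong
    IMPLEMENTATION = 3  # detailed: how to fix it


@dataclass(frozen=True)
class FailureScenario:
    name: str
    test_ids: tuple[str, ...]


def group_failures(failing_tests: dict[str, str]) -> list[FailureScenario]:
    """Failure-region control: group failing test ids by an assumed scenario label."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for test_id, label in failing_tests.items():
        buckets[label].append(test_id)
    return [
        FailureScenario(name, tuple(sorted(ids)))
        for name, ids in sorted(buckets.items())
    ]


def build_hint(scenario: FailureScenario, depth: HintDepth,
               notes: dict[HintDepth, str]) -> str:
    """Hint-depth control: reveal notes only up to the requested depth."""
    revealed = [notes[d] for d in HintDepth if d <= depth and d in notes]
    return f"[{scenario.name}] " + " ".join(revealed)
```

In this sketch the scenario decides which failures a hint talks about, and the depth decides how much of the prepared guidance is shown, so the two controls can be varied independently.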

PAIR-Bench's Progressive Hinting Design

This design lets PAIR-Bench measure whether a model repairs targeted failures, generalizes beyond the hint, and preserves already-correct behavior, as well as how much assistance it requires. You get a trajectory, not a binary score.
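One way to picture these measurements is to compare the set of passing tests before and after a single refinement step. The sketch below is an illustration under assumptions, not the paper's metric definitions: `StepReport` and `compare_step` are hypothetical names, and tests are identified by plain string ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepReport:
    targeted_fixed: int  # hinted tests that now pass
    generalized: int     # newly passing tests the hint did not target
    regressions: int     # tests that passed before and fail now


def compare_step(passed_before: set[str], passed_after: set[str],
                 targeted: set[str]) -> StepReport:
    """Summarize one refinement step from test outcomes (assumed representation)."""
    newly_passing = passed_after - passed_before
    return StepReport(
        targeted_fixed=len(newly_passing & targeted),
        generalized=len(newly_passing - targeted),
        regressions=len(passed_before - passed_after),
    )
```

For example, if tests `a` and `b` passed before, `a`, `c`, and `d` pass after, and the hint targeted `c`, the report shows one targeted fix, one generalized fix, and one regression. A binary score would hide all three.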

Instead of asking "did the final program pass?" you ask "how much help did the model need to get there, and did it break anything along the way?" That's a far more useful signal for anyone shipping code or building AI coding assistants.

What This Means for Evaluating LLM Code Capabilities

PAIR-Bench doesn't just give a single pass rate; it produces a profile of model behavior across refinement steps. This lets researchers pinpoint where models struggle: do they overfit to the hint and regress on unrelated functionality, or do they generalize the fix?

The benchmark is adaptive by design. Models that succeed with minimal hint depth score higher, while those that need the full implementation spoon-fed get penalized. That aligns evaluation with real-world code review, where you don't get to brute-force answers.
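The idea that less help should earn more credit can be sketched with a simple linear discount. This formula is an assumption made for illustration only; the article does not give the benchmark's actual scoring rule, and `assistance_score` is a hypothetical name.

```python
from typing import Optional


def assistance_score(solved_at_depth: Optional[int], max_depth: int) -> float:
    """Hypothetical score: 1.0 when solved at the shallowest hint depth,
    decreasing linearly with depth, and 0.0 when never solved."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    if solved_at_depth is None:
        return 0.0
    if not 1 <= solved_at_depth <= max_depth:
        raise ValueError("solved_at_depth must be between 1 and max_depth")
    return (max_depth - solved_at_depth + 1) / max_depth
```

With three depth levels, a model that succeeds at depth 1 gets full credit, while one that needs depth 3 gets one third. Any monotonically decreasing function would express the same preference; the linear form is just the simplest choice.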

PAIR-Bench makes the case for evaluating code improvement directly, and we should expect future models to be judged on how they refine code, not just on how they generate it.


Source: Benchmarking Code Improvement with Progressive, Adaptive, and Interactive Feedback
Domain: arxiv.org
