
UOJ-Bench: Top LLMs Miss Over 50% of Flawed Code in Hacking Task

A new benchmark built from real competitive programming submissions shows even the best LLMs fail to identify errors in more than half of known-incorrect code. Test-time scaling pushes success past 90%, but at prohibitive cost.

uoj bench, competitive programming, llm code generation, code repair, code hacking, benchmark

More than 50% of incorrect competitive programming submissions were missed by the strongest LLMs under one-shot evaluation in a new benchmark called UOJ-Bench. That's not a toy dataset—these are real human-written solutions from the Universal Online Judge (UOJ), each flagged as incorrect by the platform's community. The benchmark tests three tasks: code generation, code hacking (identifying errors in human code), and code repair. And the results cut through the hype about LLMs as coding assistants.

Three Tasks That Go Beyond Code Generation

Most coding benchmarks stop at "can the model write a correct solution to a problem statement." UOJ-Bench adds two harder skills: spotting bugs in someone else's code (hacking) and then fixing them (repair). The tasks are built from actual UOJ submissions, not synthetic examples, and evaluated directly on UOJ's native judging infrastructure. That means the ground truth is battle-tested by human competitors—no constructed edge cases.

The hacking task is where the models really bleed. Under one-shot prompting, even the best frontier LLMs could not identify errors in more than 50% of submissions that UOJ users had already proven incorrect. For a developer relying on an LLM to review pull requests, that's a hard number to ignore.
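To make the hacking task concrete, here is a minimal sketch of what counts as a successful hack, assuming a hack is simply an input on which a suspect submission disagrees with a trusted reference solution. The maximum-subarray problem, both solutions, and the `find_hack` helper are hypothetical illustrations, not taken from UOJ-Bench or from UOJ's judging code.

```python
from typing import Callable, Iterable, List, Optional


def reference_max_subarray(nums: List[int]) -> int:
    """Trusted solution: Kadane's algorithm, correct for all-negative input."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_max_subarray(nums: List[int]) -> int:
    """Suspect solution: starts from 0, so it never returns a negative sum."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def find_hack(
    reference: Callable[[List[int]], int],
    suspect: Callable[[List[int]], int],
    candidates: Iterable[List[int]],
) -> Optional[List[int]]:
    """Return the first candidate input where the two solutions disagree."""
    for case in candidates:
        if reference(case) != suspect(case):
            return case
    return None


# Hypothetical candidate inputs, e.g. proposed by a model or a human hacker.
candidates = [[1, 2, 3], [4, -1, 2], [-3, -1, -2]]
hack = find_hack(reference_max_subarray, buggy_max_subarray, candidates)
print(hack)
```

The first two candidates pass on both solutions; only the all-negative case exposes the bug. That is the shape of the difficulty: a wrong submission can agree with the reference on most inputs, so the hacker has to reason about where the code breaks rather than sample at random.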

The 50% Failure Floor and the Test-Time Scaling Tax

When the authors cranked up test-time scaling—allowing models to spend more compute per problem—success rates jumped above 90%. But the cost is brutal: the paper notes that the inference expense becomes a practical barrier for large-scale deployment. You don't get that kind of accuracy without burning serious GPU cycles per query. The trade-off between one-shot failure and scaled compute is exactly the kind of engineering constraint that matters in production.
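A rough way to see the shape of this trade-off is to model test-time scaling as repeated independent attempts. That is an assumption made here for illustration only; the article does not say how the authors scaled compute. The per-attempt success rate of 0.5 below is a hypothetical value, not a figure reported by the paper.

```python
def success_after_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def attempts_needed(p: float, target: float) -> int:
    """Smallest k whose cumulative success rate reaches the target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    k = 1
    while success_after_k(p, k) < target:
        k += 1
    return k


p = 0.5        # hypothetical single-attempt success rate
target = 0.9   # desired overall success rate
k = attempts_needed(p, target)
print(k, success_after_k(p, k))  # prints: 4 0.9375
```

Under these assumptions, inference cost grows linearly with the number of attempts, while each extra attempt buys a smaller gain in success rate. Real attempts are rarely independent, so the actual cost of reaching a given success rate may well be higher.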

Still, the results aren't all bad. The best models under test-time scaling managed to uncover errors in over 5% of full-score submissions across roughly 30 problems. That means there are correct-looking solutions that UOJ's test suite passes but an LLM can flag as buggy. In competitive programming, that's a signal the standard judging system is blind to.

Complementary Signals Beyond Standard Judging

UOJ-Bench makes a clear case: LLMs aren't replacements for online judges but complementary error detectors. The hit rate of over 5% on full-score submissions suggests frontier models can catch hidden defects that existing test cases miss. That's useful for educators, platform operators, and anyone running automated code assessment.

The practical takeaway: if you need to identify bugs in human code cheaply, today's LLMs will miss more than half of them under one-shot prompting. If you can afford the compute tax, you can push past 90%, but you'd better have the budget. UOJ-Bench sets a new bar for evaluating LLMs not as code writers, but as code reviewers.


Source: Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
Domain: arxiv.org
