Source linked

AlgoBench Exposes How LLMs Fail at Real Algorithmic Adaptation

A new benchmark transforms competitive programming problems so the original reference algorithm fails; most models can't adapt and many correct-looking solutions miss complexity targets.

algo benchlarge language modelscode generationbenchmarkingalgorithmic reasoningcompetitive programming

High pass rates on HumanEval and LiveCodeBench tell you nothing about whether a model can actually reason through a novel algorithm. AlgoBench, a new framework built from known competitive-programming problems, proves that most LLMs collapse when the textbook solution stops working.

Why HumanEval Pass Rates Mislead

Standard coding benchmarks leak. Once problem statements, editorials, and solution code enter training corpora, later models can cheat by exposure. A model that scores 90% on HumanEval might just be good at pattern matching solutions it has already seen. AlgoBench closes that loophole by applying structured constraint-shifting transformations to source problems. Each accepted variant is traceable to a specific problem, but the transformation guarantees that the original reference algorithm fails. If the model cannot adapt, it fails.

How AlgoBench Forces Real Adaptation

The transformations are not random perturbations. They shift constraints in a way that demands algorithmic rethinking—changing input bounds, mutating data structures, or altering the required output format. For example, a problem that normally expects a greedy approach might be transformed into one requiring dynamic programming under tight space limits. Retrieval-augmented prompting actually hurts here: models that pull up similar solutions from memory tend to reuse the old, broken algorithm rather than reason from scratch.

The Metrics That Matter: Not Just Pass@k

AlgoBench goes beyond pass@$k$. It introduces five complexity-aware metrics: OPTT, OPTS, TRAPRATE, GAPT, and CONSENS. These test not just whether a solution is functionally correct, but whether it meets the required asymptotic complexity. A solution that passes unit tests but runs in $O(n^2)$ on a problem expecting $O(n)$ gets flagged. Across multiple LLMs and prompting strategies, error analysis shows that failures are primarily algorithmic rather than implementation-level—meaning the model understood the syntax but could not invent the right algorithm.

Performance drops sharply as soon as the reference algorithm is invalidated. That drop tells me the field has been overestimating how much LLMs actually understand about algorithms. AlgoBench gives us a sharper tool to measure that gap, and the next step is to see whether training on these transformed problems can force models to develop real adaptive reasoning.


Source: AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.