
Only Strong Teachers Beat Repeated Attempts: A Feedback Study of 13 Models

A controlled student-teacher evaluation across four hard benchmarks shows that self-generated feedback adds nothing beyond unguided retries; the real bottleneck is a student's ability to act on feedback, not just...

Tags: arxiv, feedback-based agents, multi-turn language agents, student-teacher protocol, self-refinement, Omni-MATH

Repeated attempts alone can explain most of the accuracy gains in multi-turn language agent interactions — self-generated feedback adds almost nothing beyond simple retrying.

The Controlled Student-Teacher Protocol That Reveals the Truth

J. Lojek and coauthors designed a protocol to separate feedback-driven improvement from gains that come from resampling, format correction, or extra test-time compute. They ran it across four benchmarks — Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1 — using thirteen open-weight models as both students and teachers. The setup compares external feedback, self-feedback, and unguided self-refinement while varying interaction history, task difficulty, and whether the teacher has privileged information.

Self-Feedback Is a Hollow Signal

Across all settings, self-generated feedback added little beyond just letting the model retry the same question without any guidance. The paper is blunt: multi-turn improvement is often not evidence of feedback use. If a model talks to itself and gets a better answer, it’s usually just benefiting from another pass at the problem, not from any new information. Only the strongest external teachers — models that can actually provide guidance beyond a generic "try again" — produced substantial feedback-specific accuracy gains.
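One way to make this distinction concrete is to subtract the unguided-retry accuracy from the accuracy of each feedback condition, so that only the part of the gain attributable to feedback remains. The Python sketch below is a minimal illustration under stated assumptions, not the paper's code: the function names, the condition labels, and the toy outcomes are all hypothetical.

```python
from statistics import mean


def accuracy(outcomes):
    """Fraction of problems solved; `outcomes` is a list of booleans."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def feedback_specific_gain(results, baseline="retry"):
    """Accuracy of each condition minus the unguided-retry accuracy.

    `results` maps a condition name to per-problem final-turn outcomes.
    The condition names are illustrative, not taken from the paper.
    """
    base = accuracy(results[baseline])
    return {
        name: accuracy(outs) - base
        for name, outs in results.items()
        if name != baseline
    }


# Toy outcomes, invented purely for illustration (not results from the study).
toy = {
    "retry": [True, False, True, False],
    "self_feedback": [True, False, True, False],
    "external_feedback": [True, True, True, False],
}
gains = feedback_specific_gain(toy)
```

In this toy case the self-feedback condition matches the retry baseline exactly, so its feedback-specific gain is zero even though both conditions may look better than a single first attempt.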

The Real Bottleneck: Acting on Feedback, Not Having It

Dense student-teacher interaction matrices revealed something more interesting. Interactive gains are driven more by the student's ability to use feedback than by which teacher is talking. Teacher identity still matters for a fixed student, but the bigger lever is whether the student can actually change its behavior based on what it hears. The authors argue this flips the usual assumption: it’s not about who gives the advice, but whether the agent can take it. They released the full evaluation framework at https://j-lojek.github.io/feedback-generation-is-a-bottleneck/ for anyone to run the same controlled tests on their own agents.
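A simple way to read such a student-teacher matrix is to compare how much the average gain varies across students (rows) with how much it varies across teachers (columns). The sketch below assumes a hypothetical matrix `gain[i][j]` holding the feedback-specific gain of student i paired with teacher j. The toy values are invented so that the student effect dominates, and this is not necessarily the analysis the authors ran.

```python
from statistics import mean, pvariance


def main_effect_variances(gain):
    """Return (variance of per-student means, variance of per-teacher means).

    `gain[i][j]` is the gain of student i when paired with teacher j.
    """
    student_means = [mean(row) for row in gain]
    teacher_means = [mean(col) for col in zip(*gain)]
    return pvariance(student_means), pvariance(teacher_means)


# Toy matrix, invented for illustration: rows are students, columns are teachers.
toy_gain = [
    [0.00, 0.02, 0.04],
    [0.10, 0.12, 0.14],
    [0.20, 0.22, 0.24],
]
student_var, teacher_var = main_effect_variances(toy_gain)
```

A larger row variance than column variance would indicate that the choice of student moves the gain more than the choice of teacher, which is the pattern the article describes.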


Source: What Drives Interactive Improvement from Feedback?
Domain: arxiv.org
