
GRPO Shows No Benefit for Multi-Agent Coordination (p = 0.66)

Group relative policy optimization (GRPO), applied to the dining philosophers problem, yields a Welch's t-test p-value of 0.66: no statistically significant improvement in LLM coordination.

group relative policy optimization, multi agent coordination, dining philosophers, mistral, qwen, deepseek

Across 630 episodes, seven models, and three philosopher counts, GRPO still fails to close the multi-agent coordination gap. Welch's t-test on per-episode reward at five philosophers gives p = 0.66, with a Hedges' g of -0.11.
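For readers who want to see what those two statistics measure, here is a minimal sketch using only the Python standard library. It computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and Hedges' g for two samples of per-episode reward. The function names and the sample values are illustrative assumptions, not the paper's code or data. A p-value can then be read off the t distribution, for example with `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def hedges_g(a, b):
    """Return Hedges' g: Cohen's d with a small-sample bias correction."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    d = (mean(a) - mean(b)) / pooled_sd
    correction = 1 - 3 / (4 * (na + nb) - 9)
    return d * correction


# Made-up rewards purely for illustration (NOT values from the paper).
after_training = [1, 2, 3, 4, 5]
before_training = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(after_training, before_training)
g = hedges_g(after_training, before_training)
```

A negative g, as in the reported -0.11, means the first group's mean is below the second's; a magnitude that small is conventionally read as a negligible effect.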

The Dining Philosophers Benchmark

The authors used the classic dining philosophers problem as a clean test bed for shared-resource coordination. Over 630 episodes, four frontier closed-source systems hit mean rewards between 0.45 and 0.87. Mistral-Small 24B managed 0.83 to 0.99, while Qwen3-14B bottomed out at 0.13 to 0.35. Open-weight models clearly lag, but the question was whether GRPO—group relative policy optimization—could close that gap using rollouts from the task itself.

GRPO's Collapse to Zero Actions

GRPO didn't just fail to improve coordination; it actively made things worse. The four-term reward function admits a degenerate maximum at zero actions—meaning a model that never eats gets a perfect score. DeepSeek-R1-Distill-Qwen-7B and Mistral-Small 24B both fell into this trap at five philosophers, hitting mean reward 1.0 and 0.83 respectively with zero meals. Training reward for both 8B and 14B runs peaked at step nine and then declined steadily. The default saved checkpoint at step 15 is strictly worse than several earlier ones, yet the paper's standard practice is to save the final step.
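The summary does not list the four reward terms, so the sketch below uses four hypothetical terms (no deadlock, no fork conflicts, valid actions, meal fairness) only to show how a reward of this shape can peak when nobody acts. Every name and number here is an assumption for illustration. The second function shows one possible guard, scaling the reward by the fraction of philosophers who ate at least once; it is not a fix proposed by the source.

```python
def toy_reward(meals, deadlocked, conflicts, invalid_actions, total_actions):
    """Hypothetical four-term reward; each term lies in [0, 1] and they are averaged."""
    no_deadlock = 0.0 if deadlocked else 1.0
    # With zero actions there is nothing to conflict with or invalidate.
    no_conflict = 1.0 - (conflicts / total_actions if total_actions else 0.0)
    valid = 1.0 - (invalid_actions / total_actions if total_actions else 0.0)
    # Fairness as meal spread: all-zero meals are "perfectly fair".
    fairness = 1.0 - (max(meals) - min(meals)) / (max(meals) or 1)
    return (no_deadlock + no_conflict + valid + fairness) / 4


def guarded_reward(meals, deadlocked, conflicts, invalid_actions, total_actions):
    """Same toy reward, scaled by the share of philosophers who ate at least once."""
    progress = sum(1 for m in meals if m > 0) / len(meals)
    return progress * toy_reward(
        meals, deadlocked, conflicts, invalid_actions, total_actions
    )


# An idle episode (no actions, no meals) versus an imperfect but active one.
idle = dict(meals=[0, 0, 0, 0, 0], deadlocked=False,
            conflicts=0, invalid_actions=0, total_actions=0)
active = dict(meals=[3, 2, 3, 1, 2], deadlocked=False,
              conflicts=2, invalid_actions=1, total_actions=20)

idle_score = toy_reward(**idle)        # every term is trivially satisfied
active_score = toy_reward(**active)    # penalised for conflicts and uneven meals
```

In this toy setup the idle episode outscores the active one under `toy_reward`, which is the kind of degenerate maximum described above, while `guarded_reward` gives the idle episode zero.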

What Actually Blocks Open-Weight Models

The bottleneck for an open-weight 14B model on multi-agent coordination isn't training compute—it's training methodology. Reward shaping that doesn't collapse to the no-action maximum, checkpoint discipline that doesn't depend on the final step, and curriculum across problem scales are the real missing pieces. A Welch's t-test at ten and fifteen philosophers showed no statistically significant change either. GRPO alone won't fix this. The fix is smarter reward engineering and checkpoint selection—more compute would just accelerate the wrong solution.
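Checkpoint selection that does not depend on the final step can be as simple as keeping the step with the best held-out reward. The sketch below assumes a mapping from training step to mean evaluation reward; the helper name and the reward values are hypothetical and chosen only to mirror the shape described above (a peak at step nine followed by decline).

```python
def select_checkpoint(eval_rewards):
    """Return the step with the highest held-out reward (earliest step on ties).

    eval_rewards: dict mapping training step -> mean evaluation reward.
    """
    if not eval_rewards:
        raise ValueError("no checkpoints to choose from")
    # Iterating steps in ascending order makes max() keep the earliest best step.
    return max(sorted(eval_rewards), key=lambda step: eval_rewards[step])


# Hypothetical evaluation curve, NOT values from the paper.
curve = {3: 0.40, 6: 0.50, 9: 0.62, 12: 0.55, 15: 0.48}
best_step = select_checkpoint(curve)
final_step = max(curve)
```

With this curve, `best_step` is 9 while `final_step` is 15, so saving only the final step would keep a checkpoint that scores lower than an earlier one.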


Source: GRPO Does Not Close the Multi-Agent Coordination Gap
Domain: arxiv.org
