Source linked

GRPO реорганизует маршрутизацию LLM для улучшения кросс-языкового фактического напоминания

Новый 100K-фактовый показатель на 12 языках показывает, что укрепление обучения через GRPO превосходит тонкости для межязычного напоминания фактов и реорганизует многоязычное маршрутизацию, уменьшая языковые потребности.

polyfactqwen 2 5 7bolmo 2 1124 7bgrporeinforcement learningcross lingual factual recall

Qwen-2.5-7B and OLMo-2-1124-7B both learn world knowledge in English but struggle to recall facts in other languages—Group Relative Policy Optimization (GRPO) fixes that better than any fine-tuning approach. I've seen plenty of multilingual fine-tuning attempts, but this paper shows why reinforcement learning is a fundamentally different lever.

PolyFact: A 100K-Fact Multilingual Stress Test

The authors built PolyFact, a parallel multilingual QA dataset with 100,000 Wikidata-grounded facts across 12 typologically diverse languages. That's not a toy: it spans languages from English to Swahili to Japanese, each fact verified against Wikidata to ensure ground-truth correctness. They tested Qwen-2.5-7B and OLMo-2-1124-7B under three regimes—light continual pretraining (CPT) on parallel data, supervised fine-tuning (SFT) on the same, and reinforcement learning via GRPO.

GRPO vs. SFT: Reinforcement Learning Wins

GRPO consistently outperformed SFT, not just on seen languages but on unseen ones too. CPT on parallel data? "Limited additional gains"—hardly worth the compute. The key insight: GRPO isn't just fitting answers; it's optimizing for consistency across languages, which drives generalization that fine-tuning can't match.

What GRPO Does Inside the Model

Mechanistic analyses reveal the internal reorganization. GRPO reduces language specialization in MLP layers and attention heads, pushing the model toward more shared cross-lingual representations. Instead of carving out language-specific circuits, the model learns to route knowledge through language-agnostic pathways. That's a concrete architectural shift, not just a loss-curve improvement.

The authors released the code, models, and dataset—so anyone can replicate or build on it. The mechanistic finding—that GRPO reduces language specialization in MLP layers and attention heads—suggests a path toward genuinely language-agnostic knowledge representations in LLMs.


Source: Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.