Training One Transformer Layer Recovers Full RL Fine-Tuning Gains

Zhang et al. show that fine-tuning just one middle transformer layer with RL matches, and sometimes beats, full-parameter training across Qwen3, Qwen2.5, and three RL algorithms.

Fine-tuning a single transformer layer with reinforcement learning can match — and in some cases exceed — the gains from updating every parameter in the model. That's the core finding from Zhang et al. in their July 2026 paper, tested across seven models, two families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and domains including math reasoning, code generation, and agentic decision-making.

The Middle Layers Carry the Weight

Zhang et al. introduce a metric called layer contribution: the fraction of the full-parameter RL improvement recovered by training just one layer in isolation. The pattern is consistent: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. This holds across datasets, tasks, model families, and RL algorithms, and the resulting layer rankings are strongly correlated between experiments.
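
As a rough illustration of the metric, the sketch below computes layer contribution as (single-layer score minus base score) divided by (full-parameter score minus base score). That formula is one reading of the prose definition, not necessarily the paper's exact formulation, and the function name and every score are invented for illustration.

```python
def layer_contribution(base_score, full_score, layer_score):
    """Fraction of the full-parameter RL gain recovered by training one layer.

    Assumed formula (an interpretation of the prose definition):
        (layer_score - base_score) / (full_score - base_score)
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter RL gave no gain; contribution is undefined")
    return (layer_score - base_score) / full_gain


# Hypothetical benchmark accuracies, NOT results from the paper.
base = 40.0  # model before RL
full = 50.0  # model after full-parameter RL
# layer index -> score after RL training of only that layer
single_layer_scores = {0: 42.0, 12: 51.0, 27: 43.0}

contributions = {
    layer: layer_contribution(base, full, score)
    for layer, score in single_layer_scores.items()
}

# Layers ordered from highest to lowest contribution.
ranking = sorted(contributions, key=contributions.get, reverse=True)
```

Under this reading, a contribution above 1.0 means the single layer did better than full-parameter training, which corresponds to the "in some cases exceed" result.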

Why This Matters for RL Post-Training

Full-parameter RL training is expensive. If you can drop 90% of the parameter updates and still get the same lift, that's a direct reduction in compute, memory, and training time. More importantly, it suggests that the RL adaptation signal is far more localized than the community assumed. Zhang et al. didn't just observe this — they quantified it and showed it generalizes.

What This Unlocks

Targeted fine-tuning of specific layers opens the door to faster iteration cycles, lower-cost alignment runs, and potentially interpretable edits to model behavior by modifying only the layers that matter. The next step is understanding why the middle layers are the ones that soak up RL signal — and whether this pattern holds for other model architectures and reward schemes. If it does, the standard practice of updating all parameters uniformly during RL post-training is overdue for a rethink.
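
One way targeted fine-tuning could be set up is to select trainable parameters by name. The sketch below assumes parameter names of the form `model.layers.<index>.<...>`, which is the convention in Hugging Face Qwen checkpoints as far as we know but is worth verifying for your model. The helper names and the toy parameter counts are hypothetical.

```python
import re

# Assumed naming scheme: "...layers.<index>...." identifies a transformer block.
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(param_name):
    """Return the transformer block index encoded in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_names(param_sizes, target_layer):
    """Names of the parameters that belong to the single layer being trained."""
    return [name for name in param_sizes if layer_index(name) == target_layer]


def trainable_fraction(param_sizes, target_layer):
    """Share of all parameters that would receive gradient updates."""
    selected = trainable_names(param_sizes, target_layer)
    return sum(param_sizes[n] for n in selected) / sum(param_sizes.values())


# Toy model: parameter name -> parameter count. Sizes are invented.
toy_params = {
    "model.embed_tokens.weight": 1000,
    "model.layers.0.self_attn.q_proj.weight": 400,
    "model.layers.0.mlp.up_proj.weight": 600,
    "model.layers.1.self_attn.q_proj.weight": 400,
    "model.layers.1.mlp.up_proj.weight": 600,
    "model.layers.2.self_attn.q_proj.weight": 400,
    "model.layers.2.mlp.up_proj.weight": 600,
    "lm_head.weight": 1000,
}

middle = 1  # train only the middle block of this three-block toy model
selected = trainable_names(toy_params, middle)
fraction = trainable_fraction(toy_params, middle)
```

In a PyTorch training loop, the same name test would typically set `requires_grad` on each entry of `model.named_parameters()`, so the optimizer only keeps state for the chosen layer.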


Source: Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train
Domain: arxiv.org
