
Looped LLM Hidden Norms Soar to Tens of Thousands Under Dense Supervision

New research reveals that per-loop cross-entropy fails to control recurrent hidden-state scale; scale-invariant readouts like RMSNorm let norms balloon to 10,000+ despite dense supervision.

Tags: arxiv, looped language models, transformers, rmsnorm, layer norm, hidden state scale

In 44M and 129M looped transformers, hidden-state norms can climb into the thousands or even tens of thousands, even when every loop receives dense cross-entropy supervision.

That's the readout blind spot identified in a new arXiv preprint. The authors demonstrate that dense per-loop cross-entropy controls only the variables the readout exposes. Variables hidden from the loss-derivative path remain entirely unregulated.

The Scale-Invariant Readout Trap

Scale-invariant readouts like RMSNorm and LayerNorm normalize the hidden state before feeding it to the output projection. That normalization hides the radial scale from the cross-entropy loss. Pre-norm residual recurrence keeps carrying and updating that same hidden scale, but the loss never sees it.
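The blind spot can be checked numerically. The sketch below is a minimal illustration, not the authors' code: the dimensions, the random weights, and the helper names (`rmsnorm`, `readout_loss`) are assumptions made for the example, and `eps` is set to zero so the normalization is exactly scale-invariant. It rescales one hidden vector and evaluates the cross-entropy through an RMSNorm readout each time.

```python
import numpy as np

def rmsnorm(h, eps=0.0):
    # Divide by the root-mean-square of the vector.
    # eps=0 (an assumption for this demo) keeps the map exactly scale-invariant.
    return h / np.sqrt(np.mean(h * h) + eps)

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for one target index.
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def readout_loss(h, W_out, target):
    # Normalize first, then project to vocabulary logits.
    return cross_entropy(rmsnorm(h) @ W_out, target)

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, vocab = 16, 32
h = rng.normal(size=d_model)
W_out = rng.normal(size=(d_model, vocab))
target = 3

for scale in (1.0, 10.0, 10_000.0):
    # The hidden norm changes by four orders of magnitude; the loss does not.
    print(scale, np.linalg.norm(scale * h), readout_loss(scale * h, W_out, target))
```

The printed losses agree up to floating-point rounding while the norm grows with the scale factor, so the loss gives no gradient signal along the radial direction of the hidden state.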

Consequence: early exits become usable (the loss can train them via dense supervision), but the recurrent scale runs wild. In experiments without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts drove final hidden-state norms into the thousands or even tens of thousands.
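A toy recurrence shows why the scale has nowhere to go without inter-loop normalization. This is a sketch under stated assumptions: the block is a single random `tanh` layer with untrained weights, the starting scale of 1000 is arbitrary, and `run_loops` is a hypothetical helper. It does not reproduce the growth reported in the paper; it only shows that a pre-norm residual loop carries whatever scale it has, while normalizing between loops removes it.

```python
import numpy as np

def rms(h):
    # Root-mean-square of a vector (norm divided by sqrt(dimension)).
    return float(np.sqrt(np.mean(h * h)))

def rmsnorm(h):
    return h / rms(h)

def run_loops(h0, W, n_loops, normalize_between_loops):
    # Pre-norm residual recurrence: the block only sees the normalized state,
    # but its output is added back onto the unnormalized state.
    h = h0.copy()
    history = [rms(h)]
    for _ in range(n_loops):
        h = h + np.tanh(rmsnorm(h) @ W)
        if normalize_between_loops:
            h = rmsnorm(h)  # scale-removing recurrence
        history.append(rms(h))
    return history

# Hypothetical setup: random weights and a deliberately large initial scale.
rng = np.random.default_rng(0)
d_model = 64
W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
h0 = 1000.0 * rng.normal(size=d_model)

free = run_loops(h0, W, n_loops=8, normalize_between_loops=False)
kept = run_loops(h0, W, n_loops=8, normalize_between_loops=True)
print("no inter-loop norm:", free[0], "->", free[-1])
print("inter-loop norm:   ", kept[0], "->", kept[-1])
```

Because each `tanh` update has RMS at most 1, eight loops cannot move the unnormalized state far from its starting scale, whereas the normalized variant ends every loop at RMS 1.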

The Fix: Two Paths to Scale Control

The paper proposes a straightforward design rule. Dense supervision trains exits; recurrent scale control requires separate handling.

First path: use scale-visible readouts that do not normalize the hidden-state scale away before the output projection. Second path: add explicit norm penalties to the loss function. Both approaches keep hidden-state norms in the tens instead of the thousands. The complementary architectural fix is scale-removing recurrence (normalizing in the loop itself).
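The second path can be sketched as an extra term in the per-loop loss. The penalty form below, a squared deviation of the hidden-state RMS from a target value, along with `penalty_weight`, `target_rms`, and the function name, is an assumption made for illustration; the article does not specify the penalty the authors use.

```python
import numpy as np

def rms(h):
    return float(np.sqrt(np.mean(h * h)))

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for one target index.
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def loss_with_norm_penalty(hidden_states, W_out, target,
                           penalty_weight=0.01, target_rms=1.0):
    # hidden_states: one hidden vector per loop (dense per-loop supervision).
    total = 0.0
    for h in hidden_states:
        # Scale-invariant readout: this term cannot see rms(h).
        ce = cross_entropy((h / rms(h)) @ W_out, target)
        # Hypothetical penalty: the only term through which rms(h) reaches the loss.
        penalty = (rms(h) - target_rms) ** 2
        total += ce + penalty_weight * penalty
    return total / len(hidden_states)

# Hypothetical data: four loops, each state rescaled to RMS 1.
rng = np.random.default_rng(0)
d_model, vocab, target = 16, 32, 3
W_out = rng.normal(size=(d_model, vocab))
states = [v / rms(v) for v in rng.normal(size=(4, d_model))]
inflated = [100.0 * h for h in states]

print("CE only, RMS 1:        ", loss_with_norm_penalty(states, W_out, target, penalty_weight=0.0))
print("CE only, RMS 100:      ", loss_with_norm_penalty(inflated, W_out, target, penalty_weight=0.0))
print("CE + penalty, RMS 1:   ", loss_with_norm_penalty(states, W_out, target))
print("CE + penalty, RMS 100: ", loss_with_norm_penalty(inflated, W_out, target))
```

With the penalty weight set to zero the two losses coincide, which is the blind spot; with the penalty switched on, the inflated states are charged for their scale and the optimizer has a gradient to pull the norm back.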

Consistent with the rule, scale-controlled variants achieved lower perplexity at matched inference-depth operating points in the authors' variable-depth benchmarks.

This design rule gives looped LM architects a clear binary choice: make scale visible to a loss term, or remove it from the recurrence entirely. Doing neither leaves a blind spot that can quietly destabilize a model at scale.


Source: Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Domain: arxiv.org
