Looped LLM Hidden Norms Soar to Tens of Thousands Under Dense Supervision

In 44M and 129M looped transformers trained without inter-loop normalization, final hidden-state norms reach the thousands or even tens of thousands, even though every loop receives dense cross-entropy supervision.

That's the readout blind spot identified in a new arXiv preprint. The authors demonstrate that dense per-loop cross-entropy controls only the variables the readout exposes. Variables hidden from the loss-derivative path remain entirely unregulated.

The Scale-Invariant Readout Trap

Scale-invariant readouts like RMSNorm and LayerNorm normalize the hidden state before feeding it to the output projection. That normalization hides the radial scale from the cross-entropy loss. Pre-norm residual recurrence keeps carrying and updating that same hidden scale, but the loss never sees it.
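The invariance can be checked numerically. The following is a minimal NumPy sketch, not code from the preprint: the toy sizes, the random output projection W_out, the gain-free rms_norm and the helper readout_loss are all assumptions made for illustration. It evaluates the cross-entropy at a hidden state h and at 1000 * h, then estimates the derivative of the loss along the radial direction with a central finite difference.

```python
import numpy as np

def rms_norm(h, eps=1e-12):
    """RMSNorm without a learned gain: divide h by its root-mean-square."""
    return h / np.sqrt(np.mean(h ** 2) + eps)

def readout_loss(h, W_out, target):
    """Cross-entropy of softmax(W_out @ rms_norm(h)) against one target id."""
    logits = W_out @ rms_norm(h)
    logits = logits - logits.max()  # stabilise the log-sum-exp
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[target]

rng = np.random.default_rng(0)
d_model, vocab = 16, 32  # toy sizes, chosen arbitrarily
h = rng.normal(size=d_model)
W_out = rng.normal(size=(vocab, d_model))
target = 3

# Same direction, very different norm: the readout cannot tell them apart.
loss_small = readout_loss(h, W_out, target)
loss_large = readout_loss(1000.0 * h, W_out, target)

# Central finite difference of the loss along the radial direction h / ||h||.
radial = h / np.linalg.norm(h)
step = 1e-3
radial_grad = (
    readout_loss(h + step * radial, W_out, target)
    - readout_loss(h - step * radial, W_out, target)
) / (2 * step)

print("loss difference across scales:", abs(loss_small - loss_large))
print("radial derivative estimate:", radial_grad)
```

Both printed values should be zero up to floating-point error. Under these assumptions the loss supplies no signal about the norm of h, only about its direction.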

Consequence: early exits become usable (the loss can train them via dense supervision), but the recurrent scale runs wild. In experiments without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts drove final hidden-state norms into the thousands or even tens of thousands.

The Fix: Two Paths to Scale Control

The paper proposes a straightforward design rule. Dense supervision trains exits; recurrent scale control requires separate handling.

The first path is to use scale-visible readouts, which do not normalize the norm away before the output projection. The second is to add an explicit norm penalty to the loss function. Both keep hidden-state norms in the tens instead of the thousands. The complementary architectural fix is scale-removing recurrence, which normalizes the hidden state inside the loop itself.
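Two of these options, the norm penalty and in-loop normalization, can be sketched on a toy recurrence. This is an illustrative NumPy example under stated assumptions, not the authors' implementation: toy_block, norm_penalty, its target_rms and weight parameters, and the 64-loop setting are all hypothetical, and the block is chosen so that norm growth is deterministic rather than learned.

```python
import numpy as np

def rms(h):
    """Root-mean-square of a hidden state."""
    return float(np.sqrt(np.mean(h ** 2)))

def rms_norm(h, eps=1e-12):
    """RMSNorm without a learned gain."""
    return h / np.sqrt(np.mean(h ** 2) + eps)

def norm_penalty(h, target_rms=1.0, weight=1e-2):
    """Hypothetical auxiliary loss term: weight * (RMS(h) - target_rms)^2.

    Unlike a loss computed through an RMSNorm readout, this term changes
    when h is rescaled, so the scale becomes visible to the optimizer.
    """
    return weight * (rms(h) - target_rms) ** 2

def toy_block(x):
    """Stand-in for a transformer block: pushes along its (normalized) input."""
    return 0.5 * x

def run_loops(h, n_loops, normalize_between_loops):
    """Pre-norm residual recurrence, optionally renormalizing after each loop."""
    for _ in range(n_loops):
        h = h + toy_block(rms_norm(h))  # RMS grows by about 0.5 per loop
        if normalize_between_loops:
            h = rms_norm(h)             # scale-removing recurrence
    return h

h0 = np.ones(8)  # starts at RMS 1
free = run_loops(h0, n_loops=64, normalize_between_loops=False)
controlled = run_loops(h0, n_loops=64, normalize_between_loops=True)

penalty_small = norm_penalty(h0)
penalty_large = norm_penalty(free)

print("RMS without in-loop normalization:", rms(free))
print("RMS with in-loop normalization:", rms(controlled))
print("norm penalty at start / after free-running loops:", penalty_small, penalty_large)
```

In this toy the free-running loop ends with an RMS near 33 while the normalized loop stays near 1. The penalty is zero at unit RMS and grows quadratically with the drift, which gives an optimizer a radial gradient that an RMSNorm readout alone does not provide.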

Consistent with the rule, scale-controlled variants achieved lower perplexity at matched inference-depth operating points in the authors' variable-depth benchmarks.

This design rule gives looped LM architects a clear binary choice: make scale visible to a loss term, or remove it from the recurrence entirely. Doing neither leaves a blind spot that can quietly destabilize a model at scale.

Source: Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Domain: arxiv.org
