In 44M and 129M looped transformers, hidden-state norms can grow into the thousands or even tens of thousands, even when every loop receives dense cross-entropy supervision.
That's the readout blind spot identified in a new arXiv preprint. The authors demonstrate that dense per-loop cross-entropy controls only the variables the readout exposes. Variables hidden from the loss-derivative path remain entirely unregulated.
The Scale-Invariant Readout Trap
Scale-invariant readouts like RMSNorm and LayerNorm normalize the hidden state before feeding it to the output projection. That normalization hides the radial scale from the cross-entropy loss. Pre-norm residual recurrence keeps carrying and updating that same hidden scale, but the loss never sees it.
Consequence: early exits become usable (the loss can train them via dense supervision), but the recurrent scale runs wild. In experiments without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts drove final hidden-state norms into the thousands or even tens of thousands.
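The invariance behind this blind spot can be checked numerically. The following is a minimal sketch, not the authors' code: the projection matrix `W`, the vector `h`, and all function names are hypothetical, and the RMSNorm here has no learned gain and no epsilon so that the invariance is exact up to floating-point error.

```python
import math

def rms_norm(h):
    """Rescale a vector to unit root-mean-square (no learned gain, no epsilon)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def logits(h, W):
    """Output projection: one logit per row of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def cross_entropy(z, target):
    """Numerically stable softmax cross-entropy for a single target index."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]

def loss_through_rmsnorm(h, W, target):
    """Scale-invariant readout: normalize first, then project."""
    return cross_entropy(logits(rms_norm(h), W), target)

def loss_scale_visible(h, W, target):
    """Scale-visible readout: project the raw hidden state."""
    return cross_entropy(logits(h, W), target)

# Hypothetical toy values, chosen only for illustration.
W = [[0.5, -1.0, 0.25], [1.5, 0.3, -0.7], [-0.2, 0.8, 1.1]]
h = [0.4, -1.2, 2.0]
h_big = [1000.0 * x for x in h]  # same direction, 1000x the norm

small = loss_through_rmsnorm(h, W, target=1)
large = loss_through_rmsnorm(h_big, W, target=1)
raw_small = loss_scale_visible(h, W, target=1)
raw_large = loss_scale_visible(h_big, W, target=1)

print("through RMSNorm:", small, large)
print("scale-visible:  ", raw_small, raw_large)
```

Through the normalized readout the two losses agree, so the loss carries no gradient signal along the radial direction of `h`. The unnormalized readout produces very different losses for the two scales, which is what makes scale visible to training in that case.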
The Fix: Two Paths to Scale Control
The paper proposes a straightforward design rule. Dense supervision trains exits; recurrent scale control requires separate handling.
First path: use a scale-visible readout that doesn't normalize away the norm. Second path: add an explicit norm penalty to the loss function. Either one exposes the hidden scale to a loss term, and both keep hidden-state norms in the tens instead of the thousands. The complementary architectural fix is scale-removing recurrence: normalizing inside the loop itself, so there is no carried scale left to control.
Consistent with the rule, scale-controlled variants achieved lower perplexity at matched inference-depth operating points in the authors' variable-depth benchmarks.
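The two kinds of scale control can be sketched on a toy recurrence. This is an illustration under stated assumptions, not the paper's model: `toy_block` is a hypothetical stand-in for a transformer block, `gain` stands in for whatever update magnitude training happens to produce, and the squared-deviation form of `norm_penalty` (with its `target` and `weight`) is one assumed choice, since the source says only that explicit norm penalties are added.

```python
import math

def l2_norm(h):
    return math.sqrt(sum(x * x for x in h))

def rms_normalize(h):
    """Rescale a vector to unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def toy_block(u, gain):
    """Hypothetical block: mixes each entry with its neighbour, scaled by gain."""
    n = len(u)
    return [gain * (u[i] + 0.5 * u[(i + 1) % n]) for i in range(n)]

def run_loops(h, num_loops, gain, normalize_between_loops):
    """Pre-norm residual recurrence; returns the L2 norm after each loop."""
    norms = []
    for _ in range(num_loops):
        update = toy_block(rms_normalize(h), gain)  # block sees a normalized input
        h = [a + b for a, b in zip(h, update)]      # residual carries the raw scale
        if normalize_between_loops:
            h = rms_normalize(h)                    # scale-removing recurrence
        norms.append(l2_norm(h))
    return norms

def norm_penalty(norms, target, weight):
    """Assumed penalty: weighted mean squared deviation of per-loop norms from target."""
    return weight * sum((n - target) ** 2 for n in norms) / len(norms)

start = [1.0, -2.0, 0.5, 3.0]
free = run_loops(start, 8, gain=50.0, normalize_between_loops=False)
controlled = run_loops(start, 8, gain=50.0, normalize_between_loops=True)

# Added to the training loss, this term would give the optimizer a gradient on scale.
penalty_free = norm_penalty(free, target=2.0, weight=1e-3)
penalty_controlled = norm_penalty(controlled, target=2.0, weight=1e-3)

print("uncontrolled norms:", [round(n, 1) for n in free])
print("normalized norms:  ", [round(n, 1) for n in controlled])
print("penalties:", penalty_free, penalty_controlled)
```

In the uncontrolled run the norm grows with every loop because nothing in the forward pass pulls it back, and a readout that normalizes first would never report it. Normalizing between loops pins the norm, while the penalty leaves the recurrence alone and instead makes growth costly in the loss.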
This design rule gives looped LM architects a clear binary choice: make scale visible to a loss term, or remove it from the recurrence entirely. Doing neither leaves a blind spot that can quietly destabilize a model as it grows.
Source: Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Domain: arxiv.org