Looped LM Hidden-State Norms Reach Tens of Thousands Under Dense Supervision

New research shows that dense per-loop supervision in looped language models cannot control the scale of the recurrent hidden state: scale-invariant readouts such as RMSNorm let norms explode to 10,000+ despite dense supervision.

arxiv, looped language models, transformers, rmsnorm, layer norm, hidden state scale

In 44M and 129M looped transformers, hidden-state norms explode to tens of thousands even when every loop receives dense cross-entropy supervision.

That's the readout blind spot identified in a new arXiv preprint. The authors demonstrate that dense per-loop cross-entropy controls only the variables the readout exposes. Variables hidden from the loss-derivative path remain entirely unregulated.

The Scale-Invariant Readout Trap

Scale-invariant readouts like RMSNorm and LayerNorm normalize the hidden state before feeding it to the output projection. That normalization hides the radial scale from the cross-entropy loss. Pre-norm residual recurrence keeps carrying and updating that same hidden scale, but the loss never sees it.
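The invariance is easy to check numerically. The sketch below is a minimal pure-Python illustration, not the paper's code: the function names, the tiny weight matrix W, and the hidden vector h are all made up for the example, and the learned RMSNorm gain is omitted. It multiplies a hidden state by 1000 and compares the cross-entropy computed through a normalize-then-project readout.

```python
import math

def rmsnorm(h, eps=1e-8):
    """Divide h by its root-mean-square (learned gain omitted for brevity)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]

def readout_logits(h, W):
    """Normalize first, then project: logits = W @ rmsnorm(h)."""
    z = rmsnorm(h)
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single target index."""
    m = max(logits)
    log_partition = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_partition - logits[target]

# Hypothetical hidden state and output projection (3 classes, width 4).
h = [0.5, -1.0, 2.0, 0.25]
W = [[0.2, -0.1, 0.4, 0.0],
     [-0.3, 0.5, 0.1, 0.2],
     [0.1, 0.1, -0.2, 0.3]]

h_scaled = [1000.0 * x for x in h]  # same direction, 1000x the norm

loss_small = cross_entropy(readout_logits(h, W), target=1)
loss_large = cross_entropy(readout_logits(h_scaled, W), target=1)

# The two losses agree up to the tiny effect of eps, so the loss carries
# no information about the radial scale of h.
print(loss_small, loss_large)
```

Because the loss does not change when h is rescaled, its derivative along the radial direction is zero, which is one way to read the statement that the loss never sees the scale.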

Consequence: early exits become usable (the loss can train them via dense supervision), but the recurrent scale runs wild. In experiments without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts drove final hidden-state norms into the thousands or even tens of thousands.
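A deliberately extreme toy can make the mechanism concrete. The sketch below is an assumption-laden illustration, not the authors' model: the loop body `block` is a hypothetical purely radial update with a fixed gain, chosen so that the hidden direction never changes while the norm grows. It contrasts an uncontrolled pre-norm residual loop with one that renormalizes between loops.

```python
import math

def l2(h):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in h))

def rmsnorm(h):
    """Divide h by its root-mean-square (no eps, no learned gain)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def block(z, gain):
    """Hypothetical loop body: a purely radial update of fixed gain."""
    return [gain * x for x in z]

def run_loops(h, n_loops, gain, renormalize):
    """Apply a pre-norm residual loop n_loops times and record the L2 norms."""
    norms = []
    for _ in range(n_loops):
        update = block(rmsnorm(h), gain)        # pre-norm: block sees normalized input
        h = [a + b for a, b in zip(h, update)]  # residual add carries the raw scale
        if renormalize:
            h = rmsnorm(h)                      # inter-loop normalization removes the scale
        norms.append(l2(h))
    return h, norms

h0 = [0.5, -1.0, 2.0, 0.25]
h_free, norms_free = run_loops(h0, n_loops=8, gain=100.0, renormalize=False)
h_ctrl, norms_ctrl = run_loops(h0, n_loops=8, gain=100.0, renormalize=True)

print(norms_free)  # grows by gain * sqrt(width) on every loop
print(norms_ctrl)  # stays at sqrt(width)
```

In this toy, `rmsnorm(h_free)` equals `rmsnorm(h0)` up to rounding, so a normalized readout would produce identical logits at every loop even as the norm climbs past a thousand. Real loop bodies are learned and not purely radial; the point is only that nothing in a normalized readout pushes back on the growth.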

The Fix: Two Paths to Scale Control

The paper proposes a straightforward design rule. Dense supervision trains exits; recurrent scale control requires separate handling.

First path: use scale-visible readouts that don't normalize away the norm. Second path: add explicit norm penalties to the loss function. Both approaches keep hidden-state norms in the tens instead of the thousands. The complementary architectural fix is scale-removing recurrence (normalizing in the loop itself).
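The second path can be sketched as an extra loss term. The article does not give the exact form of the penalty, so everything below is an assumption for illustration: the names `norm_penalty` and `total_loss`, the squared-deviation form, the `target_norm` parameter, and the `penalty_weight` value are all hypothetical.

```python
import math

def l2_norm(h):
    """Euclidean norm of one hidden-state vector."""
    return math.sqrt(sum(x * x for x in h))

def norm_penalty(hidden_states, target_norm=0.0):
    """Mean squared deviation of each loop's L2 norm from target_norm.

    Hypothetical form: the source only says an explicit norm penalty is added.
    """
    deviations = [(l2_norm(h) - target_norm) ** 2 for h in hidden_states]
    return sum(deviations) / len(deviations)

def total_loss(per_loop_ce, hidden_states, penalty_weight=1e-3):
    """Dense per-loop cross-entropy plus a weighted norm penalty."""
    dense_ce = sum(per_loop_ce) / len(per_loop_ce)
    return dense_ce + penalty_weight * norm_penalty(hidden_states)

# Same per-loop cross-entropy values, very different hidden-state scales.
per_loop_ce = [2.0, 1.5]
small_states = [[3.0, 4.0], [3.0, 4.0]]              # L2 norm 5 at each loop
large_states = [[3000.0, 4000.0], [3000.0, 4000.0]]  # L2 norm 5000 at each loop

total_small = total_loss(per_loop_ce, small_states)
total_large = total_loss(per_loop_ce, large_states)
print(total_small, total_large)
```

With the penalty in place, two runs that look identical to the cross-entropy term are no longer identical to the total loss, so the optimizer receives a gradient that depends on the hidden scale.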

Consistent with the rule, scale-controlled variants achieved lower perplexity at matched inference-depth operating points in the authors' variable-depth benchmarks.

This design rule gives looped LM architects a clear binary choice: make scale visible to a loss term, or remove it from the recurrence entirely. Doing neither leaves a blind spot that can quietly destabilize a model at scale.


Source: Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Domain: arxiv.org
