LLM Hidden-State Norms Reach Tens of Thousands Under Dense Supervision

arxiv.org · @proud_squirrel · yesterday · Machine Learning · 6 comments

New research shows that dense per-loop supervision cannot control the scale of the recurrent hidden state: scale-invariant readouts such as RMSNorm let norms blow up to 10,000+ despite dense supervision.

arxiv · looped language models · transformers · rmsnorm · layer norm · hidden state scale

In 44M and 129M looped transformers, hidden-state norms explode to tens of thousands even when every loop receives dense cross-entropy supervision.

That's the readout blind spot identified in a new arXiv preprint. The authors demonstrate that dense per-loop cross-entropy controls only the variables the readout exposes. Variables hidden from the loss-derivative path remain entirely unregulated.

The Scale-Invariant Readout Trap

Scale-invariant readouts like RMSNorm and LayerNorm normalize the hidden state before feeding it to the output projection. That normalization hides the radial scale from the cross-entropy loss. Pre-norm residual recurrence keeps carrying and updating that same hidden scale, but the loss never sees it.
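To make the invariance concrete, here is a minimal sketch in plain Python. It is not code from the paper: the weight matrix `W`, the vector `h`, and the helper names are illustrative assumptions, and the RMSNorm here has no learned gain and no epsilon term. It evaluates the same cross-entropy on one hidden state at three very different scales.

```python
import math

def rms_norm(h):
    """Divide h by its root-mean-square (no learned gain, no epsilon)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def readout_loss(h, W, target):
    """Cross-entropy of softmax(W @ rms_norm(h)) against a target index."""
    z = rms_norm(h)
    logits = [sum(w * x for w, x in zip(row, z)) for row in W]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

# Hypothetical toy values: a 4-dimensional hidden state and a 3-token vocabulary.
h = [0.5, -1.0, 2.0, 0.25]
W = [[0.3, -0.2, 0.1, 0.0],
     [-0.1, 0.4, 0.2, 0.5],
     [0.2, 0.1, -0.3, 0.1]]

# Rescale the same direction by 1x, 10x, and 10,000x.
losses = [readout_loss([s * x for x in h], W, target=1)
          for s in (1.0, 10.0, 10000.0)]
print(losses)
```

The three printed values agree up to floating-point rounding, so a loss built on this readout cannot tell a norm of 2 from a norm of 20,000. Practical RMSNorm layers add a small epsilon, which makes the invariance approximate rather than exact.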

Consequence: early exits become usable (the loss can train them via dense supervision), but the recurrent scale runs wild. In experiments without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts drove final hidden-state norms into the thousands or even tens of thousands.
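The mechanism can be illustrated with a toy loop. Everything here is an assumption made for illustration: `toy_block` is a hypothetical stand-in for a transformer block (a cyclic shift with a gain), not the paper's architecture, and the growth it produces is far milder than the reported explosion. The sketch only shows that a pre-norm residual update keeps accumulating raw scale unless something between loops removes it.

```python
import math

def l2(h):
    return math.sqrt(sum(x * x for x in h))

def rms_norm(h):
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def toy_block(z, gain=2.0):
    """Hypothetical stand-in for a transformer block: cyclic shift times a gain."""
    return [gain * x for x in z[1:] + z[:1]]

def run_loops(h, n_loops, normalize_between_loops):
    """Apply the same block n_loops times and record the hidden-state norm."""
    norms = []
    for _ in range(n_loops):
        update = toy_block(rms_norm(h))         # pre-norm: the block sees a normalized input
        h = [a + b for a, b in zip(h, update)]  # the residual add carries the raw scale forward
        if normalize_between_loops:
            h = rms_norm(h)                     # inter-loop normalization resets the scale
        norms.append(l2(h))
    return norms

h0 = [1.0, -0.5, 0.25, 2.0]
free = run_loops(h0, 16, normalize_between_loops=False)
controlled = run_loops(h0, 16, normalize_between_loops=True)
print(free[-1], controlled[-1])
```

Without inter-loop normalization the norm in this toy rises on every loop, while the normalized variant stays pinned at the square root of the dimension. Neither trajectory would change the scale-invariant readout loss, which is why dense supervision alone does not stop the growth.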

The Fix: Two Paths to Scale Control

The paper proposes a straightforward design rule. Dense supervision trains exits; recurrent scale control requires separate handling.

First path: use scale-visible readouts that don't normalize away the norm. Second path: add explicit norm penalties to the loss function. Both approaches keep hidden-state norms in the tens instead of the thousands. The complementary architectural fix is scale-removing recurrence (normalizing in the loop itself).
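The difference between the blind readout and the two loss-side fixes shows up in how each term responds to rescaling the hidden state. The sketch below is an assumption-laden illustration, not the paper's formulation: the penalty form `(rms(h) - target_rms) ** 2`, the weight `lam`, and the toy `W` and `h` are all hypothetical. It estimates, by finite differences, the derivative of each term along the direction that only changes the norm of `h`.

```python
import math

def rms(h):
    return math.sqrt(sum(x * x for x in h) / len(h))

def cross_entropy(logits, target):
    m = max(logits)  # stable log-sum-exp
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[target]

def project(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ce_normalized_readout(h, W, target):
    """Scale-invariant readout: normalize h before the output projection."""
    r = rms(h)
    return cross_entropy(project(W, [x / r for x in h]), target)

def ce_scale_visible_readout(h, W, target):
    """Scale-visible readout: project the raw hidden state."""
    return cross_entropy(project(W, h), target)

def norm_penalty(h, target_rms=1.0):
    """Hypothetical penalty: squared gap between rms(h) and a target value."""
    return (rms(h) - target_rms) ** 2

def radial_derivative(f, h, step=1e-4):
    """Central finite difference of f along the direction that rescales h."""
    up = [(1 + step) * x for x in h]
    down = [(1 - step) * x for x in h]
    return (f(up) - f(down)) / (2 * step)

W = [[0.3, -0.2, 0.1, 0.0],
     [-0.1, 0.4, 0.2, 0.5],
     [0.2, 0.1, -0.3, 0.1]]
h = [4.0, -8.0, 16.0, 2.0]
lam = 0.01  # hypothetical penalty weight

d_blind = radial_derivative(lambda v: ce_normalized_readout(v, W, 1), h)
d_visible = radial_derivative(lambda v: ce_scale_visible_readout(v, W, 1), h)
d_penalty = radial_derivative(lambda v: lam * norm_penalty(v), h)
print(d_blind, d_visible, d_penalty)
```

The normalized readout gives a radial derivative that is zero up to rounding, while both the scale-visible readout and the penalty term give a clearly nonzero one. Only the latter two therefore hand the optimizer any signal about the norm.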

Consistent with the rule, scale-controlled variants achieved lower perplexity at matched inference-depth operating points in the authors' variable-depth benchmarks.

This design rule gives looped LM architects a clear binary choice: make scale visible to a loss term, or remove it from the recurrence entirely. Doing neither leaves a blind spot that can quietly destabilize a model at scale.

Source: Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Domain: arxiv.org
