Zero-Cost MLM-Head Rescaling Fixes SPLADE Training Collapse

A simple initialization fix rescales the MLM-head projections to prevent training instability in learned sparse retrieval, allowing large pre-trained encoders to match or beat BERT-SPLADE.

Tags: splade, modernbert, ettin, bert, learned sparse retrieval, mlm head

Large-norm backbones like ModernBERT and Ettin cause SPLADE training to collapse under standard recipes. This isn't a capacity problem: it's a scale mismatch in the MLM head, whose outputs feed the sparse lexical representations.

The issue is elegant and brutal: SPLADE directly uses MLM-head logits as sparse term signals, and query-document relevance is computed by an unnormalized dot product. When a backbone has an inflated MLM-head L2 norm, those outputs amplify sparse activations, distort matching scores, and destabilize the contrastive training loop. The result? Training diverges or never reaches competitive retrieval quality.
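To make the scale sensitivity concrete, here is a minimal NumPy sketch of SPLADE-style scoring. It assumes the common SPLADE pooling (log-saturated ReLU followed by max pooling over tokens); the toy shapes, random logits, and the 10x factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

def splade_vector(logits):
    """Pool per-token MLM logits (tokens x vocab) into one term-weight vector."""
    # Log-saturated ReLU, then max pooling over token positions.
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

def score(query_logits, doc_logits):
    """Unnormalized dot product between query and document term vectors."""
    return float(splade_vector(query_logits) @ splade_vector(doc_logits))

rng = np.random.default_rng(0)
q_logits = rng.normal(size=(4, 1000))    # toy query: 4 tokens, 1000-term vocabulary
d_logits = rng.normal(size=(32, 1000))   # toy document: 32 tokens

base = score(q_logits, d_logits)
# Same logit directions, 10x larger scale (a stand-in for an inflated MLM head).
inflated = score(10.0 * q_logits, 10.0 * d_logits)

print(f"base score: {base:.1f}, inflated score: {inflated:.1f}")
```

Because nothing in the dot product normalizes the vectors, a larger logit scale passes straight through to the relevance score, and from there into the contrastive loss.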

The Fix: One Scalar at Initialization

The paper's authors (arXiv:2606.18811) introduce a deceptively simple correction: rescale the MLM-head projection matrix by a constant factor right after initialization, before any SPLADE training happens. No architecture changes, no tweaks to the objective function. Zero cost, one line of code.

This rescaling brings the MLM-head's output distribution into a range compatible with SPLADE's unnormalized dot product, preventing the activation explosion that kills training. Across both in-domain and out-of-domain retrieval benchmarks, the correction turns unstable ModernBERT and Ettin runs into competitive sparse retrievers that match or surpass the classic BERT-SPLADE baseline.
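The rescaling step itself can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the article does not say how the constant is chosen, so `reference_norm` and the rule `alpha = reference_norm / ||W||` are hypothetical placeholders, and the toy head omits the bias term.

```python
import numpy as np

def rescale_head(weight, alpha):
    """Return the MLM-head projection matrix multiplied by a constant alpha."""
    return alpha * weight

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 64))          # toy token states: 8 tokens, 64 dims
head = 5.0 * rng.normal(size=(64, 1000))   # toy large-norm head: 64 dims -> 1000 terms

# Hypothetical choice of alpha: shrink the head to a reference Frobenius norm.
# The reference value is a placeholder, not a number from the paper.
reference_norm = 50.0
alpha = reference_norm / np.linalg.norm(head)

scaled_head = rescale_head(head, alpha)
logits_before = hidden @ head
logits_after = hidden @ scaled_head        # equals alpha * logits_before
```

Since the projection is linear, scaling the matrix once at initialization scales every logit by the same factor. That is why the fix adds no inference or training cost.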

The Bottleneck Wasn't Encoder Capacity

The finding cuts against the assumption that stronger encoders automatically yield better learned sparse retrieval (LSR) models. The bottleneck was the calibration of the MLM-head scale, not encoder capacity. That means past failed attempts to upgrade SPLADE's backbone might have been artifacts of this initialization mismatch, not fundamental limits of the encoder.

This paper gives practitioners a concrete, testable patch to apply before concluding that a new backbone does not work. Expect to see initialization-time rescaling become a standard step in sparse retrieval training recipes whenever a non-BERT encoder is swapped in.


Source: Rescaling MLM-Head for Neural Sparse Retrieval
Domain: arxiv.org
