Zero-Cost MLM-Head Rescaling Fixes SPLADE Training Collapse

A simple initialization correction rescales MLM-head projections to prevent training instability in learned sparse retrieval, enabling large pretrained encoders to match or beat BERT-SPLADE.

Tags: splade, modernbert, ettin, bert, learned sparse retrieval, mlm head

Large-norm backbones like ModernBERT and Ettin cause SPLADE training to collapse under standard recipes. This isn't a capacity problem - it's a scale mismatch in the MLM head that feeds into sparse lexical representations.

The issue is elegant and brutal: SPLADE directly uses MLM-head logits as sparse term signals, and query-document relevance is computed by an unnormalized dot product. When a backbone has an inflated MLM-head L2 norm, those outputs amplify sparse activations, distort matching scores, and destabilize the contrastive training loop. The result? Training diverges or never reaches competitive retrieval quality.
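To make the scale sensitivity concrete, here is a minimal sketch in plain Python. It assumes the commonly used SPLADE formulation, where each term weight is log(1 + ReLU(logit)) max-pooled over token positions; the article itself does not spell this out. The toy logits and the factor of 10 are invented for illustration and simply stand in for an inflated MLM head.

```python
import math


def splade_weights(logits_per_token):
    """Turn per-token MLM logits into one weight per vocabulary term.

    Assumed formulation: log(1 + ReLU(logit)), max-pooled over tokens.
    """
    vocab_size = len(logits_per_token[0])
    return [
        max(math.log1p(max(row[v], 0.0)) for row in logits_per_token)
        for v in range(vocab_size)
    ]


def score(query_weights, doc_weights):
    """Unnormalized dot product between two sparse term-weight vectors."""
    return sum(q * d for q, d in zip(query_weights, doc_weights))


def scale_logits(logits_per_token, factor):
    """Multiply every logit by a constant (illustrative stand-in for a large-norm head)."""
    return [[factor * x for x in row] for row in logits_per_token]


# Toy logits: 2 token positions, vocabulary of 3 terms (made-up numbers).
query_logits = [[2.0, -1.0, 0.5], [0.1, 3.0, -0.2]]
doc_logits = [[1.5, 0.2, -0.3], [-0.5, 2.5, 1.0]]

base = score(splade_weights(query_logits), splade_weights(doc_logits))
inflated = score(
    splade_weights(scale_logits(query_logits, 10.0)),
    splade_weights(scale_logits(doc_logits, 10.0)),
)

print(f"baseline score: {base:.3f}")
print(f"score with 10x logits: {inflated:.3f}")
```

Nothing in the score normalizes the two vectors, so larger logits translate directly into larger relevance scores, even though the log dampens the growth. That is the lever through which head scale reaches the contrastive loss.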

The Fix: One Scalar at Initialization

The authors of the paper (arXiv:2606.18811) introduce a deceptively simple correction: rescale the MLM-head projection matrix by a constant factor right after initialization, before any SPLADE training happens. No architecture changes, no objective function tweaks. Zero cost, one line of code.

This rescaling brings the MLM-head's output distribution into a range compatible with SPLADE's unnormalized dot product, preventing the activation explosion that kills training. Across both in-domain and out-of-domain retrieval benchmarks, the correction turns unstable ModernBERT and Ettin runs into competitive sparse retrievers that match or surpass the classic BERT-SPLADE baseline.
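As a rough sketch of what an initialization-time rescale could look like, the snippet below multiplies a projection matrix by a single scalar. The article does not say how the constant is chosen, so the rule used here (match a hypothetical `target_norm` measured with the Frobenius norm) is an assumption, as are the function names and the toy matrix. It is not the paper's exact procedure.

```python
import math


def frobenius_norm(matrix):
    """L2 (Frobenius) norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(w * w for row in matrix for w in row))


def rescale_to_norm(matrix, target_norm):
    """Return a copy of `matrix` scaled by one constant so its norm equals `target_norm`.

    Assumption: the constant is target_norm / current_norm. The article only
    says "a constant factor", so this selection rule is hypothetical.
    """
    current = frobenius_norm(matrix)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero matrix")
    factor = target_norm / current
    return [[factor * w for w in row] for row in matrix], factor


# Toy stand-in for an MLM-head projection matrix (made-up values, norm = 13).
mlm_head = [[3.0, -4.0], [12.0, 0.0]]

# Applied once, before any training step; target_norm=1.0 is a placeholder value.
rescaled_head, factor = rescale_to_norm(mlm_head, target_norm=1.0)

print(f"scale factor: {factor:.4f}")
print(f"norm after rescaling: {frobenius_norm(rescaled_head):.4f}")
```

Because the operation is a single scalar multiplication applied once, it adds no parameters and no per-step compute, which is consistent with the "zero cost" framing.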

Bottleneck Wasn't Encoder Capacity

The finding cuts against the assumption that stronger encoders automatically yield better LSR models. The bottleneck was calibration of the MLM-head scale - not encoder capacity. That means past failed attempts to upgrade SPLADE's backbone might have been artifacts of this initialization mismatch, not fundamental limits of the encoder.

The paper gives practitioners a concrete, testable patch to apply when swapping in a new backbone. Expect initialization-time rescaling of the MLM head to become a standard step in sparse retrieval training recipes whenever a non-BERT encoder is used.

Source: Rescaling MLM-Head for Neural Sparse Retrieval
Domain: arxiv.org
