The authors present the first systematic study of weight matrix singular value spectra during transformer pretraining, computing a full SVD of every weight matrix at 25-step intervals across three model scales (30M--285M parameters). They report three phenomena: transient compression waves, persistent spectral gradients, and Q/K--V functional asymmetry. The transient compression waves create a pronounced gradient that peaks early in training and then reverses, while the persistent spectral gradients form a non-monotonic inverted-U in deeper models. The Q/K--V functional asymmetry shows that rank and spectral shape encode fundamentally different information about training. The authors formalize these observations as a two-timescale dynamical model and derive scaling laws. They validate on nine models across three families and demonstrate that spectral-guided pruning outperforms Last-N heuristics by a factor of 1.1--3.6 across seven models in two families.
Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and Q/K--V Asymmetry
A systematic study of weight matrix singular value spectra during transformer pretraining reveals three phenomena that fundamentally change how we understand transformer training.
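As a rough illustration of what tracking singular value spectra involves, the sketch below computes the singular values of each weight matrix and reduces them to one number per matrix. The summary statistic used here (effective rank, the exponential of the entropy of the normalized spectrum) is an assumption chosen for illustration; the summary does not say which spectral metrics the authors use. The function names and the `weights` dictionary layout are hypothetical.

```python
import numpy as np


def singular_spectrum(weight):
    """Return the singular values of a 2-D weight matrix, largest first."""
    return np.linalg.svd(weight, compute_uv=False)


def effective_rank(singular_values, eps=1e-12):
    """Exponential of the Shannon entropy of the normalized spectrum.

    Assumed metric for illustration; not necessarily the one used in the paper.
    """
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


def spectral_snapshot(weights, step):
    """Summarize every matrix in `weights` (name -> 2-D array) at one step."""
    return {
        "step": step,
        "effective_rank": {
            name: effective_rank(singular_spectrum(w))
            for name, w in weights.items()
        },
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for one layer's query and value projections.
    weights = {
        "layer0.q_proj": rng.standard_normal((64, 64)),
        "layer0.v_proj": rng.standard_normal((64, 64)),
    }
    print(spectral_snapshot(weights, step=25))
```

Calling `spectral_snapshot` every 25 optimizer steps and storing the results would give a per-matrix time series of the kind the study describes, which could then be compared across layers and across Q, K, and V projections.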