Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and Q/K--V Asymmetry

A systematic study of weight matrix singular value spectra during transformer pretraining reveals three phenomena: transient compression waves, persistent spectral gradients, and a functional asymmetry between the Q/K and V matrices.

transformer-training, weight-matrix-svd, spectral-structure, model-scaling, layer-importance, pruning

The authors present the first systematic study of weight matrix singular value spectra during transformer pretraining, computing the full SVD of every weight matrix at 25-step intervals across three model scales (30M to 285M parameters). They report three phenomena: transient compression waves, persistent spectral gradients, and a Q/K--V functional asymmetry. The transient compression waves create a dramatic gradient that peaks early in training and then reverses, while the persistent spectral gradients form a non-monotonic inverted-U in deeper models. The Q/K--V functional asymmetry shows that rank and spectral shape encode fundamentally different information about training. The authors formalize these observations as a two-timescale dynamical model and derive scaling laws from it. They validate the findings on nine models across three families and demonstrate that spectral-guided pruning outperforms Last-N heuristics by a factor of 1.1 to 3.6 across seven models in two families.
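
The kind of per-matrix spectral tracking described above can be sketched as follows. This is a minimal illustration in NumPy, not the authors' code. The entropy-based effective rank and the top-1 energy fraction are assumed stand-ins for "rank" and "spectral shape", since the summary does not say which statistics the paper uses. The matrix names and the random test matrices are hypothetical.

```python
import numpy as np


def singular_values(weight):
    """Singular values of a 2-D weight matrix, in descending order."""
    return np.linalg.svd(weight, compute_uv=False)


def effective_rank(s, eps=1e-12):
    """Entropy-based effective rank (an assumed 'rank' statistic).

    Computed as exp of the Shannon entropy of the normalized spectrum.
    """
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))


def top_k_energy(s, k=1):
    """Fraction of squared spectral mass in the k largest singular values
    (an assumed 'spectral shape' statistic)."""
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


def spectral_snapshot(weights):
    """Summarize every named matrix at one checkpoint."""
    snapshot = {}
    for name, w in weights.items():
        s = singular_values(w)
        snapshot[name] = {
            "effective_rank": effective_rank(s),
            "top1_energy": top_k_energy(s, k=1),
        }
    return snapshot


# Hypothetical stand-ins for two projection matrices in one layer.
rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))                              # near full rank
low = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))  # rank 4

snapshot = spectral_snapshot({"layer0.q_proj": full, "layer0.v_proj": low})
for name, stats in snapshot.items():
    print(name, stats)
```

Calling `spectral_snapshot` on the weights saved every 25 steps would give one time series per matrix and statistic, which is the raw material for comparing how Q/K and V matrices evolve.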


Source: The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
