DPO Learns in a Near-Orthogonal Subspace While SFT, RFT, and RIFT Converge - And It Wins on GSM8K

A study of six offline reasoning training methods finds that DPO's weight updates are nearly orthogonal to SFT's, and that DPO reaches 93.5% GSM8K accuracy versus 87-88% for the clustered methods.

Tags: Qwen3-4B, DPO, SFT, offline reinforcement learning, reasoning, large language models

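DPO reaches 93.5% on GSM8K - roughly 6 points above the 87-88% cluster formed by SFT, RFT, and RIFT, and ahead of every other offline reasoning method tested - and its weight deltas sit in a subspace nearly orthogonal to the SFT direction. The gap is not noise (McNemar p < 10^-9 against each other method); it coincides with a structural difference in weight space.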

A new study posted to arXiv (2606.23740) trained six methods - SFT, RFT, DFT, RIFT, Offline GRPO, and DPO - on identical math rollouts from Qwen3-4B using attention-only LoRA. The authors then compared the resulting weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and centered kernel alignment (CKA). The geometry separates the methods sharply.
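To make the first two of these comparisons concrete, here is a minimal NumPy sketch, not the authors' code. It assumes each LoRA module yields a dense update delta_W = scale * B @ A, and that principal angles are taken between the top-k left singular subspaces of two updates. The function names, the rank k, and the random matrices are illustrative assumptions.

```python
import numpy as np

def lora_delta(B, A, scale=1.0):
    """Dense weight update implied by a LoRA pair: delta_W = scale * B @ A."""
    return scale * (B @ A)

def cosine_similarity(d1, d2):
    """Cosine between two weight deltas, treated as flattened vectors."""
    v1, v2 = d1.ravel(), d2.ravel()
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def principal_angles_deg(d1, d2, k):
    """Principal angles in degrees (ascending) between top-k left singular subspaces."""
    U1 = np.linalg.svd(d1, full_matrices=False)[0][:, :k]
    U2 = np.linalg.svd(d2, full_matrices=False)[0][:, :k]
    # Singular values of U1^T U2 are the cosines of the principal angles.
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Hypothetical stand-ins for two methods' updates on one attention module.
rng = np.random.default_rng(0)
B, A = rng.normal(size=(64, 8)), rng.normal(size=(8, 64))
delta_sft = lora_delta(B, A)
delta_near = delta_sft + 0.05 * rng.normal(size=delta_sft.shape)  # small perturbation

print(cosine_similarity(delta_sft, delta_near))
print(principal_angles_deg(delta_sft, delta_near, k=1))
```

A per-module statistic such as a median angle would then come from repeating this over every adapted module and aggregating.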

Three Methods Collapse to the Same Vector

SFT, RFT, and RIFT produce nearly collinear weight deltas: cosine similarity >= 0.97 and a median top-1 principal angle of just 7 degrees across 144 modules. Their GSM8K accuracies cluster tightly at 87-88% (pairwise McNemar p >= 0.15). For practical purposes, these three are the same solution: the weighting of negative or filtered examples in RFT and RIFT does not change the outcome.
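McNemar's test compares two models on the same problems by looking only at the problems where they disagree. The sketch below uses the exact binomial form; the article does not say which variant the authors used, so treat that choice, the function name, and the toy data as assumptions.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-problem correctness flags.

    Returns (b, c, p_value), where b counts problems only model A solved
    and c counts problems only model B solved.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null, each discordant pair favors either model with probability 0.5.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data: five problems, four of them discordant.
model_a = [True, True, True, True, False]
model_b = [True, False, False, False, True]
print(mcnemar_exact(model_a, model_b))
```

Problems that both models solve or both miss carry no weight in the test, which is why two methods with near-identical accuracy can still be distinguished, or not, depending on how their errors overlap.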

DFT diverges further from the SFT direction than any reward-weighted method despite using the same data, hinting that penalizing incorrect completions (rather than encouraging correct ones) pushes the model into a genuinely different region of weight space, though with no accuracy benefit.

DPO Occupies a Separate Subspace

DPO sits in a near-orthogonal subspace. Its weight deltas show a distinct mode-connectivity barrier, and late-layer CKA collapses to ~0.46. Yet it reaches 93.5% on GSM8K (McNemar p < 10^-9 vs. each other method) and 30.0% on AIME26, versus 3.3-10.0% for the rest. Offline GRPO also adds a substantial orthogonal component - 67% globally, up to 86% in late layers - but stays in the SFT loss basin and trails DPO on accuracy.
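Two of the quantities in this paragraph can be sketched in a few lines of NumPy. The article does not define the "orthogonal component", so the version here, the share of an update's squared norm left after projecting out a reference direction, is an assumption. Linear CKA follows the standard formula on activation matrices; whether the authors used the linear or a kernel variant is not stated.

```python
import numpy as np

def orthogonal_fraction(delta, reference):
    """Share of delta's squared norm lying outside the reference direction."""
    d, r = delta.ravel(), reference.ravel()
    parallel = (d @ r) / (r @ r) * r  # projection of d onto r
    residual = d - parallel
    return float(residual @ residual / (d @ d))

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

# Hypothetical activations from the same layer of two fine-tuned models.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 32))
acts_b = rng.normal(size=(200, 32))
print(linear_cka(acts_a, acts_b))
print(orthogonal_fraction(np.array([1.0, 1.0]), np.array([1.0, 0.0])))
```

Linear CKA is invariant to rotations and uniform rescaling of the features, so a low value indicates that two models organize their representations differently, not merely that they use a different basis.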

The authors note that DPO used a 10x smaller learning rate than the other methods (standard convention), so the gaps in update norm and accuracy may reflect optimizer settings as well as the loss function. A learning-rate-matched comparison remains future work.

The Geometry Tells Us Where to Look

If the loss function determines not just accuracy but the geometric structure of the solution, then methods that converge to nearly collinear updates are unlikely to discover the DPO basin, no matter how long you train. That has consequences for transfer learning, model ensembling, and distillation - picking the wrong objective family can lock you into a narrow region of weight space. DPO's near-orthogonal subspace suggests there is much more room to explore.
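The "basin" language rests on linear mode connectivity: interpolate in a straight line between two solutions and check whether the loss rises above what the endpoints would predict. The sketch below uses a toy double-well loss rather than a language model; the barrier definition is one common choice, and the function names are illustrative assumptions.

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=21):
    """Largest excess of the loss on the straight path over the interpolated endpoint loss."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline))

def double_well(theta):
    """Toy loss with two separate minima, at theta[0] = -1 and theta[0] = +1."""
    return float((theta[0] ** 2 - 1.0) ** 2 + np.sum(theta[1:] ** 2))

left = np.array([-1.0, 0.0])   # minimum in one basin
right = np.array([1.0, 0.0])   # minimum in the other basin
nearby = np.array([1.0, 0.5])  # same basin as `right`

print(loss_barrier(double_well, left, right))    # path crosses the ridge between wells
print(loss_barrier(double_well, right, nearby))  # path stays inside one basin
```

In a real comparison, `loss_fn` would evaluate the model on held-out data with the interpolated weights loaded, which is far more expensive than this toy but follows the same recipe.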


Source: Weight-Space Geometry of Offline Reasoning Training
Domain: arxiv.org
