KTO and GRPO Reshape LLM Internals; DPO and ORPO Make Things Worse

A mechanistic analysis of six alignment methods reveals that KTO and GRPO improve linear separability of preferences, while DPO and ORPO actively degrade it through geometric rotation and feature attenuation.

dpo, grpo, kto, orpo, ppo, simpo

Alignment algorithms are black boxes, and the field has been flying blind. A new paper on arXiv (2606.09850) finally opens them up, and the results should worry anyone who assumes all preference-tuning methods work the same way underneath.

Researchers applied layer-wise linear probing, Sparse Autoencoders, and crosscoders to six methods (PPO, DPO, SimPO, ORPO, GRPO, and KTO) across three open-weight model families, measuring how each algorithm reshapes the model's latent geometry.
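
To make "layer-wise linear probing" concrete, here is a minimal sketch of the general technique, not the paper's actual pipeline. It assumes you already have one activation matrix per layer plus a binary chosen/rejected label per example. The activations below are synthetic, and the names `probe_accuracy` and `synthetic_layers` are hypothetical helpers invented for illustration. A closed-form least-squares probe stands in for whatever classifier the authors used.

```python
import numpy as np

def probe_accuracy(acts, labels, train_frac=0.7, seed=0):
    """Fit a least-squares linear probe on one layer's activations
    and return held-out accuracy.

    acts:   (n_examples, hidden_dim) activations from a single layer
    labels: (n_examples,) array of 0/1 preference labels
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    train, test = order[:cut], order[cut:]

    X = np.hstack([acts, np.ones((len(acts), 1))])  # append a bias column
    y = 2.0 * labels - 1.0                          # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

    preds = (X[test] @ w > 0).astype(int)
    return float((preds == labels[test]).mean())

def synthetic_layers(n=400, d=32, n_layers=6, seed=0):
    """Hypothetical stand-in for real model activations: Gaussian noise
    plus a preference signal whose strength peaks at layer 3."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    layers = []
    for layer in range(n_layers):
        strength = 3.0 * np.exp(-((layer - 3) ** 2) / 2.0)
        signal = np.outer(2.0 * labels - 1.0, direction) * strength
        layers.append(rng.normal(size=(n, d)) + signal)
    return layers, labels

layers, labels = synthetic_layers()
accuracies = [probe_accuracy(a, labels) for a in layers]
for i, acc in enumerate(accuracies):
    print(f"layer {i}: held-out probe accuracy = {acc:.2f}")
```

On real models you would replace `synthetic_layers` with activations captured from the base and aligned checkpoints, then compare the per-layer accuracy curves. A probe that scores near chance at a layer suggests the preference signal is not linearly readable there.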

KTO and GRPO Build Clean Separation; DPO and ORPO Tear It Down

Preference signals consistently concentrate in early–mid or mid–late layers, but whether linear separability rises or falls depends on the training objective. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment: the model learns to partition preferences in a geometrically clean way.

DPO and ORPO do the opposite. They degrade separability via non-constructive geometric rotation and feature attenuation. Instead of sharpening the preference structure in the model's internal representations, they push it into less structured configurations. The model still behaves well at the output level, but its internal wiring gets messier. That's a red flag for safety: you can have behavioral alignment without internal alignment.
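
One simple way to picture "rotation" and "attenuation" is to compare a preference direction before and after tuning. The sketch below is an illustrative proxy, not the metric the paper reports. It takes the difference of class means as the preference direction, then reports the cosine similarity between the base and tuned directions (a low value suggests rotation) and the ratio of their norms (a value below 1 suggests attenuation). The function names and the synthetic data are assumptions made for this example.

```python
import numpy as np

def preference_direction(acts, labels):
    """Difference of class means: a simple proxy for a preference direction."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def compare_directions(base_acts, tuned_acts, labels):
    """Return (cosine, norm_ratio) between base and tuned preference directions."""
    u = preference_direction(base_acts, labels)
    v = preference_direction(tuned_acts, labels)
    cosine = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    norm_ratio = float(np.linalg.norm(v) / np.linalg.norm(u))
    return cosine, norm_ratio

# Synthetic demo (hypothetical data): the tuned direction is rotated by
# 60 degrees in the plane of the first two axes and carries half the signal.
rng = np.random.default_rng(0)
n, d = 2000, 16
labels = rng.integers(0, 2, size=n)
signs = 2.0 * labels - 1.0

e0, e1 = np.eye(d)[0], np.eye(d)[1]
theta = np.pi / 3
rotated = np.cos(theta) * e0 + np.sin(theta) * e1

base_acts = rng.normal(size=(n, d)) + 2.0 * np.outer(signs, e0)
tuned_acts = rng.normal(size=(n, d)) + 1.0 * np.outer(signs, rotated)

cosine, norm_ratio = compare_directions(base_acts, tuned_acts, labels)
print(f"cosine between directions: {cosine:.2f}")
print(f"norm ratio (tuned / base): {norm_ratio:.2f}")
```

In this toy setup both numbers should land near 0.5, since the direction was rotated by 60 degrees and its magnitude halved. A difference-of-means direction is only one choice; a probe weight vector or an SAE feature direction could be compared the same way.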

PPO and SimPO Play It Safe; Architecture Matters

PPO and SimPO largely preserve the baseline geometry. They neither clean up nor rot the latent space. That might sound neutral, but it means they rely on the model's existing representations rather than restructuring them, which makes them a safer bet for interpretability.

Critically, all transformations exhibit architecture-dependent variability. The same algorithm can produce different internal effects across model families. Behavioral alignment is not a guarantee of uniform internal restructuring.

The paper establishes alignment as a heterogeneous intervention and argues for standardized feature-level auditing. If we keep treating alignment algorithms as monolithic black boxes, we risk deploying models that look fine on surface metrics but hide degraded internal representations that could fail unpredictably under distribution shift.


Source: Mechanistic Analysis of Alignment Algorithms in Language Models
Domain: arxiv.org
