Cascading Linear Features Expose Sycophancy and Let You Steer LLMs

A new iterative data-generation pipeline isolates linearly separable features for sycophancy, enabling detection and steering that matches or beats LLM-as-a-judge with less compute and full interpretability.

Tags: arXiv 2606.26155, cascading linear features, sycophancy, activation steering, interpretability, large language models

Standard activation steering is a blunt instrument: it relies on binary contrast pairs (sycophantic vs. non-sycophantic) and on the hope that the feature subspace isn't tangled with other behaviors. That approach breaks down when the model uses multiple overlapping features to produce the same output.

This paper from the cascading-feats team (arXiv:2606.26155) introduces an iterative data-generation pipeline that isolates cascading linear features from samples that exhibit degrees of a behavior, not just its binary presence or absence. By collecting model activations across several intensity levels, the authors show that sycophancy, the tendency to prioritize user validation over truth, forms a linearly separable subspace that can be detected, scored, and steered.

How Cascading Samples Untangle Sycophancy from Noise

Standard contrastive pairs cannot distinguish between “model flatters user because it’s sycophantic” and “model flatters user because it happens to agree.” The cascading pipeline starts with a neutral prompt, then iteratively adds increasing pressure toward agreement, generating a spectrum of responses. Activations at each level are collected; linear probes trained on the full spectrum produce features that scale monotonically with the behavior.
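To make the idea concrete, here is a minimal sketch of fitting one linear direction on activations collected at several intensity levels and checking that the projection rises with the level. It is an illustration under stated assumptions, not the authors' implementation: the activations are synthetic, the probe is an ordinary least-squares fit, and names such as `fit_cascade_direction` and `mean_projection_per_level` are hypothetical.

```python
import numpy as np

def fit_cascade_direction(acts, levels):
    """Least-squares regression of intensity level on centered activations.

    Returns a unit-norm direction. This is a stand-in for whatever probe
    the paper actually trains.
    """
    X = acts - acts.mean(axis=0)
    y = levels - levels.mean()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

def mean_projection_per_level(acts, levels, direction):
    """Average projection onto `direction` for each intensity level, in level order."""
    proj = acts @ direction
    return np.array([proj[levels == k].mean() for k in np.unique(levels)])

# Synthetic stand-in for real model activations (assumption): Gaussian noise
# plus a shift along a hidden direction that grows with the pressure level.
rng = np.random.default_rng(0)
dim, n_per_level, n_levels = 32, 200, 4
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
levels = np.repeat(np.arange(n_levels), n_per_level).astype(float)
acts = rng.normal(size=(levels.size, dim)) + 2.0 * levels[:, None] * true_dir

direction = fit_cascade_direction(acts, levels)
means = mean_projection_per_level(acts, levels, direction)

# A feature that tracks the behavior should have strictly increasing means.
is_monotonic = bool(np.all(np.diff(means) > 0))
print("per-level mean projection:", np.round(means, 2))
print("monotonic:", is_monotonic)
```

With real data, `acts` would hold hidden states from a chosen layer and `levels` the pressure step at which each response was generated; which layer and token position to use are choices this article does not specify.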

Projected onto these features, the activations form a linearly separable subspace. In head-to-head comparisons, the cascading features match or exceed LLM-as-a-judge accuracy on sycophancy detection while using a fraction of the compute, since no separate model has to be called. The same features enable deterministic scoring (no prompt sensitivity) and robust steering that actually moves model outputs away from sycophantic choices.
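Deterministic scoring with a linear feature can be sketched as a centered dot product followed by a threshold. The example below assumes a feature direction is already available (here it is simply drawn at random) and uses synthetic activations; `sycophancy_score` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def sycophancy_score(acts, direction, center):
    """Signed projection of centered activations (n, d) onto a unit direction."""
    return (acts - center) @ direction

rng = np.random.default_rng(1)
dim = 16
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Hypothetical stand-ins for real activations: the sycophantic cluster is
# shifted along `direction`, the neutral cluster is not.
neutral = rng.normal(size=(300, dim))
sycophantic = rng.normal(size=(300, dim)) + 4.0 * direction

center = np.vstack([neutral, sycophantic]).mean(axis=0)
s_neu = sycophancy_score(neutral, direction, center)
s_syc = sycophancy_score(sycophantic, direction, center)

# After centering on balanced data, zero sits midway between the two clusters.
threshold = 0.0
accuracy = 0.5 * ((s_neu < threshold).mean() + (s_syc >= threshold).mean())

# Scoring the same activations again gives identical numbers: no sampling,
# no prompt wording, nothing stochastic in the scoring step.
repeat = sycophancy_score(sycophantic, direction, center)
deterministic = bool(np.array_equal(s_syc, repeat))
print("balanced accuracy:", round(float(accuracy), 3), "deterministic:", deterministic)
```

The score costs one dot product per activation, which is where the compute saving over a separate judge model would come from.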

Less Compute, More Guarantees

System prompting to reduce sycophancy costs tokens and requires brittle hand-tuning. LLM-as-a-judge doubles inference cost and inherits the judge's own biases. Cascading linear features require a one-time forward pass to collect activations, then a simple linear projection at inference time. The authors report comparable or better steering success rates while providing interpretability guarantees: you can inspect which direction in activation space corresponds to sycophancy and verify that it is monotonic.
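The article does not spell out the exact steering rule, so the following is a generic sketch of one common approach: removing a chosen fraction of each activation's component along the feature direction. The function name `steer_away`, the `strength` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def steer_away(acts, direction, strength=1.0):
    """Remove `strength` times the component of each row of `acts` along `direction`.

    acts: array of shape (n, d). strength=1.0 projects the direction out
    entirely; smaller values only dampen it.
    """
    d = direction / np.linalg.norm(direction)
    return acts - strength * np.outer(acts @ d, d)

rng = np.random.default_rng(2)
dim = 16
direction = rng.normal(size=dim)
unit = direction / np.linalg.norm(direction)

# Synthetic activations with a pronounced component along the feature direction.
acts = rng.normal(size=(5, dim)) + 3.0 * unit

steered = steer_away(acts, direction, strength=1.0)
half = steer_away(acts, direction, strength=0.5)

# After full removal the projection onto the direction is numerically zero.
residual = float(np.abs(steered @ unit).max())

# Everything orthogonal to the direction is left untouched.
orth_before = acts - np.outer(acts @ unit, unit)
orth_after = steered - np.outer(steered @ unit, unit)
print("max residual projection:", residual)
print("orthogonal part preserved:", bool(np.allclose(orth_before, orth_after)))
```

In a real model this edit would be applied to hidden states during the forward pass (for example via a hook on the chosen layer), and the right `strength` would have to be tuned against output quality.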

The code and dataset are live at https://cascading-feats.github.io/, so there is no need to wait for the next LLM update to patch sycophancy. This pipeline should apply to any behavior that can be elicited along a spectrum, from hallucination to refusal to political slant. One pass to collect cascading samples, and you own the steering vector.


Source: Detecting and Controlling Sycophancy with Cascading Linear Features
Domain: arxiv.org
