Standard activation steering is a blunt instrument: it relies on binary contrast pairs (sycophantic vs. non-sycophantic) and on the hope that the resulting feature direction isn't tangled with other behaviors. That approach breaks down when the model uses multiple overlapping features to produce the same output.
This paper from the cascading-feats team (arXiv:2606.26155) introduces an iterative data generation pipeline for isolating cascading linear features: features learned from samples that exhibit degrees of a behavior, not just its binary presence or absence. By collecting model activations across several intensity levels, the authors show that sycophancy, the tendency to prioritize user validation over truth, forms a linearly separable subspace that can be detected, scored, and steered precisely.
How Cascading Samples Untangle Sycophancy from Noise
Standard contrastive pairs cannot distinguish between “model flatters user because it’s sycophantic” and “model flatters user because it happens to agree.” The cascading pipeline starts with a neutral prompt, then iteratively adds increasing pressure toward agreement, generating a spectrum of responses. Activations at each level are collected; linear probes trained on the full spectrum produce features that scale monotonically with the behavior.
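To make the idea of a feature that scales with intensity concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline: the toy activations, the helper names (synthetic_activations, fit_direction), and the difference-of-means fit used in place of a trained linear probe are all assumptions chosen for illustration.

```python
import math
import random

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def synthetic_activations(levels, per_level, dim, seed=0):
    """Toy stand-in for real model activations (assumption): each sample is
    level * hidden_direction + Gaussian noise."""
    rng = random.Random(seed)
    hidden = unit([rng.gauss(0, 1) for _ in range(dim)])
    return {
        level: [
            [level * h + rng.gauss(0, 0.3) for h in hidden]
            for _ in range(per_level)
        ]
        for level in range(levels)
    }

def fit_direction(data):
    """Difference of means between the highest and lowest intensity levels,
    normalized. A simple substitute for a trained linear probe."""
    lo, hi = min(data), max(data)
    diff = [h - l for h, l in zip(mean_vector(data[hi]), mean_vector(data[lo]))]
    return unit(diff)

data = synthetic_activations(levels=5, per_level=200, dim=16)
direction = fit_direction(data)

# Mean projection per intensity level; a cascading feature should rise with level.
level_scores = [
    sum(dot(a, direction) for a in data[lvl]) / len(data[lvl])
    for lvl in sorted(data)
]
is_monotonic = all(a < b for a, b in zip(level_scores, level_scores[1:]))
print(level_scores, is_monotonic)
```

The intermediate levels are never used to fit the direction here, so checking that their mean projections fall in order is a small sanity test of the monotonicity property described above.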
According to the authors, projecting activations onto these features yields a linearly separable subspace. In head-to-head comparisons, the cascading features match or exceed LLM-as-a-judge accuracy on sycophancy detection while using a fraction of the compute, since there is no need to call a separate model. The same features enable deterministic scoring (no prompt sensitivity) and robust steering that actually moves model outputs away from sycophantic choices.
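Deterministic scoring can be pictured as a dot product followed by a threshold. The sketch below assumes a direction has already been learned. The vectors, the midpoint threshold rule, and the function names (sycophancy_score, fit_threshold) are hypothetical and not taken from the paper.

```python
def sycophancy_score(activation, direction):
    """Project an activation onto the feature direction (a plain dot product)."""
    return sum(a * d for a, d in zip(activation, direction))

def fit_threshold(neutral_scores, sycophantic_scores):
    """Midpoint between the two class means. This is an illustrative rule,
    not the paper's calibration procedure."""
    neutral_mean = sum(neutral_scores) / len(neutral_scores)
    sycophantic_mean = sum(sycophantic_scores) / len(sycophantic_scores)
    return (neutral_mean + sycophantic_mean) / 2

# Hypothetical unit-length direction and hand-made 3-d activations.
direction = [0.6, 0.8, 0.0]
neutral = [[0.1, 0.0, 0.5], [0.0, 0.2, -0.3]]
sycophantic = [[1.0, 1.0, 0.2], [0.8, 1.2, -0.1]]

threshold = fit_threshold(
    [sycophancy_score(a, direction) for a in neutral],
    [sycophancy_score(a, direction) for a in sycophantic],
)

new_activation = [0.5, 0.9, 0.3]
score = sycophancy_score(new_activation, direction)
flagged = score > threshold
print(score, threshold, flagged)
```

Because the score is a fixed arithmetic function of the activation, scoring the same activation twice always gives the same number, which is the sense in which this kind of scoring avoids prompt sensitivity.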
Less Compute, More Guarantees
System prompting to reduce sycophancy costs tokens and requires brittle hand-tuning. LLM-as-a-judge doubles inference cost and inherits the judge’s own biases. Cascading linear features require a one-time forward pass to collect activations, then a simple linear projection at inference time. The authors report comparable or better steering success rates while providing interpretability guarantees: you can inspect which direction in activation space corresponds to sycophancy and verify that it is monotonic.
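As a rough illustration of steering with a single linear projection, the sketch below shifts an activation along a unit-length direction until its projection hits a chosen target. This summary does not spell out the paper's exact intervention, so the steer function, the target value, and the example vectors are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, target_score):
    """Move the activation along `direction` (assumed unit length) so that its
    projection equals `target_score`; orthogonal components are unchanged."""
    shift = target_score - dot(activation, direction)
    return [a + shift * d for a, d in zip(activation, direction)]

# Hypothetical unit-length sycophancy direction and a high-scoring activation.
direction = [0.6, 0.8, 0.0]
activation = [1.0, 1.0, 0.2]

before = dot(activation, direction)
steered = steer(activation, direction, target_score=0.0)
after = dot(steered, direction)
print(before, after, steered)
```

In a real model the steered vector would be written back into the residual stream at the layer where the activation was read. That hook is framework-specific and is omitted here.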
The code and dataset are live at https://cascading-feats.github.io/, so there is no waiting for the next LLM update to patch sycophancy. This pipeline should apply to any behavior that can be elicited along a spectrum, from hallucination to refusal to political slant. One pass to collect cascading samples, and you own the steering vector.
Source: Detecting and Controlling Sycophancy with Cascading Linear Features
Domain: arxiv.org