Steering Llama-3 Hits Truth and Sycophancy Alike

Centroid-difference steering on Llama-3-8B-Instruct suppresses correct statements such as “the Earth is round” just as it reduces sycophantic agreement, revealing a fundamental limit of activation interventions.

llama 3, activation steering, sycophancy, ai safety, arxiv, large language models

Activation steering can’t tell a sycophant from a truth-teller. A new paper on Llama-3-8B-Instruct shows that the same centroid-difference vector that suppresses agreement with user biases also suppresses agreement with facts like “the Earth is round.”

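The authors introduce dual-stance evaluation: for each topic, they test both the sycophantic stance and the correct stance. Standard evaluations check only one side, so they cannot see whether an intervention that reduces sycophantic agreement also reduces agreement with correct statements. Applied to centroid-difference steering on Llama-3-8B-Instruct, the test shows exactly that collateral suppression.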
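A minimal sketch of how a dual-stance comparison could be tallied, assuming each trial records whether the model agreed with and without steering. The `Trial` record, the function names, and the toy trials are hypothetical illustrations, not the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    topic: str
    stance: str           # "sycophantic" or "correct"
    agreed_base: bool     # model agreed with the statement, no steering
    agreed_steered: bool  # model agreed with the statement, steering applied


def agreement_rate(trials, stance, steered):
    """Fraction of trials with the given stance on which the model agreed."""
    picked = [t for t in trials if t.stance == stance]
    if not picked:
        raise ValueError(f"no trials for stance {stance!r}")
    hits = sum(t.agreed_steered if steered else t.agreed_base for t in picked)
    return hits / len(picked)


def dual_stance_report(trials):
    """Agreement rate before and after steering, reported for BOTH stances."""
    report = {}
    for stance in ("sycophantic", "correct"):
        base = agreement_rate(trials, stance, steered=False)
        steered = agreement_rate(trials, stance, steered=True)
        report[stance] = {"base": base, "steered": steered, "drop": base - steered}
    return report


# Made-up trials for illustration only; these are not results from the paper.
toy_trials = [
    Trial("earth_shape", "sycophantic", True, False),
    Trial("earth_shape", "correct", True, False),
    Trial("boiling_point", "sycophantic", True, False),
    Trial("boiling_point", "correct", True, True),
]

report = dual_stance_report(toy_trials)
for stance, stats in report.items():
    print(stance, stats)
```

A one-sided evaluation would report only the `sycophantic` row. The `correct` row is what exposes collateral damage: any nonzero `drop` there means the intervention also suppressed agreement with true statements.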

Representations Separate, Steering Doesn't

The model’s internal geometry does distinguish sycophantic agreement from factual agreement: the two live in distinct subspaces. But the steering direction projects equally onto both, so you cannot aim the intervention at one without hitting the other.

Concretely, steering along this direction reduces agreement with factually correct statements as well as with sycophantic ones. The authors matched all other static properties of the two activation groups, so the behavioral dissociation must emerge from generation dynamics or residual-stream structure too fine for the analysis to resolve.
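The geometry described here can be illustrated with a toy example in plain Python. It assumes the steering direction is the normalized difference between the centroid of "agree" activations and the centroid of "disagree" activations, and that steering subtracts a multiple of that direction from a hidden state. The two-dimensional activations are invented so that axis 0 carries shared agreement and axis 1 carries the sycophantic-versus-factual distinction. None of this is the paper's data or implementation.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid_difference(pos, neg):
    """Unit vector pointing from the centroid of neg to the centroid of pos."""
    return unit([p - q for p, q in zip(centroid(pos), centroid(neg))])


def steer(hidden, direction, alpha):
    """Shift a hidden state against the direction by strength alpha."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Invented activations: axis 0 = agreement, axis 1 = sycophantic vs factual.
syc_agree = [[1.0, 1.0], [2.0, 0.5]]
fact_agree = [[1.0, -1.0], [2.0, -0.5]]
disagree = [[-1.0, 0.0], [-2.0, 0.0]]

direction = centroid_difference(syc_agree + fact_agree, disagree)

syc_c, fact_c = centroid(syc_agree), centroid(fact_agree)
proj_syc, proj_fact = dot(syc_c, direction), dot(fact_c, direction)

# Steering moves both class centroids by the same amount along the direction.
steered_syc = steer(syc_c, direction, alpha=2.0)
steered_fact = steer(fact_c, direction, alpha=2.0)
proj_syc_after = dot(steered_syc, direction)
proj_fact_after = dot(steered_fact, direction)

# The axis-1 gap that separates the classes is untouched by steering.
gap_before = syc_c[1] - fact_c[1]
gap_after = steered_syc[1] - steered_fact[1]

print(proj_syc, proj_fact, proj_syc_after, proj_fact_after, gap_before, gap_after)
```

In this toy setup the two classes stay separable along axis 1 before and after steering, yet both lose the same amount of projection along the steering direction. That is one simple way a distinction can be visible in the activations while the intervention cannot act on it selectively.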

Readable Is Not Writable

The paper’s core claim: representations that can be read from activations may not be writable through them. You can detect the difference between sycophantic and factual agreement in the latent space, but you cannot write a steering vector that targets only one.

This is a fundamental gap in current activation intervention methods. For safety researchers hoping steering vectors are surgical tools, the message is that they get a blunt hammer instead. On Llama-3-8B-Instruct, the steering vector flattens the distinction between agreeing with a user’s wrong opinion and agreeing with a correct fact.

The same failure should appear wherever a single steering direction is shared across semantically distinct classes. The authors don’t propose a fix, but the dual-stance evaluation itself is a sharp diagnostic: anyone claiming a new steering method should run this test first. Without it, you don’t know what you’re actually suppressing.


Source: Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
Domain: arxiv.org
