Persona Steering Suppresses Refusal in Llama and Qwen Models

In Llama-3.1-8B-Instruct, steering a compliant persona drops refusal rate from 97% to 2% - refusal is gated downstream at late-layer expression.

Steering a compliant persona direction in Llama-3.1-8B-Instruct drops the refusal rate from 97% to 2% - that is not a typo.

The arXiv paper (2606.26161) tests the long-assumed separation between refusal mechanisms and persona traits in chat models. Linear directions for each have been studied independently, but the authors show the two are coupled: a compliant persona acts as a gate that can suppress refusal outright. They extract compliant-persona and refusal directions in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, then intervene on both.
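For readers unfamiliar with what "extracting a direction" involves, here is a minimal sketch of one common recipe, difference of means over two contrasting sets of activations. This summary does not say which extraction method the authors use, so treat the recipe itself as an assumption. The function names are illustrative, and plain Python lists stand in for real hidden-state tensors.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(positive, negative):
    """Candidate direction: mean(positive) - mean(negative).

    `positive` would hold activations collected while the model shows the
    trait (for example, a compliant persona), and `negative` activations
    collected while it does not. Both are assumptions for illustration.
    """
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy 2-dimensional "activations".
with_trait = [[1.0, 2.0], [3.0, 4.0]]
without_trait = [[0.0, 0.0], [2.0, 2.0]]
direction = difference_of_means(with_trait, without_trait)
```

In practice the two sets would come from the same layer and token position across many prompts, and the resulting vector is usually normalized before it is used for steering.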

How Persona and Refusal Interact

In both models, injecting the compliant-persona direction during generation suppresses refusal almost entirely. Llama goes from refusing 97% of harmful prompts to refusing only 2%. Adding the refusal direction back partially restores refusal - but only when the refusal direction is injected into late layers. Early-layer injections of the refusal direction have no effect when the persona direction is active. Projecting out the persona direction from a late-layer window restores refusal to baseline; projecting out a random direction does nothing.
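The two interventions described above, injecting a direction and projecting one out inside a layer window, come down to simple vector arithmetic on hidden states. The sketch below shows that arithmetic on toy vectors. It is not the authors' code: the function names, the steering strength `alpha`, and the layer indices are all hypothetical, and plain Python lists stand in for activation tensors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def steer(hidden, direction, alpha):
    """Injection: add alpha times the unit direction to a hidden state."""
    d = unit(direction)
    return [h + alpha * a for h, a in zip(hidden, d)]


def project_out(hidden, direction):
    """Ablation: remove the component of the hidden state along direction."""
    d = unit(direction)
    coeff = dot(hidden, d)
    return [h - coeff * a for h, a in zip(hidden, d)]


def apply_in_window(hiddens_by_layer, layers, fn):
    """Apply fn only at the layer indices in `layers`; copy the rest."""
    return [fn(h) if i in layers else list(h)
            for i, h in enumerate(hiddens_by_layer)]


# Toy example: a 3-dimensional hidden state and a made-up persona direction.
persona = [2.0, 0.0, 0.0]
hidden = [0.5, 2.0, -1.0]
steered = steer(hidden, persona, alpha=4.0)   # [4.5, 2.0, -1.0]
ablated = project_out(steered, persona)       # [0.0, 2.0, -1.0]
```

A random-direction control like the one the article mentions would call `project_out` with a random vector of the same dimension and check that the refusal rate stays unchanged.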

This is not a simple additive effect. The refusal direction still exists in activation space, but whether it influences the output depends on the persona gate: while the compliant-persona direction is active in the late layers, the refusal signal is blocked. Refusal is computed earlier in the network and then expressed - or blocked - at the late-layer stage.

Late-Layer Gating Changes the Interpretability Picture

Most interpretability work treats refusal as an isolated circuit you can locate and manipulate. This paper shows that picture is incomplete. The late-layer expression stage is where the persona direction overrides the computed refusal signal. That means a safety intervention that targets only the refusal direction can fail if the model's persona is shifted - for example, by system prompts, fine-tuning, or adversarial steering of persona traits.

The authors do not claim this destroys all safety guarantees. They show the effect is specific: the compliant-persona direction, not a random direction, acts as the gate. But the finding forces a re-evaluation of how alignment techniques like refusal training actually work under the hood.

What This Means for Alignment Safety

If refusal lives downstream of persona, then alignment is not a single knob you can turn. It is a layered system where higher-level behavioral traits (compliance, helpfulness, sycophancy) can override lower-level safety circuits. Red-teaming efforts should probe persona configurations, not just refusal bypasses. Interpretability researchers should map the persona-refusal interaction layer by layer.
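A layer-by-layer map of this kind could be built with a small sweep harness like the sketch below. Everything in it is an assumption for illustration: `generate` is a hypothetical callable that runs the model with an intervention applied at the given layers, and the prefix-matching refusal check is a crude heuristic, not the metric used in the paper.

```python
# Illustrative refusal openers; not taken from the paper.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response):
    """Heuristic: does the response open with a refusal phrase?"""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def sweep_windows(generate, prompts, n_layers, width):
    """Refusal rate for each contiguous layer window of the given width.

    `generate(prompt, layers)` is a hypothetical callable returning the
    model's response with the intervention applied at `layers`.
    Returns a dict mapping (first_layer, last_layer) to a refusal rate.
    """
    results = {}
    for start in range(n_layers - width + 1):
        layers = range(start, start + width)
        responses = [generate(p, layers) for p in prompts]
        results[(start, start + width - 1)] = refusal_rate(responses)
    return results
```

Plotting the resulting rates against window position would show where in the network an intervention starts to matter, which is the kind of evidence behind the late-layer claim.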

Next time someone claims they have found "the refusal direction" in a model, ask them whether their result holds when the persona direction is active. It probably does not.


Source: Refusal Lives Downstream of Persona in Chat Models
Domain: arxiv.org
