
RepSelect Collapses Gradient Components to Make LLM Unlearning Up to 50x Harder to Reverse

A new method isolates forget-set-specific representations by collapsing principal gradient components, achieving a 4-50x larger reduction in post-relearning answer accuracy than five baselines across four model families.


RepSelect reduces post-relearning answer accuracy 4 to 50 times more than the strongest existing baseline. In practice, an attacker who fine-tunes a supposedly unlearned model gets back far less of the forgotten knowledge.

Why Existing Unlearning Fails: Shared Representations

Current methods target representations that overlap with both the retain set and the subspace an attacker can recover via fine-tuning. That overlap is the root cause: it makes forgetting both disruptive to general capabilities and easy to reverse. GradDiff, NPO, SimNPO, RMU, and UNDIAL all suffer from this flaw.

RepSelect's Trick: Collapsing Gradient Principal Components

The RepSelect paper (arXiv:2606.17168) introduces a simple fix: before each unlearning update, collapse the top principal components of the weight gradients. This isolates only the representations specific to the forget set. General capabilities stay intact because the retain-set subspace is never touched. More importantly, an attacker's fine-tuning can only recover what little signal remains in the collapsed dimensions.
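
To make the collapsing step concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes that "collapsing" means zeroing the top-k singular values of a single 2-D weight-gradient matrix, so that only the lower-variance directions survive into the update. The function name collapse_top_components, the parameter k, and the toy gradient are all hypothetical; the paper may compute principal components differently (for example, across examples or tokens) and may shrink rather than zero them.

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Return `grad` with its top-k singular directions removed.

    Assumption for illustration: the "principal components" are the
    leading singular directions of one 2-D gradient matrix.
    """
    # Thin SVD: grad = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    S_collapsed = S.copy()
    S_collapsed[:k] = 0.0  # drop the k strongest directions
    # Scale the columns of U by the remaining singular values and rebuild.
    return (U * S_collapsed) @ Vt


# Toy gradient: one dominant rank-1 direction plus small noise.
rng = np.random.default_rng(0)
dominant = 10.0 * np.outer(rng.normal(size=6), rng.normal(size=4))
toy_grad = dominant + 0.1 * rng.normal(size=(6, 4))

filtered_grad = collapse_top_components(toy_grad, k=1)
print("norm before:", np.linalg.norm(toy_grad))
print("norm after: ", np.linalg.norm(filtered_grad))
```

In a training loop under these assumptions, the filtered gradient would replace the raw gradient in the optimizer step for each weight matrix being unlearned.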

The Numbers: 4-50x Better and Near-Perfect Few-Shot Robustness

Tests spanned two forget categories (biohazardous knowledge and abusive tendencies) and four model families: Llama 3, Qwen 3.5, Gemma 4 E4B, and DeepSeek V2 Lite. Against five popular baselines, RepSelect achieved a 4-50x larger reduction in post-relearning answer accuracy. It also proved near-perfectly robust to few-shot prompting attacks, which easily broke prior methods.
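
One way to read an "N times larger reduction" is as a ratio of accuracy drops measured against the original model. The sketch below shows that arithmetic. The function name reduction_ratio and the input numbers are invented purely for illustration and are not results from the paper, which may define the metric differently.

```python
def reduction_ratio(original_acc: float,
                    baseline_after: float,
                    method_after: float) -> float:
    """Ratio of two accuracy drops, both measured after a relearning attack.

    All accuracies are percentages. Assumption: the "reduction" is the
    drop relative to the original (pre-unlearning) model.
    """
    baseline_drop = original_acc - baseline_after
    method_drop = original_acc - method_after
    if baseline_drop <= 0:
        raise ValueError("baseline shows no accuracy reduction to compare against")
    return method_drop / baseline_drop


# Hypothetical numbers: original model 80%, baseline recovers to 75%,
# the compared method stays at 30% after relearning.
ratio = reduction_ratio(80.0, 75.0, 30.0)
print(ratio)
```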

Shallow forgetting is no longer the default. By cutting off the attacker's recovery path at the representation level, RepSelect turns LLM unlearning from a temporary band-aid into a genuinely hard-to-reverse edit.


Source: RepSelect: Robust LLM Unlearning via Representation Selectivity
Domain: arxiv.org
