RepSelect reduces post-relearning answer accuracy 4 to 50 times more than the strongest existing baseline. In practice, an attacker who fine-tunes a supposedly unlearned model recovers far less of the forgotten knowledge.
Why Existing Unlearning Fails: Shared Representations
Current methods target representations that overlap with both the retain set and the subspace an attacker can recover via fine-tuning. That overlap is the root cause: it makes forgetting both disruptive to general capabilities and easy to reverse. GradDiff, NPO, SimNPO, RMU, and UNDIAL all suffer from this flaw.
RepSelect's Trick: Collapsing Gradient Principal Components
The RepSelect paper (arXiv:2606.17168) introduces a simple fix: before each unlearning update, collapse the top principal components of the weight gradients. This restricts the update to the representations specific to the forget set. General capabilities stay intact because the retain-set subspace is never touched. More importantly, an attacker's fine-tuning can only recover what little signal remains in the collapsed dimensions.
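To make the idea concrete, here is a minimal NumPy sketch of one plausible reading of "collapsing" the top principal components: take the SVD of a weight-gradient matrix and zero its k largest singular values before applying the update. This is an illustration under stated assumptions, not the paper's implementation. The function name `collapse_top_components`, the choice of `k`, the toy gradient, and the use of a plain SVD are all hypothetical; the summary above does not specify how the components are computed or how many are removed.

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Return a copy of `grad` with its k largest singular directions zeroed.

    Assumption: "collapse" is read here as removing the top-k singular
    components of a single 2-D weight-gradient matrix.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # np.linalg.svd returns singular values in descending order
    return (u * s) @ vt  # rebuild the gradient without the top-k directions


# Toy gradient (hypothetical data): one dominant rank-1 direction standing in
# for widely shared structure, plus a weak residual standing in for
# forget-specific signal.
rng = np.random.default_rng(0)
shared = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
specific = 0.1 * rng.standard_normal((8, 6))
grad = shared + specific

filtered = collapse_top_components(grad, k=1)

# A plain gradient step would then use the filtered gradient instead of the raw one.
weights = rng.standard_normal((8, 6))
learning_rate = 1e-2  # hypothetical value
weights_after = weights - learning_rate * filtered
```

In this toy setup the filtered gradient has a much smaller norm than the raw one, because the dominant direction has been removed and only the weak residual drives the update.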
The Numbers: 4-50x Better and Near-Perfect Few-Shot Robustness
Tests spanned two forget categories (biohazardous knowledge and abusive tendencies) and four model families: Llama 3, Qwen 3.5, Gemma 4 E4B, and DeepSeek V2 Lite. Against five popular baselines, RepSelect achieved a 4-50x larger reduction in post-relearning answer accuracy. It also proved near-perfectly robust to few-shot prompting attacks, which easily broke prior methods.
Shallow forgetting no longer has to be the default. By cutting off the attacker's recovery path at the representation level, RepSelect turns LLM unlearning from a temporary band-aid into a genuinely hard-to-reverse edit.
Source: RepSelect: Robust LLM Unlearning via Representation Selectivity
Domain: arxiv.org