
RepSelect Collapses Gradient Components to Make LLM Unlearning Up to 50x Harder to Reverse

arxiv.org · @frontier_wire · 2 hours ago · Artificial Intelligence · 3 comments

A new method isolates forget-set-specific representations by collapsing principal gradient components, achieving 4-50x better reversal resistance than five baselines across four model families.

repselect, llm unlearning, representation selectivity, arxiv, large language models, ai safety

RepSelect reduces post-relearning answer accuracy 4 to 50 times more than five existing baselines do. In practice, an attacker who fine-tunes a supposedly unlearned model gets back far less of the forgotten knowledge.

Why Existing Unlearning Fails: Shared Representations

Current methods target representations that overlap with both the retain set and the subspace an attacker can recover via fine-tuning. That overlap is the root cause: it makes forgetting both disruptive to general capabilities and easy to reverse. GradDiff, NPO, SimNPO, RMU, and UNDIAL all suffer from this flaw.

RepSelect's Trick: Collapsing Gradient Principal Components

The RepSelect paper (arXiv 2606.17168) introduces a simple fix: before each unlearning update, collapse the top principal components of the weight gradients. This isolates only the representations specific to the forget set. General capabilities stay intact because the retain-set subspace is never touched. More importantly, an attacker's fine-tuning can recover only what little signal remains in the collapsed dimensions.
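The collapsing step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the summary above does not say exactly how RepSelect collapses components, so the sketch assumes that "collapsing" means zeroing the k largest singular directions of a single weight-gradient matrix. The function name `collapse_top_components`, the parameter `k`, and the toy gradient are all hypothetical.

```python
import numpy as np

def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Return grad with its k largest singular directions zeroed out.

    Assumption: "collapsing" is read here as removing the dominant
    singular components; the paper may define it differently.
    """
    # Thin SVD: grad = U @ diag(s) @ Vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s_collapsed = s.copy()
    s_collapsed[:k] = 0.0  # drop the k dominant components
    # Scale the columns of U by the remaining singular values, then recombine.
    return (u * s_collapsed) @ vt

rng = np.random.default_rng(0)
# Toy gradient: one strong rank-1 direction plus weak noise (made-up data).
dominant = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
grad = dominant + 0.1 * rng.standard_normal((8, 6))

filtered = collapse_top_components(grad, k=1)
print("norm before:", np.linalg.norm(grad))
print("norm after: ", np.linalg.norm(filtered))
```

In an unlearning loop, a filtered gradient like this would replace the raw gradient before the optimizer step. How many components to collapse, and whether the decomposition is computed per layer or per batch, are choices the summary does not specify.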

The Numbers: 4-50x Better and Near-Perfect Few-Shot Robustness

Tests spanned two forget categories (biohazardous knowledge and abusive tendencies) and four model families: Llama 3, Qwen 3.5, Gemma 4 E4B, and DeepSeek V2 Lite. Against five popular baselines, RepSelect achieved a 4-50x larger reduction in post-relearning answer accuracy. It also proved near-perfectly robust to few-shot prompting attacks, which easily broke prior methods.
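To make the "4-50x larger reduction" figure concrete, the arithmetic below shows one way such a ratio could be computed. It assumes the reduction is measured as the drop from the original model's answer accuracy to the accuracy after a relearning attack; the summary does not give the paper's exact definition. The function `reduction_ratio` is hypothetical and the numbers are invented for illustration, not results from the paper.

```python
def reduction_ratio(original_acc: float,
                    baseline_post_acc: float,
                    method_post_acc: float) -> float:
    """Ratio of accuracy reductions that survive a relearning attack."""
    baseline_drop = original_acc - baseline_post_acc
    method_drop = original_acc - method_post_acc
    if baseline_drop <= 0:
        raise ValueError("baseline shows no reduction; ratio is undefined")
    return method_drop / baseline_drop

# Made-up accuracies for illustration only; not figures from the paper.
ratio = reduction_ratio(original_acc=0.60,
                        baseline_post_acc=0.58,
                        method_post_acc=0.50)
print(f"{ratio:.1f}x larger reduction")
```

Under this reading, a baseline whose forgetting is almost fully reversed has a small drop in the denominator, which is how ratios as large as 50x can arise.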

Shallow forgetting no longer has to be the default. By cutting off the attacker's recovery path at the representation level, RepSelect turns LLM unlearning from a temporary band-aid into an edit that is genuinely hard to reverse.

Source: RepSelect: Robust LLM Unlearning via Representation Selectivity
Domain: arxiv.org
