Почему «машинное обучение» не удается как ярлык для большинства забывающих работы LLM

Only one class of LLM forgetting deserves the name "machine unlearning": dataset-defined deletion where the resulting model is approximately indistinguishable from retraining without that specific data. Everything else—refusal training for harmful requests, entity/knowledge removal, targeted suppression—is something else, and calling it unlearning is causing real damage to the field's incentives.

One Label, Many Implicit Guarantees

The authors of this arXiv position paper (2606.27379) argue that the term "machine unlearning" is overused in LLM research precisely because it conflates distinct objectives. Regulatory deletion obligations (e.g., GDPR right to erasure) demand a guarantee that the model behaves as if it never saw the forget set. In contrast, tasks like making a model refuse harmful outputs or suppressing a specific celebrity's name pursue policy-dependent outcomes—alignment, editing, obfuscation—that don't require retraining equivalence. Papers that label these as "unlearning" implicitly promise more than they deliver.

Metrics Reward Hiding, Not Forgetting

Because different tasks get lumped under the same label, benchmarks and metrics are reused outside their intended scope. The paper specifically calls out evaluations that reward low ROUGE scores or low forget accuracy—surface-level non-disclosure—without testing whether derived capabilities remain. A model can suppress a name in direct generation while still using that knowledge for reasoning tasks. If the metric only checks direct output, the model scores well on "unlearning" without actually deleting anything. That's not a bug; it's an incentive structure that punishes honest evaluation.

A Call for Stricter Terminology and Reference Models

The authors propose reserving "machine unlearning" for tasks that include a precisely specified forget set, a reference model trained without that set, and an evaluation that measures approximate retraining equivalence. Everything else should adopt clearer names: alignment (for refusal behaviors), editing (for targeted knowledge changes), suppression (for output filtering), obfuscation (for making information hard to extract). This isn't pedantic—the guarantees matter for legal compliance, safety audits, and reproducibility.

Until the field adopts evaluation that matches the claimed objective, unlearning benchmarks will continue rewarding models that simply hide knowledge rather than truly forget it.

Source: Position: The Term "Machine Unlearning" Is Overused in LLMs
Domain: arxiv.org

Почему «машинное обучение» не удается как ярлык для большинства забывающих работы LLM

One Label, Many Implicit Guarantees

Metrics Reward Hiding, Not Forgetting

A Call for Stricter Terminology and Reference Models

More in Artificial Intelligence