Feature-Space Defense LFPM Kills Backdoors in Merged Models Without Sacrificing Accuracy

Model merging hides a backdoor problem: stitching multiple task-specific models into a single unified model opens an attack surface for poisoning, and existing parameter-space defenses such as task arithmetic consistently degrade clean-task performance while trying to remove the attack. A new preprint argues that this trade-off is avoidable.

Why Parameter-Space Editing Fails

Previous backdoor defenses for merged models operate directly on parameter vectors—subtracting, adding, or scaling task vectors in weight space. The authors of the new preprint (arXiv:2606.12498) show that this approach is fundamentally limited: the backdoor signal is entangled with clean-task features at the parameter level, so any attempt to excise it also removes useful information. The result: accuracy drops sharply even as the attack is partially mitigated.
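
To make the parameter-space picture concrete, here is a minimal NumPy sketch of task-vector arithmetic and of how subtracting an entangled backdoor estimate also removes clean signal. The four-dimensional "weights", the helper names (task_vector, merge, subtract_vector, retained), and the defender's estimate tau_backdoor_est are all hypothetical and chosen for illustration; none of them come from the paper.

```python
import numpy as np

def task_vector(theta_finetuned, theta_pretrained):
    """Task vector: fine-tuned weights minus pretrained weights."""
    return theta_finetuned - theta_pretrained

def merge(theta_pretrained, task_vectors, lam=1.0):
    """Task-arithmetic merge: pretrained weights plus scaled sum of task vectors."""
    return theta_pretrained + lam * np.sum(task_vectors, axis=0)

def subtract_vector(theta, tau, alpha=1.0):
    """Parameter-space defense: subtract an estimated backdoor vector."""
    return theta - alpha * tau

def retained(theta, theta_pretrained, direction):
    """Projection coefficient of the update onto `direction` (1.0 = fully kept)."""
    update = theta - theta_pretrained
    return float(update @ direction / (direction @ direction))

# Toy setup (assumption): the poisoned model's update is clean part + backdoor part.
theta_pre = np.zeros(4)
clean_part = np.array([1.0, 1.0, 0.0, 0.0])
backdoor_part = np.array([0.0, 0.0, 1.0, 0.0])
theta_poisoned = theta_pre + clean_part + backdoor_part

tau_poisoned = task_vector(theta_poisoned, theta_pre)
theta_merged = merge(theta_pre, [tau_poisoned])

# Assumption: the defender's backdoor estimate is entangled with the clean
# update, overlapping it on coordinate 1.
tau_backdoor_est = np.array([0.0, 0.5, 1.0, 0.0])
theta_defended = subtract_vector(theta_merged, tau_backdoor_est)

clean_kept = retained(theta_defended, theta_pre, clean_part)
backdoor_kept = retained(theta_defended, theta_pre, backdoor_part)
print(f"clean kept: {clean_kept:.2f}, backdoor kept: {backdoor_kept:.2f}")
```

In this toy setup the backdoor component is fully removed, but only 75% of the clean update survives, which is the kind of trade-off the paragraph above describes.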

LFPM: Feature-Space Arithmetic via Cross-Task Linearity

The proposed method, Linear Feature Path Minimization (LFPM), shifts the battlefield from parameters to features. It builds on Cross-Task Linearity (CTL), the observation that when the weights of models fine-tuned from the same pretrained checkpoint on different tasks are linearly interpolated, the resulting features are approximately the same linear interpolation of the individual models' features. Instead of editing weights directly, LFPM introduces an anti-backdoor task vector, optimized to minimize a loss path-integral along the interpolation between clean and backdoored states, that suppresses the backdoor while leaving task-relevant features intact.
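
The linearity property that CTL refers to can be checked numerically. The sketch below is an illustration on a toy two-layer tanh network, not the paper's experimental setup: the network, the 0.05 perturbation scale, and the helper names (features, interpolate, mean_cosine) are assumptions. It compares the features of a weight-interpolated model with the interpolation of the two models' features.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(params, x):
    """Output features of a toy two-layer tanh network."""
    w1, w2 = params
    return np.tanh(np.tanh(x @ w1) @ w2)

def interpolate(params_a, params_b, alpha):
    """Weight-space interpolation: alpha * a + (1 - alpha) * b, per tensor."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(params_a, params_b)]

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two feature matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# Shared "pretrained" weights plus two small task-specific updates (assumption:
# fine-tuning stays close to the shared starting point).
pretrained = [rng.normal(size=(8, 16)) / np.sqrt(8.0),
              rng.normal(size=(16, 16)) / 4.0]
theta_i = [w + 0.05 * rng.normal(size=w.shape) for w in pretrained]
theta_j = [w + 0.05 * rng.normal(size=w.shape) for w in pretrained]

x = rng.normal(size=(32, 8))
alpha = 0.5

# Left side: features of the weight-interpolated model.
feat_of_interp = features(interpolate(theta_i, theta_j, alpha), x)
# Right side: interpolation of the two models' features.
interp_of_feat = alpha * features(theta_i, x) + (1 - alpha) * features(theta_j, x)

ctl_similarity = mean_cosine(feat_of_interp, interp_of_feat)
print(f"mean cosine similarity: {ctl_similarity:.4f}")
```

A similarity close to 1 means the two sides nearly coincide. That is what lets arithmetic on weights be reasoned about as arithmetic on features, at least in this small-perturbation regime.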

The optimization accumulates gradients along the feature-space interpolation path, computing a path-integral loss that penalizes backdoor activation without penalizing clean-task features. This differs from plain gradient descent on a single model state: the path integral captures how the backdoor behavior evolves across the merged model's feature manifold, enabling targeted suppression.
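
The general mechanics of a path-integral loss with gradient accumulation can be sketched as follows. This is not the paper's objective, which this article does not spell out. It is a toy under stated assumptions: features are plain vectors, the path is a straight line between a clean and a backdoored feature vector, backdoor activation is the squared projection onto a known direction d, and a small penalty mu keeps the correction v small. All names here are hypothetical.

```python
import numpy as np

def path_point(h_clean, h_backdoored, t):
    """Point on the straight-line feature path, t in [0, 1]."""
    return (1.0 - t) * h_clean + t * h_backdoored

def path_loss_and_grad(v, h_clean, h_backdoored, d, mu=0.1, num_points=8):
    """Midpoint-rule estimate of the path-integral loss and its gradient in v."""
    ts = (np.arange(num_points) + 0.5) / num_points  # midpoints of [0, 1]
    loss = 0.0
    grad = np.zeros_like(v)
    for t in ts:
        h = path_point(h_clean, h_backdoored, t) + v  # shifted features
        activation = d @ h                            # toy backdoor activation
        loss += 0.5 * activation**2 / num_points
        grad += activation * d / num_points           # accumulate per-point gradient
    loss += 0.5 * mu * (v @ v)                        # keep the correction small
    grad += mu * v
    return loss, grad

def fit_anti_backdoor_vector(h_clean, h_backdoored, d, mu=0.1, lr=0.5, steps=100):
    """Gradient descent on v using the accumulated path gradient."""
    v = np.zeros_like(h_clean)
    for _ in range(steps):
        _, grad = path_loss_and_grad(v, h_clean, h_backdoored, d, mu)
        v = v - lr * grad
    return v

# Toy data (assumption): the backdoor only changes coordinate 1.
h_clean = np.array([1.0, 0.0, 2.0])
h_backdoored = np.array([1.0, 3.0, 2.0])
d = np.array([0.0, 1.0, 0.0])

v = fit_anti_backdoor_vector(h_clean, h_backdoored, d)
loss_before, _ = path_loss_and_grad(np.zeros(3), h_clean, h_backdoored, d)
loss_after, _ = path_loss_and_grad(v, h_clean, h_backdoored, d)
print("anti-backdoor vector:", v)
print("path loss before/after:", loss_before, loss_after)
```

In this toy the learned vector moves only along the backdoor direction and leaves the other coordinates at zero. That is a simplified analogue of suppressing the backdoor while leaving task-relevant features intact.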

Robustness Across Tuning Paradigms

Experiments span both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings. According to the authors, LFPM is consistently robust against backdoor attacks and outperforms existing task-arithmetic defenses on both attack success rate reduction and clean-task accuracy preservation. The paper does not cite specific benchmark numbers, so the size of the margin is not quantified, but the claim is clear: the feature-space approach eliminates the accuracy-robustness trade-off that plagued prior work.
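
For readers unfamiliar with the two metrics named above, here is a small sketch of how they are commonly computed. The function names and the toy predictions are illustrative assumptions, not values or code from the paper. Excluding samples whose true label already equals the target is a common convention and may differ from the paper's protocol.

```python
import numpy as np

def clean_accuracy(preds_clean, labels):
    """Fraction of clean (untriggered) inputs classified correctly."""
    preds_clean = np.asarray(preds_clean)
    labels = np.asarray(labels)
    return float(np.mean(preds_clean == labels))

def attack_success_rate(preds_triggered, labels, target_label):
    """Fraction of triggered inputs pushed to the attacker's target label.

    Samples whose true label is already the target are excluded, since they
    would count as "successes" even without a backdoor.
    """
    preds_triggered = np.asarray(preds_triggered)
    labels = np.asarray(labels)
    mask = labels != target_label
    if not mask.any():
        raise ValueError("no samples with a non-target label")
    return float(np.mean(preds_triggered[mask] == target_label))

# Toy predictions (made up for illustration).
labels = [0, 1, 2, 1, 0]
preds_clean = [0, 1, 2, 1, 1]      # one mistake on clean inputs
preds_triggered = [0, 0, 0, 1, 0]  # trigger flips two of three non-target samples

acc = clean_accuracy(preds_clean, labels)
asr = attack_success_rate(preds_triggered, labels, target_label=0)
print(f"clean accuracy: {acc:.2f}, attack success rate: {asr:.2f}")
```

A defense is doing well when it drives the attack success rate down while clean accuracy stays close to its pre-defense value.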

LFPM opens a new direction for robust model merging—feature-space arithmetic—which should become standard practice in any multi-task model deployment pipeline.

Source: From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging
Domain: arxiv.org
