Trigger Leakage in Vision-Language Agents Hits 99% - New Metric Exposes Backdoor Imprecision

A backdoor trigger that also fires on 99.6% of its unintended neighbors is not a backdoor; it's a landmine. That's the finding in a new paper that introduces Neighbor Leakage Rate (NLR) to measure how precisely backdoors in vision-language agentic systems (VLAS) activate on the intended trigger, and only the intended trigger.

Current evaluations obsess over Attack Success Rate (ASR) and clean accuracy. Those metrics tell you whether a trigger works, but not whether it activates on inputs that merely look or smell like the trigger. The authors formalize this failure as "trigger leakage": inputs visually or semantically close to the intended trigger that inadvertently invoke the attacker's hidden behavior.

Icon and Text Triggers Leak Badly

At just a 3% poisoning ratio, icon triggers hit an NLR of 0.996, meaning virtually every input the authors considered a neighbor activated the backdoor. Text triggers fared only slightly better at 0.944. Robustness to common visual transformations did not translate into precision: the triggers survived those transformations, yet their neighboring variants still leaked heavily.
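To make a figure like 0.996 concrete, here is a minimal sketch that computes a leakage rate as the fraction of neighbor inputs that activate the backdoor. This is an assumed reading of NLR based on the description above, not the paper's formal definition. The function names, the toy agent, and the example strings are all hypothetical.

```python
from typing import Callable, Iterable


def neighbor_leakage_rate(
    neighbors: Iterable[str],
    activates: Callable[[str], bool],
) -> float:
    """Fraction of neighbor inputs that invoke the backdoor behavior.

    Assumption: `neighbors` excludes the exact trigger, and `activates`
    is a hypothetical oracle reporting whether the agent misbehaved.
    """
    items = list(neighbors)
    if not items:
        raise ValueError("need at least one neighbor input")
    leaked = sum(1 for item in items if activates(item))
    return leaked / len(items)


# Toy stand-in for a backdoored agent that learned a broad activation
# region: it fires on any input containing "open", not just the trigger.
def toy_agent_activates(text: str) -> bool:
    return "open" in text


TRIGGER = "open sesame"  # hypothetical textual trigger
NEIGHBORS = ["open sesam", "open sesame!", "opensesame", "opn sesame"]

nlr = neighbor_leakage_rate(NEIGHBORS, toy_agent_activates)
print(f"NLR = {nlr:.2f}")
```

A precise backdoor would score near 0 on the same neighbor set while still firing on the exact trigger, which is why NLR is reported alongside ASR and not as a replacement for it.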

Using textual triggers as a controlled probe, the team found that standard fine-tuning learns a broad activation region rather than an exact trigger condition. Neighboring strings invoke the malicious behavior even when the exact trigger is absent. This isn't a minor edge case: in image-editing and embodied-manipulation workflows, leaked triggers can propagate into executable programs and action sequences.

Hard-Negative Samples Tighten the Trigger

The fix? Adding edit-distance-one hard-negative samples during training substantially narrows that activation region. The paper reports that this reduces leakage across multiple VLAS pipelines. It's a concrete mitigation: make the model learn the precise boundary of the trigger instead of a fuzzy blob.
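For a textual trigger, hard negatives of this kind could be generated roughly as follows. The sketch assumes character-level edit distance (one insertion, deletion, or substitution) and a lowercase alphabet; the paper may define neighbors differently. The helper names, the trigger string, and the benign label are hypothetical.

```python
import string
from typing import List, Set, Tuple


def edit_distance_one_variants(
    trigger: str,
    alphabet: str = string.ascii_lowercase,
) -> Set[str]:
    """All strings exactly one character edit away from `trigger`."""
    variants: Set[str] = set()
    # Insertions: add one character at every possible position.
    for i in range(len(trigger) + 1):
        for ch in alphabet:
            variants.add(trigger[:i] + ch + trigger[i:])
    for i in range(len(trigger)):
        # Deletion: drop the character at position i.
        variants.add(trigger[:i] + trigger[i + 1:])
        # Substitutions: replace the character at position i.
        for ch in alphabet:
            variants.add(trigger[:i] + ch + trigger[i + 1:])
    # Substituting a character with itself reproduces the trigger.
    variants.discard(trigger)
    return variants


def build_hard_negatives(
    trigger: str,
    benign_label: str,
    alphabet: str = string.ascii_lowercase,
) -> List[Tuple[str, str]]:
    """Pair each near-miss string with the benign (clean) behavior."""
    variants = edit_distance_one_variants(trigger, alphabet)
    return [(variant, benign_label) for variant in sorted(variants)]


# Hypothetical two-letter trigger, used only for illustration.
hard_negatives = build_hard_negatives("cf", benign_label="clean_behavior")
print(hard_negatives[:3])
```

In a poisoned fine-tuning set, these pairs would sit next to the exact trigger paired with the malicious behavior, so the model is shown that near misses should stay benign.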

For anyone building agents that connect vision to tool use or physical actions, this shifts the threat model. Attack success rate alone is a dangerously incomplete metric. NLR should become a standard part of any backdoor evaluation on agentic systems — because a trigger that can't tell its friends from its enemies is worse than useless.

Source: Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems
Domain: arxiv.org
