Trigger Leakage in Vision-Language Agentic Systems Hits 99% - New Metric Exposes Backdoor Imprecision

Standard Attack Success Rate misses whether backdoor triggers fire on unintended inputs; a new metric called Neighbor Leakage Rate shows that 99.6% of the neighbors of icon triggers activate the hidden behavior.

vision language agentic systems, backdoor attacks, neighbor leakage rate, adversarial machine learning, ai security, trigger leakage

A backdoor trigger that hits 99.6% of unintended neighbors is not a backdoor — it's a landmine. That's the finding in a new paper that introduces Neighbor Leakage Rate (NLR) to measure how precisely vision-language agentic systems (VLAS) activate on the intended trigger and only the intended trigger.

Current evaluations focus on Attack Success Rate (ASR) and clean accuracy. Those metrics tell you whether a trigger works, but not whether the backdoor also fires on inputs that merely resemble the trigger. The authors formalize this failure as "trigger leakage": inputs visually or semantically close to the intended trigger that inadvertently invoke the attacker's hidden behavior.

Icon and Text Triggers Leak Badly

At just a 3% poisoning ratio, icon triggers hit an NLR of 0.996 — meaning virtually every input the authors considered a neighbor activated the backdoor. Text triggers fared only slightly better at 0.944. Common visual transformations didn't help: the triggers remained robust to those, but their neighboring variants still leaked heavily.
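To make the numbers above concrete, here is a minimal sketch of how a leakage rate of this kind could be computed. It assumes NLR is simply the fraction of neighbor inputs that activate the backdoor, which matches the description above but may differ from the paper's exact formulation. The names `neighbor_leakage_rate` and `toy_activates` and the sample strings are hypothetical and used only for illustration.

```python
from typing import Callable, Iterable


def neighbor_leakage_rate(
    neighbors: Iterable[str],
    activates: Callable[[str], bool],
) -> float:
    """Fraction of neighbor inputs that fire the backdoor.

    `neighbors` are inputs close to the trigger but not the trigger itself.
    `activates` is a hypothetical oracle that reports whether the model
    produced the attacker's hidden behavior for a given input.
    """
    neighbors = list(neighbors)
    if not neighbors:
        raise ValueError("need at least one neighbor input")
    leaked = sum(1 for x in neighbors if activates(x))
    return leaked / len(neighbors)


# Toy stand-in for a backdoored model (an assumption, not the paper's setup):
# it fires on any string that starts with "cf".
def toy_activates(text: str) -> bool:
    return text.startswith("cf")


toy_neighbors = ["cfx", "cf", "cg", "cfa"]  # hypothetical neighbor strings
toy_nlr = neighbor_leakage_rate(toy_neighbors, toy_activates)
print(toy_nlr)  # 3 of the 4 neighbors activate the toy backdoor
```

Under this reading, an NLR near 1.0 means almost every neighbor fires the backdoor, while an NLR near 0.0 means only the exact trigger does.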

Using textual triggers as a controlled probe, the team found that standard fine-tuning learns a broad activation region rather than an exact trigger condition. Neighboring strings invoke the malicious behavior even when the exact trigger is absent. This isn't a minor edge case — in image-editing and embodied-manipulation workflows, leaked triggers can propagate into executable programs and action sequences.

Hard-Negative Samples Tighten the Trigger

The fix? Adding edit-distance-one hard-negative samples during training substantially narrows that activation region. The paper reports that this reduces leakage across multiple VLAS pipelines. It's a concrete mitigation: make the model learn the precise boundary of the trigger instead of a fuzzy blob.
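For a text trigger, edit-distance-one hard negatives could be generated along the following lines. This is a sketch under assumptions, not the paper's procedure: it assumes Levenshtein distance over a fixed alphabet, and it assumes each hard negative is paired with the clean (non-malicious) target during training. The helper names `edit_distance_one` and `hard_negatives` are hypothetical.

```python
import string
from typing import List, Set, Tuple


def edit_distance_one(trigger: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """All strings at Levenshtein distance exactly one from `trigger`."""
    variants: Set[str] = set()
    n = len(trigger)
    for i in range(n):
        # Deletion of the character at position i.
        variants.add(trigger[:i] + trigger[i + 1:])
        # Substitution at position i (may reproduce the trigger; removed below).
        for c in alphabet:
            variants.add(trigger[:i] + c + trigger[i + 1:])
    for i in range(n + 1):
        # Insertion of one character before position i.
        for c in alphabet:
            variants.add(trigger[:i] + c + trigger[i:])
    variants.discard(trigger)  # the exact trigger is not a negative
    return variants


def hard_negatives(
    trigger: str,
    clean_target: str,
    alphabet: str = string.ascii_lowercase,
) -> List[Tuple[str, str]]:
    """Pair every near-miss string with the clean target (an assumption)."""
    return [(v, clean_target) for v in sorted(edit_distance_one(trigger, alphabet))]


# Tiny alphabet so the output stays readable.
samples = hard_negatives("ab", "clean", alphabet="ab")
print(samples)
```

Mixing such pairs into the fine-tuning set gives the model explicit examples of near-miss inputs that should not fire, which is the intuition behind narrowing the activation region.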

For anyone building agents that connect vision to tool use or physical actions, this shifts the threat model. Attack success rate alone is a dangerously incomplete metric. NLR should become a standard part of any backdoor evaluation on agentic systems — because a trigger that can't tell its friends from its enemies is worse than useless.


Source: Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems
Domain: arxiv.org
