Cognitive Firewall Cuts LLM Jailbreak Success to 2% With Four-Gate Oversight

A new zero-trust framework decomposes safety into four gates (intent, context, consistency, and output risk) and drives attack success to 2% or below on three benchmarks while keeping false refusals at 8%.

cognitive firewall, llm safety, zero trust, jailbreak attacks, multi gate framework, large language models

Attack success drops to 2% or below on three of four jailbreak benchmarks when a separate oversight model checks every turn of an LLM conversation against four categorical gates, according to the Cognitive Firewall framework detailed in arXiv:2607.01277.

Why Single-Message Checks Are Not Enough

Existing runtime guards treat prompts and responses as isolated messages. That leaves them blind to multi-turn strategies in which no single user query appears unsafe, but accumulated intent, fabricated authority, or decomposed harmful objectives slip through. The Cognitive Firewall authors name this failure mode directly: single-message evaluation cannot recover accumulated intent or detect harm distributed across a dialogue.

The Four Gates and Escalation Logic

The framework interposes an independent oversight model between user and target LLM. Every request passes through four gates: an intent gate that identifies the operational objective, a zero-trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions combine through escalation rather than score averaging — any confident danger signal blocks the interaction, and every block carries an auditable rationale.
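The difference between escalation and score averaging can be shown with a minimal Python sketch. It is not the authors' implementation: the `GateResult` type, the numeric danger scores, and the `BLOCK_THRESHOLD` cutoff are all assumptions made for illustration, and the article does not say how the gates express confidence.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass
class GateResult:
    gate: str         # "intent", "context", "consistency", or "output_risk"
    danger: float     # hypothetical danger score in [0, 1]
    rationale: str    # human-readable reason, kept for auditing


# Assumed cutoff for what counts as a "confident danger signal".
BLOCK_THRESHOLD = 0.8


def escalate(results):
    """Block as soon as any single gate is confident; return the rationale."""
    flagged = [r for r in results if r.danger >= BLOCK_THRESHOLD]
    if flagged:
        worst = max(flagged, key=lambda r: r.danger)
        return Verdict.BLOCK, f"{worst.gate}: {worst.rationale}"
    return Verdict.ALLOW, "no gate raised a confident danger signal"


def average(results):
    """Score averaging, included only as a contrast to escalation."""
    mean = sum(r.danger for r in results) / len(results)
    return Verdict.BLOCK if mean >= BLOCK_THRESHOLD else Verdict.ALLOW


# Made-up scores for one turn of a multi-turn attack: only the
# consistency gate sees the escalation across earlier turns.
turn = [
    GateResult("intent", 0.10, "stated objective looks benign"),
    GateResult("context", 0.20, "claimed role is unverified"),
    GateResult("consistency", 0.95, "request completes a decomposed harmful task"),
    GateResult("output_risk", 0.10, "candidate response looks harmless in isolation"),
]

print(escalate(turn))   # blocks, citing the consistency gate
print(average(turn))    # the mean score is diluted by the three quiet gates
```

With these illustrative scores, the single confident gate is enough to block under escalation, while averaging lets the turn through because three low scores dilute the one high score.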

Benchmark Results: 2% on Three Sets, 14% on the Hardest

Experiments span four jailbreak benchmarks covering single-turn, multi-turn, authority-based, and human-crafted attacks. The Cognitive Firewall lowers attack success to 2% or below on three of those sets. On the most difficult human-crafted set, the kind that mimics real adversarial creativity, attack success sits at 14%. The over-refusal rate on benign requests stays at 8%, a reasonable tradeoff for proactive containment.
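For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed, assuming the usual definitions: attack success rate is the share of attack attempts that get a harmful response through, and over-refusal rate is the share of benign requests that are blocked. The outcome lists are hypothetical counts chosen to mirror the reported percentages; they are not data from the paper, and the article does not give benchmark sizes.

```python
def rate(flags):
    """Return the fraction of True entries in a non-empty list of booleans."""
    if not flags:
        raise ValueError("need at least one outcome")
    return sum(flags) / len(flags)


# Hypothetical outcomes, NOT taken from the paper.
attack_succeeded = [True] * 2 + [False] * 98   # 2 of 100 attacks get through
benign_refused = [True] * 8 + [False] * 92     # 8 of 100 benign requests blocked

print(f"attack success rate: {rate(attack_succeeded):.0%}")
print(f"over-refusal rate:   {rate(benign_refused):.0%}")
```

The two rates are measured on different request pools, which is why a system can score well on one and poorly on the other.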

Cognitive Firewall's decomposed, auditable gate design points toward runtime safety systems that treat every dialogue as a potential multi-turn attack, not a series of isolated prompts.


Source: Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety
Domain: arxiv.org
