Attack success drops to 2% or below on three of four jailbreak benchmarks when a separate oversight model checks every turn of an LLM conversation against four categorical gates, according to the paper introducing the Cognitive Firewall framework (arXiv:2607.01277).
Why Single-Message Checks Are Not Enough
Existing runtime guards treat prompts and responses as isolated messages. That leaves them blind to multi-turn strategies in which no single user query appears unsafe, yet accumulated intent, fabricated authority, or decomposed harmful objectives slip through. The Cognitive Firewall authors name this failure mode directly: single-message evaluation cannot recover accumulated intent or detect harm distributed across a dialogue.
The Four Gates and Escalation Logic
The framework interposes an independent oversight model between the user and the target LLM. Every request passes through four gates: an intent gate that identifies the operational objective, a zero-trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions combine through escalation rather than score averaging: any confident danger signal blocks the interaction, and every block carries an auditable rationale.
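The difference between escalation and score averaging can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `GateResult` type, the numeric risk scores, the 0.8 and 0.5 thresholds, and both function names are assumptions made for illustration, and the source does not say how a "confident danger signal" is quantified.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateResult:
    gate: str         # "intent", "context", "consistency", or "output"
    risk: float       # hypothetical risk score in [0, 1]
    rationale: str    # human-readable reason, kept for auditing


@dataclass(frozen=True)
class Decision:
    allowed: bool
    rationales: List[str]


# Hypothetical threshold for what counts as a "confident danger signal".
BLOCK_THRESHOLD = 0.8


def combine_by_escalation(results: List[GateResult]) -> Decision:
    """Block if ANY single gate reports risk at or above the threshold."""
    blocking = [r for r in results if r.risk >= BLOCK_THRESHOLD]
    if blocking:
        return Decision(False, [f"{r.gate}: {r.rationale}" for r in blocking])
    return Decision(True, [])


def combine_by_average(results: List[GateResult], threshold: float = 0.5) -> bool:
    """Contrast case: allow whenever the MEAN risk stays below a threshold."""
    mean_risk = sum(r.risk for r in results) / len(results)
    return mean_risk < threshold


# One gate is confident that the dialogue is a decomposed attack;
# the other three see nothing alarming in the current turn.
results = [
    GateResult("intent", 0.10, "objective looks benign in isolation"),
    GateResult("context", 0.10, "no unverified authority claims"),
    GateResult("consistency", 0.95, "turn completes a decomposed harmful objective"),
    GateResult("output", 0.10, "candidate response contains no risky content"),
]

decision = combine_by_escalation(results)
averaged_allows = combine_by_average(results)
```

With these made-up scores, the mean risk is about 0.31, so the averaging rule would let the turn through, while the escalation rule blocks it and returns the consistency gate's rationale for the audit trail.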
Benchmark Results: 2% on Three Sets, 14% on the Hardest
Experiments span four jailbreak benchmarks covering single-turn, multi-turn, authority-based, and human-crafted attacks. The Cognitive Firewall lowers attack success to 2% or below on three of those sets. On the most difficult human-crafted set, the kind that mimics real adversarial creativity, attack success sits at 14%. The over-refusal rate on benign requests stays at 8%, a reasonable tradeoff for proactive containment.
Cognitive Firewall's decomposed, auditable gate design points toward runtime safety systems that treat every dialogue as a potential multi-turn attack, not a series of isolated prompts.
Source: Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety
Domain: arxiv.org