Cognitive Firewall Cuts LLM Jailbreak Success to 2% With Four-Gate Oversight

Attack success drops to 2% or below when a separate oversight model checks every turn of an LLM conversation against four categorical gates, according to the Cognitive Firewall framework detailed in arXiv:2607.01277.

Why Single-Message Checks Are Not Enough

Existing runtime guards treat prompts and responses as isolated messages. That leaves them blind to multi-turn strategies in which no single user query appears unsafe, yet accumulated intent, fabricated authority, or decomposed harmful objectives slip through. The Cognitive Firewall authors name this failure mode directly: single-message evaluation cannot recover accumulated intent or detect harm distributed across a dialogue.

The Four Gates and Escalation Logic

The framework interposes an independent oversight model between the user and the target LLM. Every request passes through four gates: an intent gate that identifies the operational objective, a zero-trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions combine through escalation rather than score averaging: any confident danger signal blocks the interaction, and every block carries an auditable rationale.
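The difference between escalation and score averaging is easiest to see in a toy example. The Python sketch below is an illustration only, not the paper's implementation: the names `GateResult`, `combine_by_escalation`, and `combine_by_averaging`, the numeric risk scores, and the 0.8 cutoff are all assumptions made for this example, since the source does not specify how gate signals are represented or what counts as "confident".

```python
from dataclasses import dataclass
from enum import IntEnum


class Verdict(IntEnum):
    ALLOW = 0
    BLOCK = 1


@dataclass(frozen=True)
class GateResult:
    gate: str        # which gate produced this result
    risk: float      # hypothetical risk score in [0, 1]
    rationale: str   # human-readable reason, kept for auditing


# Hypothetical cutoff for a "confident danger signal"; the source gives no number.
DANGER_THRESHOLD = 0.8


def combine_by_escalation(results):
    """Block if any single gate reports risk at or above the threshold."""
    flagged = [r for r in results if r.risk >= DANGER_THRESHOLD]
    if flagged:
        # The rationale names every gate that fired, so the block is auditable.
        rationale = "; ".join(f"{r.gate}: {r.rationale}" for r in flagged)
        return Verdict.BLOCK, rationale
    return Verdict.ALLOW, "no gate reported a confident danger signal"


def combine_by_averaging(results):
    """Baseline for contrast: block only if the mean risk crosses the threshold."""
    mean_risk = sum(r.risk for r in results) / len(results)
    verdict = Verdict.BLOCK if mean_risk >= DANGER_THRESHOLD else Verdict.ALLOW
    return verdict, f"mean risk {mean_risk:.2f}"


# Made-up scores for one turn: only the consistency gate is alarmed.
results = [
    GateResult("intent", 0.10, "objective looks benign"),
    GateResult("context", 0.20, "claimed admin role is unverified"),
    GateResult("consistency", 0.95, "requests escalate across turns"),
    GateResult("output_risk", 0.15, "draft response looks benign"),
]

print(combine_by_escalation(results))
print(combine_by_averaging(results))
```

With these made-up scores, escalation blocks the turn and cites the consistency gate, while averaging lets it through because three calm gates dilute the one alarmed gate. That dilution is the behavior the escalation rule is meant to avoid.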

Benchmark Results: 2% on Three Sets, 14% on the Hardest

Experiments span four jailbreak benchmarks covering single-turn, multi-turn, authority-based, and human-crafted attacks. The Cognitive Firewall lowers attack success to 2% or below on three of those sets. On the most difficult human-crafted set, the kind that mimics real adversarial creativity, attack success is 14%. The over-refusal rate on benign requests stays at 8%, a reasonable tradeoff for proactive containment.

Cognitive Firewall's decomposed, auditable gate design points toward runtime safety systems that treat every dialogue as a potential multi-turn attack, not a series of isolated prompts.

Source: Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety
Domain: arxiv.org
