CyberChainBench: Best AI Agent Exploits $57.4 Million but Patches Only 23.4%

The top LLM agent can realize $57.4 million in simulated on-chain exploit profit, yet it patches only 23.4% of vulnerabilities. That gap is the headline finding of CyberChainBench, a new end-to-end benchmark built on 541 real-world exploit incidents sourced from DeFiHackLabs across 9 EVM chains.

541 Real Exploits, Nine Chains, One Ground Truth

Each benchmark case anchors to a specific block on a mainnet fork, with structured ground truth covering vulnerability type, localization, and attacker profit. Agents interact with historical blockchain state through isolated environments orchestrated by Harbor, reading code, tracing transactions, and validating exploits on forks. The exploit set is 200 cases; the full detection and patching sets use all 541 incidents. A five-type vulnerability taxonomy keeps classification consistent.
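To make the case structure concrete, here is a minimal Python sketch of what one benchmark record might look like. The class name, field names, and helper function are hypothetical illustrations and are not taken from the paper; only the kinds of information (fork block, vulnerability type, localization, attacker profit) come from the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """One incident pinned to a historical fork point (hypothetical schema)."""
    incident_id: str            # illustrative identifier, not the paper's naming
    chain: str                  # one of the 9 EVM chains
    fork_block: int             # block at which the mainnet fork is taken
    vulnerability_type: str     # label from the five-type taxonomy
    location: str               # where the flaw lives, e.g. "Contract.function"
    attacker_profit_usd: float  # profit recorded for the historical attack


def cases_per_chain(cases):
    """Count how many benchmark cases fall on each chain."""
    return Counter(case.chain for case in cases)
```

Pinning each record to a single fork block is what lets every agent run see the same historical state, so results stay comparable across runs.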

Detection Beats Patching by a Mile

The best agent configuration (Codex with GPT-5.5) scores 37.5% on vulnerability detection, 43.7% on exploit generation, and a limp 23.4% on patching. Patches are validated on a proxy-upgradeable subset by replaying both historical attacks and legitimate transactions as fail-to-pass test oracles. Across the 200-case exploit set, that same agent realized $57.4 million in total simulated profit at a cost of $2.39 per case. The patching performance tells the real story: detecting and exploiting a flaw is easier than fixing it without breaking legitimate behavior.
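The replay-based check can be sketched in a few lines of Python. This is an illustrative model only, not the benchmark's actual harness: the `Replay` type and `patch_is_valid` function are hypothetical, and the sketch assumes each replay reports `True` when the transaction still achieves its original effect against the patched contract.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for replaying one transaction on a fork with the
# patch applied. Returns True if the transaction still has its original effect.
Replay = Callable[[], bool]


def patch_is_valid(
    attack_replays: Iterable[Replay],
    legit_replays: Iterable[Replay],
) -> bool:
    """Accept a patch only if attacks are blocked and legitimate use survives."""
    # Every historical attack must stop working once the patch is in place.
    attacks_blocked = all(not replay() for replay in attack_replays)
    # Every legitimate transaction must keep working.
    legit_preserved = all(replay() for replay in legit_replays)
    return attacks_blocked and legit_preserved
```

Under this model, a patch that simply disables the contract blocks the attack but fails the legitimate replays, which is one way a fix can be rejected even though the vulnerability is gone.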

What This Means for DeFi Security

CyberChainBench exposes a clear difficulty gradient that mirrors real-world security practice. Detection and exploitation are well-studied LLM tasks; patching requires understanding invariants and avoiding regressions, a harder reasoning challenge. The $2.39 per exploit case is absurdly cheap relative to potential damage, even in simulation. This benchmark gives security teams a concrete testbed for agent-assisted auditing, with the explicit goal of raising that 23.4% patching score before trusting any AI-generated fix on mainnet.
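For a sense of scale, the cost figure can be set against the simulated profit with some quick arithmetic. The sketch below assumes the $2.39 figure applies to each of the 200 exploit cases, failed attempts included; the variable and function names are illustrative.

```python
# Figures reported for the best agent on the exploit set.
EXPLOIT_CASES = 200
COST_PER_CASE_USD = 2.39
TOTAL_SIMULATED_PROFIT_USD = 57.4e6


def exploit_economics(cases, cost_per_case, total_profit):
    """Return (total cost, average profit per case, profit per dollar spent)."""
    total_cost = cases * cost_per_case
    avg_profit = total_profit / cases
    profit_per_dollar = total_profit / total_cost
    return total_cost, avg_profit, profit_per_dollar


total_cost, avg_profit, profit_per_dollar = exploit_economics(
    EXPLOIT_CASES, COST_PER_CASE_USD, TOTAL_SIMULATED_PROFIT_USD
)
print(f"Total cost: ${total_cost:,.2f}")
print(f"Average simulated profit per case: ${avg_profit:,.0f}")
print(f"Simulated profit per dollar spent: ${profit_per_dollar:,.0f}")
```

Under that assumption the whole exploit run costs under $500, while the simulated profit is on the order of a hundred thousand dollars for every dollar spent.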

Source: CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?
Domain: arxiv.org
