CyberChainBench: Best AI Agent Exploits $57.4 Million but Patches Only 23.4%

arxiv.org · @fast_hawk · 3 hours ago · Cybersecurity

A new benchmark pits LLM agents against 541 real DeFi exploits on mainnet forks; the top agent racks up $57.4 million in simulated profit at $2.39 per case but patches fewer than a quarter of the vulnerabilities.

cyberchainbench · defihacklabs · harbor · codex · gpt 55 · smart contract security

The best LLM agent tested generated $57.4 million in simulated on-chain exploit profit but patched only 23.4% of vulnerabilities. That gap is the headline result of CyberChainBench, a new end-to-end evaluation built on 541 real-world exploit incidents sourced from DeFiHackLabs across 9 EVM chains.

541 Real Exploits, Nine Chains, One Ground Truth

Each benchmark case is anchored to a specific block on a mainnet fork, with structured ground truth covering vulnerability type, localization, and attacker profit. Agents work against historical blockchain state inside isolated environments orchestrated by Harbor, where they read code, trace transactions, and validate exploits on forks. The exploit set is 200 cases; the full detection and patching sets use all 541 incidents. A five-type vulnerability taxonomy keeps classification consistent.
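To make the case structure concrete, here is a minimal Python sketch of what one benchmark record might hold. The class name, field names, and helper are hypothetical, chosen for illustration only; the benchmark's actual schema is not described here, and the sample values are placeholders rather than a real incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """Hypothetical record for one incident (all field names are assumptions)."""
    incident_id: str            # identifier of the incident
    chain: str                  # one of the 9 EVM chains
    fork_block: int             # block the mainnet fork is pinned to
    vulnerability_type: str     # one label from the five-type taxonomy
    vulnerable_location: str    # ground-truth localization, e.g. contract and function
    attacker_profit_usd: float  # ground-truth attacker profit


def total_ground_truth_profit(cases):
    """Sum the recorded attacker profit over a collection of cases."""
    return sum(case.attacker_profit_usd for case in cases)


# Placeholder values only; this is not a real incident from the benchmark.
example = BenchmarkCase(
    incident_id="example-001",
    chain="ethereum",
    fork_block=1_000_000,
    vulnerability_type="placeholder-type",
    vulnerable_location="ExampleVault.withdraw",
    attacker_profit_usd=1234.5,
)
```

Pinning every case to a fork block is what makes results reproducible: an exploit or patch is always replayed against the same historical state.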

Detection Beats Patching by a Mile

The best agent configuration (Codex with GPT-5.5) scores 37.5% on vulnerability detection, 43.7% on exploit generation, and a limp 23.4% on patching. Patches are validated on a proxy-upgradeable subset by replaying historical attacks and legitimate transactions as test oracles: the attack must no longer succeed, and the legitimate transactions must still go through. Across the 200-case exploit set, that same agent realized $57.4 million in total simulated profit at a cost of $2.39 per case. The patching performance tells the real story: detecting and exploiting are easier than fixing without breaking legitimate behavior.
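The acceptance rule behind that kind of patch validation can be sketched in a few lines of Python. This is an illustrative reading of the replay-based check, not the benchmark's actual harness: `ReplayResult` and `patch_is_valid` are hypothetical names, and the replay outcomes are assumed to come from some external fork-replay step that is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplayResult:
    """Outcome of replaying one transaction against the patched contract."""
    tx_label: str
    succeeded: bool


def patch_is_valid(attack_replays: List[ReplayResult],
                   legit_replays: List[ReplayResult]) -> bool:
    """Accept a patch only if it blocks every attack and breaks nothing legitimate.

    Assumption: at least one attack replay is required, so an empty
    oracle set never counts as a passing patch.
    """
    if not attack_replays:
        return False
    attacks_blocked = all(not r.succeeded for r in attack_replays)
    legit_preserved = all(r.succeeded for r in legit_replays)
    return attacks_blocked and legit_preserved


# A patch that stops the attack and keeps normal usage working is accepted.
good = patch_is_valid(
    [ReplayResult("historical-attack", succeeded=False)],
    [ReplayResult("user-deposit", succeeded=True)],
)

# A patch that stops the attack by reverting everything is rejected.
too_strict = patch_is_valid(
    [ReplayResult("historical-attack", succeeded=False)],
    [ReplayResult("user-deposit", succeeded=False)],
)
```

The second case shows why patching is the harder task: disabling the vulnerable function blocks the attack, but it fails the legitimate-transaction oracle.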

What This Means for DeFi Security

CyberChainBench exposes a clear difficulty gradient that mirrors real-world security practice. Detection and exploitation are well-studied LLM tasks; patching requires understanding invariants and avoiding regressions, which is a harder reasoning challenge. At $2.39 per exploit case, the attack cost is absurdly cheap relative to the potential damage, even in simulation. The benchmark gives security teams a concrete testbed for agent-assisted auditing, and a clear bar to raise: that 23.4% patching score needs to climb before anyone trusts an AI-generated fix on mainnet.
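A quick back-of-the-envelope calculation shows how lopsided the economics are. The sketch below uses only the figures reported above and assumes the $2.39 figure is an average over all 200 cases in the exploit set; the variable names are illustrative.

```python
# Figures reported for the best agent on the 200-case exploit set.
EXPLOIT_CASES = 200
COST_PER_CASE_USD = 2.39             # assumed to be a per-case average
TOTAL_SIMULATED_PROFIT_USD = 57.4e6

total_cost_usd = EXPLOIT_CASES * COST_PER_CASE_USD
mean_profit_per_case_usd = TOTAL_SIMULATED_PROFIT_USD / EXPLOIT_CASES
profit_per_dollar = TOTAL_SIMULATED_PROFIT_USD / total_cost_usd

print(f"total agent cost:       ${total_cost_usd:,.2f}")            # $478.00
print(f"mean profit per case:   ${mean_profit_per_case_usd:,.2f}")  # $287,000.00
print(f"simulated profit per $: {profit_per_dollar:,.0f}")          # 120,084
```

The per-case mean averages over every case in the set, including those where the agent's exploit did not succeed, so it understates the profit on successful cases.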


Source: CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?
Domain: arxiv.org
