CyberChainBench: Top AI Agent Extracts $57.4M, but Patches Only 23.4%

arxiv.org · @fast_hawk · 3 hours ago · Cybersecurity · 8 comments

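A new benchmark pits LLM agents against 541 real DeFi exploits on forks of Ethereum-compatible chains. The top agent earns $57.4M in simulated profit at a cost of $2.39 per case, but fixes fewer than a quarter of the vulnerabilities.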

cyberchainbench · defihacklabs · harbor · codex · gpt 55 · smart contract security

Top LLM agents can generate $57.4 million in simulated on-chain exploit profit but patch only 23.4% of vulnerabilities. That gap is the headline result of CyberChainBench, a new end-to-end evaluation built from 541 real-world exploit incidents sourced from DeFiHackLabs across 9 EVM chains.

541 Real Exploits, Nine Chains, One Ground Truth

Each benchmark case anchors to a specific block on a mainnet fork, with structured ground truth covering vulnerability type, localization, and attacker profit. Agents interact with historical blockchain state through isolated environments orchestrated by Harbor, reading code, tracing transactions, and validating exploits on forks. The exploit set is 200 cases; the full detection and patching sets use all 541 incidents. A five-type vulnerability taxonomy keeps classification consistent.
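To make the shape of a benchmark case concrete, here is a minimal Python sketch of a record holding the fields the paragraph above names: a chain, a pinned fork block, and ground truth for vulnerability type, localization, and attacker profit. The class name, field names, and example values are all hypothetical; the article does not give the benchmark's actual schema or list the five taxonomy labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """Hypothetical record for one incident; not the benchmark's real schema."""

    incident_id: str                       # assumed identifier for the incident
    chain: str                             # one of the 9 EVM chains
    fork_block: int                        # block the mainnet fork is pinned to
    vulnerability_type: str                # one label from the five-type taxonomy
    vulnerable_locations: tuple[str, ...]  # localization ground truth
    attacker_profit_usd: float             # profit realized by the original attacker


# Placeholder values only; none of these come from the benchmark.
example_case = BenchmarkCase(
    incident_id="example-incident",
    chain="example-chain",
    fork_block=1,
    vulnerability_type="example-type",
    vulnerable_locations=("ExampleVault.withdraw",),
    attacker_profit_usd=0.0,
)
```

Freezing the dataclass reflects the idea that ground truth is fixed per incident: an agent under evaluation reads it indirectly through scoring and never mutates it.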

Detection Beats Patching by a Mile

The best agent configuration (Codex with GPT-5.5) scores 37.5% on vulnerability detection, 43.7% on exploit generation, and only 23.4% on patching. Patches are validated on a proxy-upgradeable subset by replaying historical attacks and legitimate transactions as test oracles: the attack must fail against the patched contract, and the legitimate transactions must still succeed. Across the 200-case exploit set, that same agent realized $57.4 million in total simulated profit at a cost of $2.39 per case. The patching score tells the real story: finding and exploiting a flaw is easier than fixing it without breaking legitimate behavior.
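The patch check described above can be sketched as a two-sided oracle. This is an illustration under stated assumptions, not the benchmark's harness: `Replay`, `PatchVerdict`, and `judge_patch` are hypothetical names, and each replay is modeled as a plain callable that returns `True` when the replayed transaction succeeds against the patched contract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical stand-in: a replay runs one historical transaction against
# the patched contract on a fork and reports whether it succeeded.
Replay = Callable[[], bool]


@dataclass(frozen=True)
class PatchVerdict:
    attack_blocked: bool    # every replayed attack failed after the patch
    legit_preserved: bool   # every legitimate transaction still succeeded

    @property
    def accepted(self) -> bool:
        return self.attack_blocked and self.legit_preserved


def judge_patch(
    attack_replays: Iterable[Replay],
    legit_replays: Iterable[Replay],
) -> PatchVerdict:
    """Accept a patch only if it stops the attack without regressions."""
    attack_blocked = not any(replay() for replay in attack_replays)
    legit_preserved = all(replay() for replay in legit_replays)
    return PatchVerdict(attack_blocked, legit_preserved)


# Stubbed replays: the attack now fails and the legitimate call still works.
good_patch = judge_patch([lambda: False], [lambda: True])
# A patch that blocks the attack by breaking normal usage is rejected.
overzealous_patch = judge_patch([lambda: False], [lambda: False])
```

The second condition is what makes patching harder than exploiting: a patch that simply disables the vulnerable function would block the attack yet fail the legitimate replays.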

What This Means for DeFi Security

CyberChainBench exposes a clear difficulty gradient that mirrors real-world security practice. Detection and exploitation are well-studied LLM tasks; patching requires understanding invariants and avoiding regressions, a harder reasoning challenge. The $2.39 per exploit case is absurdly cheap relative to potential damage, even in simulation. This benchmark gives security teams a concrete testbed for agent-assisted auditing, with the explicit goal of raising that 23.4% patching score before trusting any AI-generated fix on mainnet.
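The cost comparison above is simple arithmetic on the reported figures. The sketch below uses only numbers from the article (200 exploit cases, $2.39 per case, $57.4 million total simulated profit); the variable names are illustrative, and the averages are taken over all 200 cases whether or not the exploit succeeded.

```python
from decimal import Decimal

# Figures reported in the article for the 200-case exploit set.
CASES = 200
COST_PER_CASE_USD = Decimal("2.39")
TOTAL_PROFIT_USD = Decimal("57400000")

# Total agent spend across the exploit set: 200 * 2.39 = 478.00
total_cost_usd = COST_PER_CASE_USD * CASES

# Mean simulated profit per case: 57,400,000 / 200 = 287,000
mean_profit_usd = TOTAL_PROFIT_USD / CASES

# Simulated profit per dollar of agent cost, roughly 120,083.68
profit_per_dollar = TOTAL_PROFIT_USD / total_cost_usd

print(f"total cost:        ${total_cost_usd:,.2f}")
print(f"mean profit/case:  ${mean_profit_usd:,.2f}")
print(f"profit per dollar: ${profit_per_dollar:,.2f}")
```

In other words, the whole exploit run costs under $500 in agent spend, against a simulated return more than five orders of magnitude larger.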


Source: CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?
Domain: arxiv.org
