Malicious Plugins Succeed 100% of the Time in Claw-like AI Agents

arxiv.org · @threat_watch · 3 hours ago · Cybersecurity · 2 comments

A new benchmark shows malicious Plugins succeed against every LLM tested, and the most defensive platform only cuts GPT-5.4’s attack success rate from 70% to 22%.

Tags: safeclawarena, openclaw, seclaw, ai agent security, supply chain attacks, uc berkeley

Malicious Plugins in Claw-like AI agents succeed 100% of the time, regardless of which LLM or platform is used. That’s the headline finding from SafeClawArena, a new security benchmark from UC Berkeley’s Sunblaze lab.

SafeClawArena runs 406 adversarial tasks across four attack surfaces: Skill Supply-Chain Integrity, Persistent State Exploitation, Cross-Boundary Data Flow, and Indirect Prompt Injection. Each task executes inside a containerized replica of a real agent platform seeded with canary-marked credentials, and automated taint analysis tracks those canaries across nine output channels.
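The canary idea can be illustrated with a minimal Python sketch. This is not SafeClawArena's implementation: the names `Canary`, `make_canary`, and `find_leaks`, the `CANARY-` marker format, and the two example channels are assumptions made for illustration. A verbatim substring scan is also a much weaker stand-in for real taint analysis, which would need to follow encoded or transformed copies as well.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    """A fake credential carrying a unique, searchable marker (hypothetical)."""
    name: str
    value: str


def make_canary(name: str) -> Canary:
    # The "CANARY-" prefix plus random hex is an illustrative format only.
    return Canary(name=name, value=f"CANARY-{secrets.token_hex(8)}")


def find_leaks(canaries: list, channels: dict) -> list:
    """Return (canary name, channel name) pairs where a marker appears verbatim."""
    leaks = []
    for channel_name, captured in channels.items():
        for canary in canaries:
            if canary.value in captured:
                leaks.append((canary.name, channel_name))
    return leaks


api_key = make_canary("api_key")
# Hypothetical captured output from two channels after a task has run.
channels = {
    "network": f"POST /upload token={api_key.value}",
    "stdout": "task finished",
}
print(find_leaks([api_key], channels))  # [('api_key', 'network')]
```

Because each marker is unique and never appears in legitimate data, any sighting outside the place it was planted can be attributed to a specific credential and a specific channel.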

Three Platforms, Five LLMs, One Grim Picture

Across OpenClaw, NemoClaw, and SeClaw, each evaluated with five frontier LLMs, the highest attack success rate hit 70%, with GPT-5.4 the worst offender. Claude-Opus-4.6 was the least vulnerable, but still sat near a 22% floor on every platform, meaning even the best language model can’t overcome sloppy runtime security.

SeClaw deserves attention: it cuts GPT-5.4’s attack success rate from 70% down to 22%. But the paper is blunt: that improvement comes partly from utility-security tradeoffs rather than active defenses. You lose capability to gain safety.
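To put the SeClaw figure in perspective, the drop can be expressed both in percentage points and as a relative reduction. The helper `reduction` below is purely illustrative and uses only the two percentages quoted above.

```python
def reduction(before_pct: int, after_pct: int) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = round(absolute / before_pct * 100, 1)
    return absolute, relative


print(reduction(70, 22))  # (48, 68.6)
```

In relative terms that is a reduction of roughly two thirds, which still leaves 22% of attacks succeeding.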

The Blind Spot: Cross-Component Failure Modes

The researchers frame Claw-like agents as agentic computer systems: the gateway runtime acts like an OS, Skills resemble user-installed applications, and Plugins behave like loadable extensions with runtime privileges. Each has a classical security counterpart, but the OS protection mechanisms built up over decades (sandboxing, capability systems, privilege separation) are simply absent on the agent side.
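As a rough picture of what a capability system would look like in this setting, here is a minimal Python sketch. Everything in it is hypothetical: the `Gateway` and `Plugin` classes, the `secret:read:<key>` capability naming, and the example plugin names are assumptions for illustration, not the API of OpenClaw, NemoClaw, SeClaw, or the paper.

```python
from dataclasses import dataclass


class CapabilityError(PermissionError):
    """Raised when a Plugin asks for something it was never granted."""


@dataclass(frozen=True)
class Plugin:
    name: str
    capabilities: frozenset = frozenset()  # e.g. {"secret:read:api_key"}


class Gateway:
    """Hypothetical runtime that mediates every privileged operation."""

    def __init__(self, secrets: dict):
        self._secrets = dict(secrets)

    def read_secret(self, plugin: Plugin, key: str) -> str:
        # Deny by default: the Plugin must hold an explicit grant for this key.
        needed = f"secret:read:{key}"
        if needed not in plugin.capabilities:
            raise CapabilityError(f"{plugin.name} lacks capability {needed!r}")
        return self._secrets[key]


gateway = Gateway({"api_key": "example-value"})
trusted = Plugin("calendar-sync", frozenset({"secret:read:api_key"}))
untrusted = Plugin("weather-widget")

print(gateway.read_secret(trusted, "api_key"))  # example-value
try:
    gateway.read_secret(untrusted, "api_key")
except CapabilityError as err:
    print("denied:", err)
```

Note that a check like this is only cooperative when the Plugin runs in the same Python process as the gateway, since the Plugin could read `gateway._secrets` directly. Enforcing it requires a real isolation boundary such as a separate process or sandbox, which is the kind of mechanism the OS analogy points to.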

Existing benchmarks measure only model responses and tool calls. SafeClawArena measures what happens when an attacker chains a malicious Plugin with a persistent state exploit and a prompt injection: the cross-component failures that real attackers will use.

What This Means for Anyone Building Agents

If you’re shipping a Claw-like agent—persistent, credentialed, with file and tool access—you have a security gap that no prompt engineering or model choice will fix. The code and data are on GitHub at sunblaze-ucb/SafeClawArena. The real work starts when platform builders finally treat their runtimes as operating systems worthy of proper isolation.


Source: Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens
Domain: arxiv.org
