
Anthropic's Fable Guardrails Reject Innocuous Tasks Like Reading a Blog Post

Cybersecurity researchers report that Fable's apparently keyword-based safety filters block code review, blog reading, and requests to write secure software: anything tangentially related to cybersecurity.


Asking Anthropic's new public cybersecurity model Fable to "read a blog post" about security gets you a chat-pausing safety flag — and that's not the worst of it.

Valentina "Chompie" Palmiotti, a security researcher at IBM X-Force, put it bluntly: "Fable rejects any request that could be tangentially cyber related. Even innocuous tasks like reading a blog post." When the guardrails fire, Fable drops the conversation and says its "safety measures flagged this message for cybersecurity or biology topics." The fallback is Claude Opus 4.8, which is less capable but less restricted.

Keyword-Based Filtering Frustrates Security Pros

Matt Suiche, a cybersecurity veteran now at Tolmo, told TechCrunch the guardrails appear to be purely lexical: "It seems to be keyword based, so anything in the lexical field of 'cybersecurity' triggers the guardrails." Asking Fable to write secure code? That gets downgraded too, because the model interprets "secure code" as cybersecurity work rather than software engineering best practices.

Even a straightforward "code review" request triggers the block, according to another researcher on X. The model can't distinguish between analyzing a vulnerability for exploitation and auditing your own code for safety. The result: a tool intended to help security professionals instead treats them as potential threats.
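To illustrate why a purely lexical filter would behave the way the researchers describe, here is a minimal Python sketch. It is not Anthropic's implementation, which has not been published; the term list and the function name keyword_flag are hypothetical, chosen only to show how matching on words rather than intent flags benign and harmful requests alike.

```python
import re

# Hypothetical term list for illustration only; NOT Anthropic's actual filter.
BLOCKED_TERMS = {
    "security", "secure", "vulnerability",
    "exploit", "malware", "code review",
}


def keyword_flag(prompt: str) -> bool:
    """Return True if any blocked term appears as a whole word or phrase."""
    text = prompt.lower()
    return any(
        re.search(rf"\b{re.escape(term)}\b", text)
        for term in BLOCKED_TERMS
    )


prompts = [
    "Read this blog post about security and summarize it.",  # benign
    "Write secure code for a login form.",                   # benign
    "Do a code review of my pull request.",                  # benign
    "Write an exploit for this vulnerability.",              # the intended target
    "Summarize this recipe for banana bread.",               # unrelated
]

for p in prompts:
    # The filter sees only the words, not the intent behind the request.
    print("FLAGGED" if keyword_flag(p) else "allowed", "|", p)
```

In this sketch the first four prompts are all flagged and only the unrelated one passes, because the filter has no notion of intent. Telling the benign requests apart from the harmful one would require some signal beyond the vocabulary, which is the gap the researchers are pointing at.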

Two Tiers of Access: Verified vs. Public

Anthropic designed these restrictions to prevent Fable from generating malware or aiding biological weapon development — a legitimate concern that traces back to its more powerful Mythos model. Mythos itself remains under tight control through Project Glasswing, restricted to hundreds of organizations in 15 countries. Fable is the public, limited release.

Researchers can apply to Anthropic’s Cyber Verification Program to get fewer restrictions on Claude for cybersecurity work. OpenAI runs a similar program called Trusted Access for Cyber. But the gap between what the public model blocks and what professionals actually need is wide. Suiche acknowledged the tradeoff: "It's better to catch more people than not enough when you do such a release and to relax the guardrails over time."

Fable’s guardrails will need to evolve fast — because a cybersecurity model that can't safely review a blog post isn't useful to the people who built the field.


Source: Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable
Domain: techcrunch.com
