
Claude Fable 5 Flags "Cancer" as a Biosecurity Risk; Developers Revolt

Anthropic's latest model routes about 0.05% of queries to a weaker model, but false positives on benign requests such as résumé editing and RNA sequencing have developers crying foul.

Tags: Anthropic, Claude Fable 5, Mythos family, Claude Opus 4.8, AI safety, AI jailbreaking

Anthropic's Claude Fable 5 flags the word "cancer" as a biosecurity risk and routes the affected queries to a weaker model. Derya Unutmaz, a scientist, reported this on X within two days of the model's launch. Developers also reported rejections of requests involving RNA sequencing data for sheep, résumé editing, and shopping lists.

The Safeguard That Cried Wolf

Fable 5 is the first public model derived from Anthropic's Mythos family. During training, the original Mythos showed unusual skill at finding and exploiting software bugs to disrupt or take control of systems. That scared Anthropic enough to group cybersecurity with biology and chemistry as high-risk domains. When Fable 5 detects a sensitive prompt in those areas, it downgrades the query to Claude Opus 4.8, a less capable model with its own guardrails. Anthropic says this affects about 0.05% of queries, or roughly one in 2,000. But the classifier is so aggressive that researcher Bojan Tunguz called it "Anthropic overlords deciding which prompts the peasants are allowed to use."
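To make the routing concrete, here is a minimal Python sketch of the behavior described above. Anthropic has not published its implementation, so everything in it is an assumption for illustration: the model identifier strings, the `route` function, and the idea that a classifier hands back a domain label. Only the three high-risk domains, the fallback to Opus 4.8, and the roughly 0.05% rate come from the reporting.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for illustration only; not Anthropic's API.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"

# The three domains the article says Anthropic treats as high-risk.
HIGH_RISK_DOMAINS = {"biology", "chemistry", "cybersecurity"}

# "About 0.05% of queries" is one query in 2,000.
DOWNGRADE_ONE_IN = 2000


@dataclass
class RoutingDecision:
    model: str
    reason: Optional[str]  # set only when the query is downgraded


def route(flagged_domain: Optional[str]) -> RoutingDecision:
    """Choose a model from the domain an (assumed) classifier flagged."""
    if flagged_domain in HIGH_RISK_DOMAINS:
        return RoutingDecision(FALLBACK_MODEL, f"flagged as {flagged_domain}")
    return RoutingDecision(PRIMARY_MODEL, None)


def expected_downgrades(total_queries: int) -> float:
    """Expected number of downgraded queries at the reported rate."""
    return total_queries / DOWNGRADE_ONE_IN
```

At that rate, a service handling one million queries would expect about 500 downgrades. A false positive in this sketch is simply a benign prompt, such as one mentioning cancer, for which the classifier returns `"biology"`.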

Why Hidden vs. Visible Guardrails Backfired

Anthropic designed these classifiers to be hidden: less visible, and therefore harder to probe and work around. The company's own statement explains the tradeoff: a hidden safeguard can be narrowly targeted, but a visible one needs a wider net to be robust. "We made the wrong tradeoff and we apologize," Anthropic says. The company says it is now refining the classifiers so that fewer benign queries are flagged. For Claude subscribers, downgrades to Opus 4.8 will be made more obvious, and API developers will see a reason when the model refuses a request.

Jailbreak Within 48 Hours

On X, researcher Pliny the Liberator claimed to bypass Fable 5's filters roughly 24 to 48 hours after launch. Pliny used a multi-agent approach involving a previously jailbroken Claude Opus 4.8, plus techniques like query decomposition, long-context framing, fiction, and academic taxonomies. Before launch, Anthropic said over 1,000 hours of internal and external red-teaming, including bug bounties, found no universal jailbreaks. The company acknowledges that preventing all sophisticated multi-turn or agentic attacks is likely not possible. Anthropic continues to refine its classifiers, but the cat-and-mouse game shows no sign of slowing.


Source: Anthropic's Claude Fable 5 plays it too safe on safety, developers say
Domain: fastcompany.com
