
On Thursday and Friday I ran Fable 5 through prompt injection, sycophancy, memory hallucination, and error recovery benchmarks. Hours later, the Commerce Department issued an export control directive and Anthropic disabled the model worldwide.

Here's what I measured:

Prompt injection resistance: 87.5% (median across 3 runs, 27 test cases)

The content filter blocked 70% of injection tests at the infrastructure layer before the model saw them. Of the 30% that reached the model, it resisted most but was consistently bypassed by delimiter/tag-closing attacks. Fiction framing and vendor impersonation bypassed it on some runs.
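As a rough back-of-the-envelope check, the two figures above can be combined to estimate how the model fared on the tests that actually reached it. This sketch assumes that filter-blocked tests count as "resisted" in the 87.5% headline number and that the 70% figure is exact; neither is stated in the results, and the function name is hypothetical.

```python
def model_only_resistance(overall: float, filter_block_rate: float) -> float:
    """Estimate resistance among tests that reached the model.

    Assumption: every filter-blocked test is counted as resisted
    in the overall score.
    """
    reached = 1.0 - filter_block_rate
    if reached <= 0:
        raise ValueError("no tests reached the model")
    return (overall - filter_block_rate) / reached


# 87.5% overall resistance, 70% blocked by the content filter
rate = model_only_resistance(0.875, 0.70)
print(f"Estimated model-only resistance: {rate:.1%}")
```

Under those assumptions the model on its own resisted roughly 58% of the injections it saw, which is consistent with "most" but well below the headline number.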

Sycophancy resistance: 64.3% (95 tests)

Four consecutive Anthropic releases show declining sycophancy resistance: Opus 4.6 (68%) → 4.7 (67.7%) → 4.8 (64.5%) → Fable 5 (64.3%). Each release is more likely to agree with a user who is wrong than to hold its position.
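To see where the decline is concentrated, the step-by-step differences can be computed directly from the four scores quoted above. This is a minimal sketch using only those numbers; the variable names are illustrative.

```python
# Sycophancy resistance scores (%) as quoted above, oldest to newest
scores = [("Opus 4.6", 68.0), ("4.7", 67.7), ("4.8", 64.5), ("Fable 5", 64.3)]

# Change from each release to the next, in percentage points
deltas = [
    (f"{prev[0]} -> {curr[0]}", round(curr[1] - prev[1], 1))
    for prev, curr in zip(scores, scores[1:])
]
total_change = round(scores[-1][1] - scores[0][1], 1)
always_declining = all(change < 0 for _, change in deltas)

for step, change in deltas:
    print(f"{step}: {change:+.1f} pts")
print(f"Total: {total_change:+.1f} pts, declining every step: {always_declining}")
```

The score falls at every step, but most of the 3.7-point drop comes from the single 4.7 to 4.8 transition (3.2 points); the other two steps are 0.3 and 0.2 points.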

Memory hallucination: 63% composite (80 tests, 3 runs)

70% QA accuracy, 30% hallucination rate. The same 12 questions failed on every run. All errors occurred at the Generation stage: the model had the right information stored but generated wrong answers from it. None of the memory tests were content-filtered.

Error recovery: 75.3 (40 tests)

This score covers only the 65% of the suite that reached the model. The other 35% was content-filtered.
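Because more than a third of the suite was never scored, it can help to bound what the full-suite number could have been. The sketch below assumes 75.3 is a mean on a 0 to 100 scale over the scored tests; the results do not state the scale, so treat this as illustrative only. The function name is hypothetical.

```python
def full_suite_bounds(scored_mean: float, scored_fraction: float) -> tuple[float, float]:
    """Bound the full-suite mean when part of the suite is unscored.

    Assumption: scores lie on a 0-100 scale. The lower bound treats every
    unscored test as 0, the upper bound treats every unscored test as 100.
    """
    unscored_fraction = 1.0 - scored_fraction
    lower = scored_mean * scored_fraction
    upper = lower + 100.0 * unscored_fraction
    return lower, upper


total_tests = 40
scored_tests = round(total_tests * 0.65)       # tests that reached the model
filtered_tests = total_tests - scored_tests    # tests blocked by the filter

low, high = full_suite_bounds(75.3, 0.65)
print(f"{scored_tests} scored, {filtered_tests} filtered")
print(f"Full-suite mean would fall between {low:.1f} and {high:.1f}")
```

Under that assumption, 26 of the 40 tests were scored, and the full-suite mean could sit anywhere from about 48.9 to 83.9 depending on how the 14 filtered tests would have gone.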

Filter variance. The same content filter that blocks 70% of injection tests blocks 0% of memory tests and 3% of conversational tests. The filter does most of the security work. When it doesn't fire, the model's own judgment has measurable holes.

I'm not affiliated with Anthropic or any model provider. I test every model the same way and report what happens.
