Prompt Engineering

Claude Fable 5 Is So Capable You're Paying 2x to Run Scaffolding It Doesn't Need. Cut These 5 (Keep 1). (June 2026)

Published on June 10, 2026·10 min read

On June 9, 2026, Anthropic released Claude Fable 5, the most capable model it has ever made generally available [1]. It is a Mythos-class model with safety classifiers layered on top, so a request that touches cybersecurity, biology, chemistry or distillation quietly gets answered by Claude Opus 4.8 instead, in fewer than 5% of sessions [1]. It hits 80.3% on SWE-bench Pro where GPT-5.5 sits at 58.6%, rebuilds a web app's source code from screenshots alone, extracts precise values from scientific figures, and finished Pokemon FireRed from raw pixels with no helper tools [1]. Anthropic's own line is the one to remember: "the longer and more complex the task, the larger Fable 5's lead over our other models" [1]. It is on the API now and free on Pro, Max and Team plans through June 22, so you can test this claim yourself this week.

It is also expensive. $10 per million input tokens,$ 50 per million output, roughly double Opus 4.8 and the priciest major model on the market [1][2]. Simon Willison spent $110.42 in a single day of testing and called it "a beast" that is slow and costly [2].

Here is the counter-intuitive part, and the reason I am writing this instead of another benchmark recap: a meaningful chunk of your prompts is scaffolding you added to compensate for weaker models, and on Fable 5 that scaffolding is now paying premium prices to do nothing. The move is not "delete your scaffolding." It is to separate the scaffolding that compensated for a weakness from the scaffolding that enforces a contract you still need, cut the first, and keep the second.

Two reasons scaffolding exists, only one is now obsolete

Most of the structure we bolt onto a prompt comes from one of two motives, and we rarely label which.

Compensation scaffolding exists because the model could not do something on its own. You broke a task into ten small prompts because the model lost the thread on the whole thing. You ran an OCR pass because it could not read the figure. You added a "now double-check your work" loop because it was unreliable. This scaffolding is a workaround for a capability gap, and when the gap closes the scaffolding becomes pure overhead.

Contract scaffolding exists because you need a guarantee independent of how smart the model is. A JSON schema the downstream code depends on. A null slot the model must return when there is no answer. A policy guard-rail. File-based memory that persists state across a long task. A stronger model does not make these unnecessary, because they were never about capability. They were about a contract.

The capability jump in Fable 5 collapses a lot of compensation scaffolding and zero contract scaffolding. The whole job is telling them apart. Below are five compensation patterns worth cutting, anchored to a specific Fable 5 capability, and one contract pattern people will cut by mistake.

Compensation versus contract: on the left, five scaffolds Fable 5 makes obsolete (manual task decomposition, OCR and pre-parsing, defensive self-verify loops, redundant few-shot examples, inflated expert role-priming) marked CUT; on the right, the scaffolds that enforce a contract independent of model strength (output schemas, file-based memory, policy guard-rails, null slots) marked KEEP

Your prompts can improve. Promptimizer rewrites and auto-tests them for you.

Try it free

Five scaffolds to cut

1. Manual decomposition into a chain of micro-prompts

The classic workaround for a model that lost the thread on a long task was to break it into ten small prompts and glue the outputs together yourself. Fable 5 "can work autonomously for longer than any previous Claude models" across millions of tokens of context [1], and Anthropic's own framing is that its lead grows precisely as the task gets longer and more complex [1]. The orchestration you were doing by hand was compensating for a short horizon that is no longer short. Hand it the whole task and let it carry the state. This is not the same as the "think step by step" phrasing I covered for GPT-5.5; that was about a phrase. This is about the manual sub-task plumbing wrapped around the model.

2. OCR and pre-parsing passes on documents and images

If your pipeline runs an OCR step, a table extractor, or a "describe this chart" preprocessing call before the real prompt, that was compensation for weak vision. Fable 5 is state of the art on vision: it extracts precise numbers straight from scientific figures and reconstructs a web app's source from screenshots alone [1]. The pre-extraction layer is now a lossy middleman sitting between the model and the pixels it reads better than your extractor does. Send the image.

3. Defensive self-verify loops

"Now re-read your answer and check it for errors, then verify again" was a reliability patch. It costs you a full extra generation pass every time, and on a $50-per-million-output model that pass is expensive. Fable 5's accuracy makes the blanket re-verify a poor trade: you are paying premium output tokens to re-derive an answer that was already right most of the time. Note this is different from the honesty upgrade I wrote about for Opus 4.8, where the model flags its own uncertainty unprompted; here the point is to drop the external re-verify scaffold, not to trust blindly. Keep verification where being wrong is genuinely expensive, and there gate it on the task, not on every call.

4. Redundant few-shot examples

Stacking five or six examples to teach a pattern was compensation for a model that would not infer the shape from one. A more capable model infers it from one clean example, and the extra four become input tokens you pay for on every call plus a subtle bias that narrows the output toward your samples. I argued the N=3-versus-N=1 version of this as a cost trick before; on Fable the case is stronger because the capability, not just the bill, says the extras are noise. Cut to one good example, or to a crisp description of the contract, and measure.

5. Inflated defensive role-priming

The techniques you're reading about work. Test your prompts now with Prompt Score and see your score in real time.

Test your prompts

"You are a meticulous senior engineer with twenty years of experience who never makes mistakes and triple-checks everything" is a confidence spell cast at a model that needed talking into competence. On a frontier model it is dead weight: it spends tokens, it does not change behavior much, and the part that does change is rarely the part you wanted. State the actual role and constraints that matter for the task and stop performing reassurance at the model.

The one to keep: state and contract scaffolding

Stat cards: Claude Fable 5 scores 80.3% on SWE-bench Pro versus 58.6% for GPT-5.5, costs roughly 2x Claude Opus 4.8 at 10 in and 50 out per million tokens, and reached the end of Pokemon FireRed 3x more often when given file-based memory

Here is the trap. The instinct after reading the five above is "great, a smarter model needs less of everything, strip it all." That is exactly wrong for one category, and Anthropic gave us the cleanest possible counter-example. When Fable 5 was given persistent file-based memory during the Pokemon run, it reached end-game states three times more often than without it [1]. The smartest publicly available model still tripled its results with state scaffolding.

That is because memory is contract scaffolding, not compensation. It is not patching a weakness the model would otherwise overcome with more intelligence; it is giving the task a place to keep state that no amount of raw capability replaces. The same holds for the output schema your downstream parser depends on, the null slot that prevents a confident hallucination in an empty case, and the policy guard-rails that encode rules rather than ability. Cut those and you do not save money, you remove a guarantee.

So the keep list is short and principled: anything that enforces a contract or persists state survives the model upgrade untouched. Everything that was a workaround for a gap goes.

Why the cut matters more at $50 a million

On a cheap model, leftover compensation scaffolding is sloppy but mostly harmless: a few wasted input tokens at a low rate. On Fable 5 the math changes. Every redundant few-shot example, every defensive re-verify pass, every over-decomposed sub-prompt is now billed at premium rates, output tokens especially at $50 per million [1]. The scaffolding that was free to leave in place is now the line item you can actually see on the invoice.

This is the bookend to the routing argument I made when xAI shipped a $1 coding model: there the play was to default down to the cheap fast tier for volume work and escalate the hard 10% up. Fable 5 is the top of that escalation, the model you reach for on the genuinely hard task. And precisely because it is the expensive end, the prompt you send it should do one thing only: ask for what the model cannot do on its own. Anything else is premium spend on a workaround you no longer need.

The rule is not "stronger model, longer prompt" and it is not "stronger model, shorter prompt." It is "stronger model, a prompt that does only what the model cannot do for itself."

What I cut, and the one I had to put back

I ran my own library through this lens the day Fable 5 landed. The easy wins were the OCR pre-pass on a document-analysis prompt and a four-example few-shot block on a classification prompt; both came out, the output held, the call got cheaper and shorter. The defensive re-verify loop on a summarization prompt went too, and that was fine.

The one I got wrong was a long multi-step research prompt where I had been threading state through a hand-built scaffold of intermediate files. My first instinct was to rip the whole thing out and let Fable carry the task end to end, since it is supposed to hold long horizons. It did hold the reasoning fine. What degraded was continuity across the run: without the explicit file-based state, it re-derived context it had already established and drifted on the details. I put the memory scaffold back and it was better than either version before. That is the compensation-versus-contract line drawn in production: the decomposition was compensation and deserved to die, the state was a contract and deserved to stay.

If you take one operational step from this, make it that audit. Keep one canonical version of each prompt, cut the scaffolding you suspect was compensation, and re-test against the same inputs before and after, so the decision is evidence and not vibes. A capability jump is exactly the moment your prompts accumulate dead weight, because the thing they were working around just stopped being a problem.

The signal

Fable 5 is a genuine step up, and the temptation with every step up is to celebrate the model and leave the prompts alone. The opposite is the move. A more capable model is a reason to audit your prompts down, not up, because the workarounds you wrote for last year's weaknesses are now overhead, and on a model this expensive they are overhead you pay for by the token. Cut the compensation, keep the contracts, and send the premium model only the work that is actually premium.

Keep My Prompts lets you keep one canonical version of each prompt, score it on six quality criteria, and compare the same prompt before and after you cut its scaffolding on your own inputs, so a model upgrade becomes a prompt audit instead of a guess. Free to start, no credit card required.

References

[1] Claude Fable 5 and Claude Mythos 5, Anthropic, June 9, 2026. https://www.anthropic.com/news/claude-fable-5-mythos-5

[2] Initial impressions of Claude Fable 5, Simon Willison, June 9, 2026. https://simonwillison.net/2026/Jun/9/claude-fable-5/

[3] Anthropic released Claude Fable 5, its most powerful model publicly, days after warning AI is getting too dangerous, TechCrunch, June 9, 2026. https://techcrunch.com/2026/06/09/anthropic-released-claude-fable-5-its-most-powerful-model-publicly-days-after-warning-ai-is-getting-too-dangerous/

#claude-fable-5#anthropic#prompt-scaffolding#prompt-engineering#model-upgrade#llm-cost#mythos#swe-bench#solo-dev#2026

Ready to organize your prompts?

Start free, no credit card required.

Start Free

No credit card required

Prompt Engineering

You Don't Need a Prompt Eval Harness Yet. Score First.

The advice says "set up evals." For most solo devs it is premature. Prompt QA has three layers, and you need the cheapest first. A three-question test for which one.

Read article →

Prompt Engineering

From 2 to 4: How We Fix a Low-Scoring Prompt One Criterion at a Time

When a prompt scores low, don't rewrite it. Fix the lowest load-bearing criterion, re-score, repeat. We take a real prompt from 2.1 to 4.0 in four passes.

Read article →

Prompt Engineering

Score Any Prompt in 3 Minutes: The 6-Criteria Rubric We Run Before Hitting Send

You judge prompts by their output. That's the bug. Here's the 3-minute, six-criteria rubric we run before hitting send, with a worked example and when to automate it.

Read article →

Two reasons scaffolding exists, only one is now obsolete

Five scaffolds to cut

1. Manual decomposition into a chain of micro-prompts

2. OCR and pre-parsing passes on documents and images

3. Defensive self-verify loops

4. Redundant few-shot examples

5. Inflated defensive role-priming

The one to keep: state and contract scaffolding

Why the cut matters more at $50 a million

What I cut, and the one I had to put back

The signal

References

Ready to organize your prompts?

Related articles

You Don't Need a Prompt Eval Harness Yet. Score First.

From 2 to 4: How We Fix a Low-Scoring Prompt One Criterion at a Time

Score Any Prompt in 3 Minutes: The 6-Criteria Rubric We Run Before Hitting Send