
Anthropic Quietly Cut Claude's Cache TTL: The 2026 Fix to Reclaim 30% of Your API Bill

Published on May 7, 2026 · 11 min read

A small team running a Claude-powered nightly summarizer noticed something strange in late March 2026. Same cron job, same prompts, same input volume. The monthly bill went from $180 to $290. No code changes. No traffic spike. Just a quiet 60% increase that did not show up in any provider changelog.

They are not alone. Between March and April 2026, dozens of developers on GitHub, Reddit, and Hacker News reported the same pattern: Claude API costs creeping up, Pro subscribers hitting Claude Code's 5-hour quota for the first time ever, and no clear explanation from Anthropic.

The cause turned out to be a single silent infrastructure change. The default TTL on Anthropic's prompt cache was reduced from 1 hour to 5 minutes around March 6-7, 2026. There was no blog post, no deprecation notice, no API version bump [1]. For workloads that relied on long-lived caches, the economics broke overnight.

This guide walks through what actually happened, why small teams felt it the hardest, and five concrete fixes you can ship Monday morning to reclaim the budget you lost.

What Actually Changed

The official Anthropic prompt caching docs describe two cache durations: 5 minutes (1.25× the base input token price for writes) and 1 hour (2× for writes). Cache reads cost 0.1× regardless of TTL [3]. Until early March 2026, the platform-wide default was 1 hour. After March 6-7, the default reverted to 5 minutes [1].

A developer named Sean Swanson published the most rigorous reconstruction of the timeline by parsing 119,866 Claude Code API calls across two machines from January 11 through April 11, 2026 [1]. The pattern is unambiguous:

January 11 to January 31: 5-minute TTL (predates 1-hour tier general availability)
February 1 to March 5: 1-hour TTL on every call, 33 consecutive days, both machines
March 6 to March 7: transition window, 5-minute tokens reappear
March 8 onward: 5-minute TTL dominant, 1-hour usage drops to a minority

The cost impact in that dataset, comparing actual spend to a counterfactual where 1-hour TTL stayed default:

Month	% waste	Sonnet-4-6 overpaid	Opus-4-6 overpaid
Jan 2026	52.5%	$41.45	$69.08
Feb 2026	1.1%	$12.32	$20.53
Mar 2026	25.9%	$719.09	$1,198.49
Apr 2026	14.8%	$176.23	$293.71
Total	17.1%	$949.08	$1,581.80

The percentage waste is identical across models because the cache multipliers (1.25× or 2× for writes, 0.1× for reads) are the same for every model, so the wasted share depends only on token volume and not on the per-token price. The dollar columns differ because Opus is more expensive per token.

The "1.1%" in February is the tell: when 1-hour TTL was active, cache read economics worked as designed. When it flipped to 5 minutes, the same workload paid 25% more for nothing.

Anthropic's only public response came through Jarred Sumner, an engineer at the company, in coverage by The Register [2]. Sumner reframed the change as beneficial for "one-shot calls where the cached context is used once and not revisited." The statement did not address the documented quota exhaustion or cost inflation. The GitHub issue was closed as "not planned" [1].

Why Small Teams Hit This Hardest

To see why a 5-minute TTL hurts more than it sounds like it should, you have to understand prompt cache economics.


A cache write costs 1.25× the base input token rate at 5-minute TTL, or 2× at 1-hour TTL. A cache read costs 0.1× regardless of TTL. The arbitrage is the gap between writes and reads: a 5-minute write costs 12.5× as much as a read, and a 1-hour write costs 20× as much. Caching is profitable when a cached prefix is read many times before the TTL expires. A long system prompt that is read 50 times per hour pays for itself almost immediately. The same prompt read once every 30 minutes on a 5-minute TTL never gets a cache hit, so every call pays the write surcharge for nothing.
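The multipliers above are enough to estimate when caching beats sending the prefix uncached. The following is a minimal sketch under those multipliers only; the function names are made up for illustration, the base price is a parameter you supply, and the model ignores output tokens and any uncached part of the request.

```python
# Multipliers quoted in the text; everything else is an illustrative assumption.
WRITE_MULT = {"5m": 1.25, "1h": 2.0}
READ_MULT = 0.1


def cached_cost(prefix_tokens, reads, ttl, base_per_mtok):
    """Dollar cost of one cache write followed by `reads` cache hits."""
    per_token = base_per_mtok / 1_000_000
    write = prefix_tokens * per_token * WRITE_MULT[ttl]
    hits = reads * prefix_tokens * per_token * READ_MULT
    return write + hits


def uncached_cost(prefix_tokens, calls, base_per_mtok):
    """Dollar cost of sending the same prefix uncached on every call."""
    return calls * prefix_tokens * base_per_mtok / 1_000_000


def break_even_reads(ttl):
    """Smallest number of cache hits at which one write plus the hits
    costs less than sending the prefix uncached (hits + 1) times."""
    n = 0
    while WRITE_MULT[ttl] + n * READ_MULT >= n + 1:
        n += 1
    return n
```

Under this model a 5-minute write is recovered by the first hit and a 1-hour write by the second, which is why the question is never whether reads are cheap but whether they arrive before the entry expires.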

Small-team workloads tend to look exactly like the second case:

A nightly cron that summarizes the day's tickets: one call per night, 24-hour gap between cache write and the next time the same context shows up.
A side project that picks up traffic in 10-minute bursts when its creator is debugging: 5-10 reads inside the burst, then nothing for hours.
A Slack bot that handles a few messages per day: each call is effectively cold.
An agent loop that takes 8-15 minutes between human inputs: writes the cache, walks past the 5-minute TTL, writes it again on the next turn.

Anthropic's reframing assumes a workload where one-shot calls dominate, so the higher 2× write surcharge of the 1-hour tier would be wasted. That is true for some teams. For small teams running long-context Claude workflows on irregular cadences, the new default forces them to pay the write surcharge on every call, with no read amortization to offset it.

The Pro subscription quota hit added insult. Claude Code Pro plans count cache creation tokens against the 5-hour limit at full rate; cache reads barely register. With 1-hour TTL, a developer's first prompt of the morning paid the write cost once and rode reads for the rest of the session. With 5-minute TTL, every coffee break invalidated the cache and the next message paid the write cost again. Anecdotal reports of "I hit my Pro quota in five hours and I never used to" trace back to exactly this dynamic [2].

Five Fixes You Can Ship Monday

These are ordered by effort. The first three are mechanical. The last two require restructuring how your code talks to the Claude API.

Fix 1: Set the 1-Hour TTL Explicitly

The 1-hour cache tier is still available. It is just no longer the default. If your workload has any kind of session shape longer than 5 minutes, override the default in every API call:

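import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Request the 1-hour tier explicitly instead of relying on the default.
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }
    ],
    messages=[...]  # your conversation turns go here
)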

The 1-hour write surcharge (2× vs 1.25×) is more expensive per write, but you only pay it once per hour instead of once per 5 minutes. For any workload that rereads a cached prefix more than 4-5 times per hour, the 1-hour tier is mathematically cheaper. We will work the numbers in the decision matrix below.

Fix 2: Restructure Prompts to Fit the 5-Minute Window

If your traffic is genuinely bursty and you do not want to pay the 2× write surcharge, you can restructure the work to fit inside 5 minutes. Two patterns work well:

Burst-batch. Instead of firing one Claude call when each user message arrives, queue messages for up to 60 seconds and send them as a single batch. Five sequential messages amortize the cache write across five reads instead of five rewrites.

Pipeline collapsing. If your agent loop has a "plan, then refine, then execute" three-step pipeline with human approval between steps, fire all three Claude calls before the human review. Cache stays warm. Re-prompt only on rejection.
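The burst-batch pattern comes down to grouping messages by arrival time. Here is a minimal sketch, assuming each message is logged as a `(timestamp_seconds, text)` pair; the function name and the input shape are hypothetical, and a production queue would flush on a timer rather than over a finished list.

```python
def batch_by_window(arrivals, window_s=60.0):
    """Group (timestamp, text) pairs into batches.

    A batch opens at its first message and accepts every message that
    arrives within `window_s` seconds of that opening timestamp.
    """
    batches = []
    current = []
    window_start = None
    for ts, text in sorted(arrivals):
        if window_start is None or ts - window_start > window_s:
            # The previous window is over: close it and open a new one.
            if current:
                batches.append(current)
            current = []
            window_start = ts
        current.append(text)
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then go out as one Claude call, so the cached prefix is touched once per batch rather than once per message.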


Fix 3: Cache-Warming Heartbeat (Only Sometimes)

If a single cached prefix serves multiple users, or a long sparse agent loop pays a lot in fresh writes, you can keep the cache warm with a minimal keepalive every 4 minutes:

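import asyncio
import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def keep_cache_warm(model="claude-haiku-4-5", interval_s=240):
    # Ping with the same model and prefix as real traffic; a cache entry
    # written under one model is not reused by another.
    while True:
        await client.messages.create(
            model=model,
            max_tokens=1,
            system=[{"type": "text", "text": SHARED_PREFIX,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}]
        )
        await asyncio.sleep(interval_s)  # 4 minutes, inside the 5-minute TTL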

Each ping costs the price of one cache read (0.1× input rate) plus 1 output token. The trade-off is against the cost of letting the cache expire and writing it from scratch. As a rough rule, a heartbeat pays off when:

The prompt is large (12k+ tokens) so each fresh write is expensive
Multiple sessions or users share the same cached prefix
You would otherwise pay more in fresh writes than the heartbeat costs in reads

For a single user with sparse traffic, the cache is usually cheaper to let expire. Run the math against your real session pattern before shipping the heartbeat.
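One way to run that math is to compare an hour of pings against the fresh writes they replace. This is a rough sketch, assuming the 0.1× read and 1.25× write multipliers quoted earlier, a heartbeat that runs for the whole hour, and a ping whose own message and output token are small enough to ignore; `heartbeat_net_savings` is a made-up name.

```python
READ_MULT = 0.1        # cache read multiplier quoted in the text
WRITE_MULT_5M = 1.25   # 5-minute cache write multiplier quoted in the text


def heartbeat_net_savings(prefix_tokens, base_per_mtok,
                          cold_starts_per_hour, interval_min=4.0):
    """Dollars saved per hour by a heartbeat (negative means it loses money).

    `cold_starts_per_hour` is how many real calls per hour would otherwise
    arrive after the cache expired and pay for a fresh write.
    """
    per_token = base_per_mtok / 1_000_000
    pings = 60.0 / interval_min
    ping_cost = pings * prefix_tokens * per_token * READ_MULT
    # With a warm cache, each would-be cold start pays a read, not a write.
    avoided = (cold_starts_per_hour * prefix_tokens * per_token
               * (WRITE_MULT_5M - READ_MULT))
    return avoided - ping_cost
```

In this simplified model the prefix size scales both sides equally, so the sign of the result depends on cold starts per hour and the ping interval, while the size sets how many dollars are at stake.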

Fix 4: Audit Your Cache Hit Ratio

You cannot fix what you do not measure. Every Claude API response includes usage.cache_read_input_tokens and usage.cache_creation_input_tokens. Your cache hit ratio is the share of cached-prefix tokens that were read rather than written: read tokens divided by the sum of read and creation tokens. A healthy long-context workload runs at 0.85+. Below 0.5 means most of your cached tokens are paying the write surcharge.

Log both values per request and aggregate weekly. If your hit ratio dropped between February and March 2026, you are looking at the regression in your own data.
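The aggregation can be sketched in a few lines. This assumes each request's usage has been logged as a dict carrying the two field names above, and it reads "hit ratio" as read tokens over read plus creation tokens, an interpretation chosen so that the 0.85 and 0.5 thresholds fall between 0 and 1. The log shape and the function name are assumptions.

```python
def cache_hit_ratio(usage_records):
    """Token-weighted hit ratio: read / (read + creation).

    Returns None when no cached tokens were seen at all.
    """
    read = sum(r.get("cache_read_input_tokens") or 0 for r in usage_records)
    created = sum(r.get("cache_creation_input_tokens") or 0 for r in usage_records)
    total = read + created
    return read / total if total else None


# One write followed by three hits on a 12,000-token prefix.
week = [
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 12_000},
    {"cache_read_input_tokens": 12_000, "cache_creation_input_tokens": 0},
    {"cache_read_input_tokens": 12_000, "cache_creation_input_tokens": 0},
    {"cache_read_input_tokens": 12_000, "cache_creation_input_tokens": 0},
]
print(cache_hit_ratio(week))
```

Computing the same number per week makes a February-to-March drop visible as a single series rather than a pile of request logs.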

Fix 5: Multi-Model Fallback for Volume

For high-volume non-frontier work, the cleanest hedge is to route by task instead of relying on cache economics. DeepSeek V4-Pro, released April 24, 2026, scores 80.6 on SWE-bench (statistically tied with Opus 4.7 at 80.8) at roughly 10× lower per-token cost and offers a 1M context window [4]. For classification, extraction, and templating, pushing 80% of calls to a cheaper model with looser caching requirements removes the entire problem.

This is not a Claude replacement story. Frontier reasoning still belongs on Opus 4.7. But every call that does not need frontier reasoning and runs frequently is a candidate for offload.
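Routing by task can start as a lookup. The sketch below is illustrative only: the task labels come from the paragraph above, `cheap-model-placeholder` is a stand-in rather than a real model identifier, and how a request gets its task label is left to your application.

```python
# Task types the text names as candidates for offload.
OFFLOAD_TASKS = {"classification", "extraction", "templating"}


def pick_model(task_type,
               frontier_model="claude-opus-4-7",
               cheap_model="cheap-model-placeholder"):
    """Send routine task types to the cheaper model, everything else to the frontier."""
    return cheap_model if task_type in OFFLOAD_TASKS else frontier_model
```

Keeping the rule this explicit makes it easy to audit which calls still depend on cache economics.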

Decision Matrix: Pay the 1-Hour Premium or Restructure?

The break-even between 5-minute and 1-hour TTL depends on three variables: prompt size, reuse frequency, and burst shape. Here is the worked math for a typical long-context coding agent.

Setup. A 12,000-token system prompt. Opus 4.7 at $5/MTok input, so a write at 5m TTL is $5 × 1.25 = $6.25/MTok, a write at 1h TTL is $5 × 2 = $10/MTok, and a read is $0.50/MTok at either tier.

Scenario A: 1 cache write + 8 reads inside 5 minutes.

5m TTL: 12k × $6.25/MTok + 8 × 12k × $0.50/MTok = $0.075 + $0.048 = $0.123
1h TTL: 12k × $10/MTok + 8 × 12k × $0.50/MTok = $0.12 + $0.048 = $0.168
Verdict: 5-minute TTL wins. The work fits inside one window.

Scenario B: 8 reads spread evenly over 30 minutes (6 cache windows on 5m TTL, 1 on 1h TTL).

5m TTL: 6 writes × $0.075 + 8 reads × $0.006 = $0.45 + $0.048 = $0.498
1h TTL: 1 write × $0.12 + 8 reads × $0.006 = $0.12 + $0.048 = $0.168
Verdict: 1-hour TTL wins by 3×.

Scenario C: 1 read every 30 minutes for 4 hours (8 reads, 4 hour-aligned writes on 1h TTL).

5m TTL: 8 writes × $0.075 + 8 reads × $0.006 = $0.60 + $0.048 = $0.648
1h TTL: 4 writes × $0.12 + 8 reads × $0.006 = $0.48 + $0.048 = $0.528
Verdict: 1-hour TTL wins by 1.2×.

The shortcut: if your prompt is reused fewer than 5 times in 5 minutes but more than once per hour, switch to the 1-hour TTL. If reuse is denser than that, stay on 5 minutes. If reuse is sparser than once per hour, caching is not your problem and you should be looking at multi-model offload.
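The three scenarios can be reproduced, and rerun with your own numbers, using one small function. This is a sketch under the setup stated above (12,000-token prefix, $5/MTok base input price); the write and read counts are taken from the scenarios as written, which treat each cache window as needing its own write. If a cache entry's lifetime is extended on every hit, fewer writes are needed, so treat the counts as inputs to adjust.

```python
def scenario_cost(writes, reads, ttl, prefix_tokens=12_000, base_per_mtok=5.0):
    """Dollar cost of `writes` cache writes plus `reads` cache reads."""
    per_token = base_per_mtok / 1_000_000
    write_mult = {"5m": 1.25, "1h": 2.0}[ttl]
    read_mult = 0.1
    return (writes * prefix_tokens * per_token * write_mult
            + reads * prefix_tokens * per_token * read_mult)


# (writes on 5m TTL, writes on 1h TTL, reads) for each scenario in the text.
scenarios = {"A": (1, 1, 8), "B": (6, 1, 8), "C": (8, 4, 8)}
for name, (w5, w1, reads) in scenarios.items():
    five = scenario_cost(w5, reads, "5m")
    hour = scenario_cost(w1, reads, "1h")
    print(f"Scenario {name}: 5m ${five:.3f}, 1h ${hour:.3f}")
```

Swapping in write counts measured from your own logs is usually more informative than the evenly spaced patterns used here.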

Calling Anthropic Out (Without Becoming a Flame Piece)

The technical fixes above will get you back to where you were in February. They do not address the underlying issue: silent infrastructure changes are dangerous for any team that does not run constant cost telemetry, and that is most small teams.

Three things Anthropic should ship to make this less likely to repeat:

A public changelog for default behaviors. The 1-hour TTL becoming the default in February 2026 was never announced; the reversion in March was never announced. Both should have been changelog entries with effective dates. Other infrastructure changes (rate limit shifts, model routing tweaks, cache eviction policy) likely happen with similar opacity.

Default to last-known-good. When platform defaults change, the operationally safe pattern is to grandfather existing API keys to the prior default for a deprecation window. AWS, Stripe, and most mature platforms do this. Anthropic does not, at least not yet.

Transparent quota counting. Pro subscribers seeing "you hit your 5-hour limit" with no breakdown of cache writes, reads, and uncached tokens have no way to debug. A usage panel showing where the quota actually goes would let teams adjust before they get cut off mid-session.

None of these are exotic asks. They are table stakes for an API platform that small teams build production workflows on.

What This Means If You Use Keep My Prompts

Whatever Anthropic ships next, the prompts that survive infrastructure changes best are the ones with clean structure. A long, well-organized system prompt with clear sections caches as a single block. A chaotic prompt that mixes system instructions, examples, and user-specific context all in one wall of text fragments the cache and pays the write surcharge on every call.

Score your prompts on structural quality before you ship. If your Score is above 4, your prompt is cache-friendly by accident. If it is below 3, you are leaving cache money on the table whether or not Anthropic changes the TTL again next month.

Run any prompt through Score at keepmyprompts.com to see where it stands. Free plan, 20 prompts, takes about 10 seconds per scoring run.

Sources

[1] Sean Swanson, "Cache TTL silently regressed from 1h to 5m around early March 2026, causing quota and cost inflation," GitHub issue #46829 on anthropics/claude-code, April 2026. https://github.com/anthropics/claude-code/issues/46829

[2] The Register, "Anthropic: Claude quota drain not caused by cache tweaks," April 13, 2026. https://www.theregister.com/2026/04/13/claude_code_cache_confusion/

[3] Anthropic, "Prompt caching" documentation. https://platform.claude.com/docs/en/build-with-claude/prompt-caching

[4] Simon Willison, "DeepSeek V4," April 24, 2026. https://simonwillison.net/2026/Apr/24/deepseek-v4/
