A small team running a Claude-powered nightly summarizer noticed something strange in late March 2026. Same cron job, same prompts, same input volume. Monthly bill went from 180to290. No code changes. No traffic spike. Just a quiet 60% increase that did not show up in any provider changelog.
They are not alone. Between March and April 2026, dozens of developers on GitHub, Reddit, and Hacker News reported the same pattern: Claude API costs creeping up, Pro subscribers hitting Claude Code's 5-hour quota for the first time ever, and no clear explanation from Anthropic.
The cause turned out to be a single silent infrastructure change. The default TTL on Anthropic's prompt cache was reduced from 1 hour to 5 minutes around March 6-7, 2026. There was no blog post, no deprecation notice, no API version bump [1]. For workloads that relied on long-lived caches, the economics broke overnight.
This guide walks through what actually happened, why small teams felt it the hardest, and five concrete fixes you can ship Monday morning to reclaim the budget you lost.
Anthropic cache TTL cut featured
What Actually Changed
The official Anthropic prompt caching docs describe two cache durations: 5 minutes (1.25ร the base input token price for writes) and 1 hour (2ร for writes). Cache reads cost 0.1ร regardless of TTL [3]. Until early March 2026, the platform-wide default was 1 hour. After March 6-7, the default reverted to 5 minutes [1].
A developer named Sean Swanson published the most rigorous reconstruction of the timeline by parsing 119,866 Claude Code API calls across two machines from January 11 through April 11, 2026 [1]. The pattern is unambiguous:
January 11 to January 31: 5-minute TTL (predates 1-hour tier general availability)
February 1 to March 5: 1-hour TTL on every call, 33 consecutive days, both machines
March 6 to March 7: transition window, 5-minute tokens reappear
March 8 onward: 5-minute TTL dominant, 1-hour usage drops to a minority
The cost impact in that dataset, comparing actual spend to a counterfactual where 1-hour TTL stayed default:
Month
% waste
Sonnet-4-6 overpaid
Opus-4-6 overpaid
Jan 2026
52.5%
$41.45
$69.08
Feb 2026
1.1%
$12.32
$20.53
Mar 2026
25.9%
$719.09
$1,198.49
Apr 2026
14.8%
$176.23
$293.71
Total
17.1%
$949.08
$1,581.80
The percentage waste is identical across models because the 5m vs 1h cache surcharge is a flat multiplier on token volume, not on token rate. The dollar columns differ because Opus is more expensive per token.
The "1.1%" in February is the tell: when 1-hour TTL was active, cache read economics worked as designed. When it flipped to 5 minutes, the same workload paid 25% more for nothing.
Anthropic's only public response came through Jarred Sumner, an engineer at the company, in coverage by The Register [2]. Sumner reframed the change as beneficial for "one-shot calls where the cached context is used once and not revisited." The statement did not address the documented quota exhaustion or cost inflation. The GitHub issue was closed as "not planned" [1].
Cache TTL timeline and cost impact
Why Small Teams Hit This Hardest
To see why a 5-minute TTL hurts more than it sounds like it should, you have to understand prompt cache economics.
Want to know how effective your prompts are? Prompt Score analyzes them on 6 criteria.
A cache write costs 1.25ร the base input token rate at 5-minute TTL, or 2ร at 1-hour TTL. A cache read costs 0.1ร regardless of TTL. The arbitrage is the 12.5ร delta between writes and reads. Caching is profitable when a cached prefix is read many times before the TTL expires. A long system prompt that is read 50 times per hour pays for itself instantly. Same prompt read once every 30 minutes? Useless.
Small-team workloads tend to look exactly like the second case:
A nightly cron that summarizes the day's tickets: one call per night, 24-hour gap between cache write and the next time the same context shows up.
A side project that picks up traffic in 10-minute bursts when its creator is debugging: 5-10 reads inside the burst, then nothing for hours.
A Slack bot that handles a few messages per day: each call is effectively cold.
An agent loop that takes 8-15 minutes between human inputs: writes the cache, walks past the 5-minute TTL, writes it again on the next turn.
Anthropic's reframing assumes a workload where one-shot calls dominate and the 1.25ร write surcharge is wasted. That is true for some teams. For small teams running long-context Claude workflows on irregular cadences, the new default forces them to pay the write surcharge on every call, with no read amortization to offset it.
The Pro subscription quota hit added insult. Claude Code Pro plans count cache creation tokens against the 5-hour limit at full rate; cache reads barely register. With 1-hour TTL, a developer's first prompt of the morning paid the write cost once and rode reads for the rest of the session. With 5-minute TTL, every coffee break invalidated the cache and the next message paid the write cost again. Anecdotal reports of "I hit my Pro quota in five hours and I never used to" trace back to exactly this dynamic [2].
Five Fixes You Can Ship Monday
These are ordered by effort. The first three are mechanical. The last two require restructuring how your code talks to the Claude API.
Fix 1: Set the 1-Hour TTL Explicitly
The 1-hour cache tier is still available. It is just no longer the default. If your workload has any kind of session shape longer than 5 minutes, override the default in every API call:
The 1-hour write surcharge (2ร vs 1.25ร) is more expensive per write, but you only pay it once per hour instead of once per 5 minutes. For any workload that rereads a cached prefix more than 4-5 times per hour, the 1-hour tier is mathematically cheaper. We will work the numbers in the decision matrix below.
Fix 2: Restructure Prompts to Fit the 5-Minute Window
If your traffic is genuinely bursty and you do not want to pay the 2ร write surcharge, you can restructure the work to fit inside 5 minutes. Two patterns work well:
Burst-batch. Instead of firing one Claude call when each user message arrives, queue messages for up to 60 seconds and send them as a single batch. Five sequential messages amortize the cache write across five reads instead of five rewrites.
Pipeline collapsing. If your agent loop has a "plan, then refine, then execute" three-step pipeline with human approval between steps, fire all three Claude calls before the human review. Cache stays warm. Re-prompt only on rejection.
The techniques you're reading about work. Test your prompts now with Prompt Score and see your score in real time.
If a single cached prefix serves multiple users, or a long sparse agent loop pays a lot in fresh writes, you can keep the cache warm with a minimal keepalive every 4 minutes:
Each ping costs the price of one cache read (0.1ร input rate) plus 1 output token. The trade-off is against the cost of letting the cache expire and writing it from scratch. As a rough rule, a heartbeat pays off when:
The prompt is large (12k+ tokens) so each fresh write is expensive
Multiple sessions or users share the same cached prefix
You would otherwise pay more in fresh writes than the heartbeat costs in reads
For a single user with sparse traffic, the cache is usually cheaper to let expire. Run the math against your real session pattern before shipping the heartbeat.
Fix 4: Audit Your Cache Hit Ratio
You cannot fix what you do not measure. Every Claude API response includes usage.cache_read_input_tokens and usage.cache_creation_input_tokens. The ratio of read to creation, weighted by token count, is your cache hit ratio. A healthy long-context workload runs at 0.85+. Below 0.5 means you are paying the write surcharge on most calls.
Log both values per request and aggregate weekly. If your hit ratio dropped between February and March 2026, you are looking at the regression in your own data.
Fix 5: Multi-Model Fallback for Volume
For high-volume non-frontier work, the cleanest hedge is to route by task instead of relying on cache economics. DeepSeek V4-Pro, released April 24, 2026, scores 80.6 on SWE-bench (statistically tied with Opus 4.7 at 80.8) at roughly 10ร lower per-token cost and offers a 1M context window [4]. For classification, extraction, and templating, pushing 80% of calls to a cheaper model with looser caching requirements removes the entire problem.
This is not a Claude replacement story. Frontier reasoning still belongs on Opus 4.7. But every call that does not need frontier reasoning and runs frequently is a candidate for offload.
Decision Matrix: Pay the 1-Hour Premium or Restructure?
The break-even between 5-minute and 1-hour TTL depends on three variables: prompt size, reuse frequency, and burst shape. Here is the worked math for a typical long-context coding agent.
Setup. A 12,000-token system prompt. Opus 4.7 at 5/MTokinput,sobasewriteat5mTTLis5 ร 1.25 = 6.25/MTok,writeat1hTTLis10/MTok, read is $0.50/MTok at either tier.
The shortcut: if your prompt is reused fewer than 5 times in 5 minutes but more than once per hour, switch to the 1-hour TTL. If reuse is denser than that, stay on 5 minutes. If reuse is sparser than once per hour, caching is not your problem and you should be looking at multi-model offload.
Decision matrix 1h vs 5m TTL
Calling Anthropic Out (Without Becoming a Flame Piece)
The technical fixes above will get you back to where you were in February. They do not address the underlying issue: silent infrastructure changes are dangerous for any team that does not run constant cost telemetry, and that is most small teams.
Three things Anthropic should ship to make this less likely to repeat:
A public changelog for default behaviors. The 1-hour TTL becoming the default in February 2026 was never announced; the reversion in March was never announced. Both should have been changelog entries with effective dates. Other infrastructure changes (rate limit shifts, model routing tweaks, cache eviction policy) likely happen with similar opacity.
Default to last-known-good. When platform defaults change, the operationally safe pattern is to grandfather existing API keys to the prior default for a deprecation window. AWS, Stripe, and most mature platforms do this. Anthropic does not, at least not yet.
Transparent quota counting. Pro subscribers seeing "you hit your 5-hour limit" with no breakdown of cache writes, reads, and uncached tokens have no way to debug. A usage panel showing where the quota actually goes would let teams adjust before they get cut off mid-session.
None of these are exotic asks. They are table stakes for an API platform that small teams build production workflows on.
What This Means If You Use Keep My Prompts
Whatever Anthropic ships next, the prompts that survive infrastructure changes best are the ones with clean structure. A long, well-organized system prompt with clear sections caches as a single block. A chaotic prompt that mixes system instructions, examples, and user-specific context all in one wall of text fragments the cache and pays the write surcharge on every call.
Score your prompts on structural quality before you ship. If your Score is above 4, your prompt is cache-friendly by accident. If it is below 3, you are leaving cache money on the table whether or not Anthropic changes the TTL again next month.
Run any prompt through Score at keepmyprompts.com to see where it stands. Free plan, 20 prompts, takes about 10 seconds per scoring run.
Sources
[1] Sean Swanson, "Cache TTL silently regressed from 1h to 5m around early March 2026, causing quota and cost inflation," GitHub issue #46829 on anthropics/claude-code, April 2026. https://github.com/anthropics/claude-code/issues/46829