Prompt Caching: The 30% You're Leaving on the Table

Most AI developers are leaving 20–35% of their API bill on the table. Not because of bad prompts or wrong model choices. Because they’ve never heard of prompt caching, or assume it’s handled automatically.

It isn’t. At least not everywhere.

What Prompt Caching Actually Is

When you make repeated API calls with a shared prefix (a long system prompt, a set of tools, injected documents), the provider hashes that prefix, stores the KV cache on GPU, and skips recomputing it on subsequent calls. Those tokens become “cached input tokens” instead of regular input tokens. They’re cheaper.

The savings apply only to input tokens. Output tokens are always computed fresh. So the real bill reduction is 20–35%, not 90%. The 90% figure shows up in benchmarks that measure cached input tokens only, which are a fraction of total cost.

Whether caching fires depends entirely on who implements it and how.

Five Providers, Three Different Models

Anthropic

Anthropic gives you two options. The newer one: add a single top-level cache_control field and let the system manage breakpoints automatically. The older one: place cache_control on individual content blocks for fine-grained control. The automatic approach works for most use cases.

Nothing is cached if you don’t opt in.

Pricing:

OperationMultiplier
5-min cache write1.25x base input
1-hour cache write2x base input
Cache read0.1x base input (90% off)

Concrete numbers for Claude Sonnet: $3.75/MTok to write a 5-min cache, $0.30/MTok to read from it. Break-even is 1.4 reads per write.

TTL is 5 minutes or 1 hour. The 1-hour option costs more to write (2x instead of 1.25x) but makes sense for longer-running sessions.

OpenAI

No setup required. OpenAI caches automatically on all eligible requests. Cache writes are free. Reads are discounted based on the model:

Model familyCache discount
GPT-590% off
GPT-4.175% off
GPT-4o / o-series50% off

Minimum 1,024 tokens, with cache hits in 128-token increments. TTL is 5–10 minutes of inactivity, up to an hour during off-peak periods.

The only thing you give up: visibility. You can’t control what gets cached or reliably inspect hit rates.

Google Gemini

Google ships two separate systems and they work differently.

Implicit caching is automatic on Gemini 2.5+ models. No code changes. Hits get a 75% discount via the Gemini API, 90% on Vertex AI. No storage costs. No guarantee of a hit.

Explicit context caching requires creating a CachedContent object, getting a cache ID back, and passing it in subsequent requests. You pay $1.00/MTok/hour for storage on Gemini 2.5 models (more for larger models), but you’re guaranteed savings when the cache is hit.

The practical choice: implicit for general use, explicit when you have a large, stable prefix (multi-thousand-token system prompts, large document sets) and enough request volume to justify the storage cost.

Minimum tokens range from 1,024 (Gemini 2.5 Flash) to 4,096 (Gemini 2.5 Pro).

AWS Bedrock

Bedrock supports prompt caching for Claude and Amazon Nova models, using the same pricing structure as Anthropic direct. Cache reads are 90% off. Writes are 1.25x (5-min) or 2x (1-hour).

Nova models get automatic caching with no manual config required. Claude on Bedrock uses the same cache_control approach as the direct API.

One caveat: Anthropic’s automatic caching only routes through the direct Anthropic provider on Bedrock. If you’re using Bedrock or Vertex AI endpoints, you need explicit breakpoints.

OpenRouter

OpenRouter implements provider sticky routing. After a request that uses prompt caching, it routes subsequent requests for the same model to the same provider (tracked by hashing opening messages per account per model), keeping caches warm across requests.

Caching works through OpenRouter for OpenAI, Anthropic, Gemini 2.5, DeepSeek, Grok, and a few others. Pricing passes through from the underlying provider.

The limitation to know: Anthropic automatic caching only works when OpenRouter routes to the direct Anthropic provider. If it routes through Bedrock or Vertex AI, automatic caching won’t fire. For Anthropic caching through OpenRouter to work reliably, use explicit cache_control at the top level.

Tools vs. Building on the API

There’s an important distinction here. Claude Code (the coding assistant) is not the same as the Anthropic API. Claude Code handles caching automatically — it caches your system prompt, tools, and CLAUDE.md, and manages breakpoints for you. That’s the tool doing the work.

If you’re building on the Anthropic API directly — writing an agent, a custom assistant, a product like OpenClaw — you’re responsible for implementing cache_control yourself. The API doesn’t add it for you. Most teams don’t, which is why most custom Anthropic-based agents are overpaying.

Among the coding tools specifically, only four get Anthropic caching right:

  1. Claude Code: Automatic. Caches system prompt + tools + CLAUDE.md. Context compaction at ~65% invalidates the message cache.
  2. aider: Opt-in via --cache-prompts flag
  3. Cline: Implemented
  4. Continue.dev: Implemented

Every other tool either relies on OpenAI’s automatic caching or silently leaves Anthropic cache tokens unclaimed.

Provider Comparison

ProviderSetupWrite costRead discountMin tokensTTL
Anthropiccache_control or automatic+25% (5-min), +100% (1-hr)90% off1,0245 min / 1 hr
OpenAINoneFree50–90% (model-dependent)1,0245–10 min
Gemini (implicit)NoneNone75% off (API), 90% (Vertex)1,024–4,096N/A (best-effort)
Gemini (explicit)ManualStorage fee ($1/MTok/hr)75–90% off1,024–4,0961 hr default
AWS Bedrock (Claude)cache_control+25% / +100%90% off1,024–4,0965 min / 1 hr
OpenRouterProvider-dependentProvider pass-throughProvider pass-through1,024+Provider-dependent

The Quick Math

System prompt at 10K tokens, called 100 times/day on Claude Sonnet:

  • Without caching: 100 × 10K × $3/M = $3.00/day
  • With caching (after first call): 1 write + 99 reads = (10K × $3.75/M) + (99 × 10K × $0.30/M) = $0.34/day

That’s an 89% reduction on the cached portion. If system prompt tokens are 30% of your total bill, the real-world impact is around 25% total cost reduction.

One config change. One time.


See also: Why Bigger Context Windows Don’t Make Better AI, the related problem of what to put in that system prompt in the first place.

← All posts