If you’ve ever put an AI agent into production, you’ve probably seen this pattern: the agent works, the flow is stable, CI is green… and then the bill shows up and it doesn’t match your intuition. Often, it’s not because the model is “thinking too long.” It’s because you’re paying repeatedly for the same prompt.
The root cause is simple and very familiar to anyone running APIs at scale: most LLM APIs are stateless. They don’t remember prior turns. So your client (or harness) keeps resending the same static payload every request — system prompt, tool definitions, guardrails, instruction blocks — even when none of it changes.
In short conversations, it’s noise. In agentic workflows with 30–100 turns, tools, retries, and validations, it becomes a quiet budget killer.
Anthropic’s answer is prompt caching, and the big shift is that it can now be used in a largely automatic way: cache the static prefix once, reuse it cheaply across turns, and reduce both cost and latency — without rewriting your agent architecture from scratch.
Why agents bleed money: stateless APIs + long system prompts
Agent frameworks tend to accumulate prompt weight:
- a large system prompt (“how to think,” policies, formatting, constraints)
- tool schemas (often thousands of tokens)
- reusable instructions (company rules, playbooks, template outputs)
If your agent runs 50 turns and your static preamble is 10,000 tokens, you effectively paid for 500,000 input tokens of the same content. You’re not paying for new reasoning — you’re paying for re-reading.
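The arithmetic above is worth making concrete. A minimal back-of-envelope sketch, assuming illustrative numbers (a 10,000-token preamble, 50 turns, and a hypothetical base input price of $3 per million tokens):

```python
# Back-of-envelope cost of re-sending a static prefix every turn.
# All numbers are illustrative assumptions, not Anthropic pricing.
PREFIX_TOKENS = 10_000   # static preamble size
TURNS = 50               # agent turns in one workflow
PRICE_PER_MTOK = 3.00    # assumed USD per million input tokens

tokens_resent = PREFIX_TOKENS * TURNS
cost_uncached = tokens_resent / 1_000_000 * PRICE_PER_MTOK

print(tokens_resent)             # 500000 tokens of identical content
print(round(cost_uncached, 2))   # 1.5 (USD) spent purely on re-reading
```

Half a million tokens, and none of them bought new reasoning.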
This is the “hidden tax” many teams only discover after they scale.
What prompt caching is (and what it isn’t)
Prompt caching is not memory. It doesn’t “remember” the conversation in a human sense. It’s more like an optimization layer:
- Anthropic caches an identical prefix of your prompt (tools + system + early messages).
- If that prefix matches in a later request, Claude reuses the cached computation instead of processing it again.
- You still send the same prompt content, but the platform can charge it differently and compute it faster.
The outcome is practical: lower cost for repeated prompt parts and improved time-to-first-token.
The big operational win: automatic caching as your context grows
Anthropic recommends an “automatic” approach where caching applies to the latest cacheable prefix and moves forward with your conversation. Conceptually:
- First request: cache the static prefix (tools/system/instructions).
- Next request: reuse cached prefix; only compute the new user/tool/result chunks.
- As the agent adds more stable content, the caching breakpoint shifts forward.
This matters for agents because it aligns with how agents behave: stable instructions up front, dynamic events at the end.
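A toy illustration of the hit/miss behavior (this is not Anthropic's implementation, just a sketch of the matching rule: only a byte-identical prefix yields a hit):

```python
import hashlib

# Toy prefix cache: keyed by a hash of the stable prefix.
cache: dict[str, bool] = {}

def process(prefix: str, dynamic: str) -> str:
    """Return 'hit' if this exact prefix was seen before, else 'miss'."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in cache:
        return "hit"     # reuse cached computation for the prefix
    cache[key] = True     # first request: write the prefix to cache
    return "miss"

print(process("SYSTEM+TOOLS", "turn 1"))     # miss — cache write
print(process("SYSTEM+TOOLS", "turn 2"))     # hit — prefix unchanged
print(process("SYSTEM+TOOLS v2", "turn 3"))  # miss — prefix mutated
```

The third call is the failure mode to watch: any mutation of the "stable" region, however small, resets you to a miss.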
Pricing mechanics and TTL: what actually changes on the bill
Anthropic separates prompt cost into three classes: cache writes, cache reads, and regular (uncached) input.
Key multipliers (relative to base input pricing):
- Cache write (5 minutes): 1.25×
- Cache write (1 hour): 2×
- Cache read (hits/refreshes): 0.1×
Default cache lifetime is 5 minutes, and it refreshes when used. If your agent has long pauses (human approvals, long-running jobs, async workflows), you can opt into a 1-hour TTL at higher cache-write cost.
From a platform perspective: you pay a bit more once to cache, then pay dramatically less to reuse.
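Plugging the multipliers above into the earlier 50-turn example shows how quickly the write cost amortizes (token-cost units relative to base input price; the 10,000-token prefix is the same illustrative assumption as before):

```python
# Relative cost of the static prefix over a 50-turn agent run,
# using the multipliers above: 1x plain input, 1.25x cache write
# (5-minute TTL), 0.1x cache read.
PREFIX = 10_000
TURNS = 50

uncached = PREFIX * 1.0 * TURNS                       # re-read every turn
cached = PREFIX * 1.25 + PREFIX * 0.1 * (TURNS - 1)   # one write, then reads

print(uncached)  # 500000.0 token-cost units
print(cached)    # 61500.0 — roughly 8x cheaper on the static prefix
```

The break-even point arrives fast: the 1.25× write premium is recovered on the very first 0.1× read.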
Why sysadmins should care: cost + rate-limit behavior + observability
For platform and SRE teams, prompt caching isn’t just about saving money. It also reduces pressure on your system:
- Less compute per request → improved latency in high-volume flows
- More predictable spend → fewer “surprise invoices”
- Better control over prompt drift → if your cache hits drop, something changed in your harness
Anthropic also exposes usage counters that let you build dashboards and alerts, typically including fields like:
- cache_read_input_tokens (how much you saved)
- cache_creation_input_tokens (what you wrote to cache)
- input_tokens and output_tokens (total payload behavior)
A very common operational alert: cache_read_input_tokens suddenly goes to 0. That usually means your prefix is no longer identical — tools changed, system prompt mutated, or you accidentally injected dynamic data “above” the cached region.
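A minimal alert check might look like the sketch below, assuming you collect those usage counters per request (the 5,000-token threshold is an arbitrary assumption; tune it to your prompt sizes):

```python
# Minimal cache-health check over per-request usage counters.
# Field names follow the usage fields above; thresholds are illustrative.
def cache_alert(usage: dict) -> bool:
    """Flag requests whose prompt should hit the cache but doesn't."""
    reads = usage.get("cache_read_input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    # No reads AND no writes on a sizeable prompt usually means the
    # prefix silently changed and caching stopped applying at all.
    return reads == 0 and writes == 0 and usage.get("input_tokens", 0) > 5_000

print(cache_alert({"input_tokens": 12_000,
                   "cache_read_input_tokens": 0,
                   "cache_creation_input_tokens": 0}))   # True — investigate
print(cache_alert({"input_tokens": 12_000,
                   "cache_read_input_tokens": 10_000}))  # False — healthy
```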
How to design prompts so caching actually works
Caching only helps if the prefix stays stable. Three rules cover most real-world cases:
- Put static content first: system instructions, tool definitions, stable examples, policies, formatting rules.
- Push dynamic content to the end: user input, tool outputs, timestamps, IDs, request-specific context, logs.
- Don’t churn tool schemas: if your tools list changes every turn, the prefix changes every turn.
A pragmatic approach is to version your static blocks:
tools_v12, policy_v3, format_v7
When you update them, you expect cache invalidation. That’s fine — just don’t do it accidentally every request.
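One way to enforce this is to assemble the static prefix from versioned blocks, so a cache miss is always a deliberate version bump rather than an accident (block names follow the convention above; contents are placeholders):

```python
# Assemble the stable prefix from versioned blocks in a fixed order.
# Any content change means bumping a version string on purpose.
STATIC_BLOCKS = {
    "tools":  ("tools_v12", "...tool schemas..."),
    "policy": ("policy_v3", "...company rules..."),
    "format": ("format_v7", "...output templates..."),
}

def build_prefix() -> str:
    # Deterministic order matters: reordering blocks would also
    # invalidate the cached prefix.
    return "\n".join(f"[{ver}]\n{body}" for ver, body in STATIC_BLOCKS.values())

p1 = build_prefix()
p2 = build_prefix()
print(p1 == p2)  # True — byte-identical across requests, so cache hits hold
```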
Quick example: enabling caching (conceptual)
Anthropic’s docs describe enabling caching by attaching a cache_control marker to the last stable content block — here, the system prompt. A simplified example:
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Stable policies, formats, tool usage rules...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Dynamic input here..." }
  ]
}
For longer TTL, you’d use the 1-hour option (at the higher cache-write multiplier).
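Anthropic’s API expresses the longer lifetime via a ttl field on the cache_control block. Sketched here as a Python request dict (same simplified shape; contents are placeholders):

```python
# Request body sketch for the 1-hour cache TTL. The "ttl" field on
# cache_control selects the longer lifetime, at the 2x write multiplier.
request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "Stable policies, formats, tool usage rules...",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Dynamic input here..."}],
}

print(request["system"][0]["cache_control"]["ttl"])  # 1h
```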
This isn’t just Anthropic: caching is becoming table-stakes
Prompt caching is quickly becoming standard infrastructure for agentic systems. OpenAI documents prompt caching on repeated prefixes, and AWS Bedrock offers caching options for compatible models. The industry trend is clear: as agents become normal, re-reading the same context becomes the new enemy, and caching becomes the baseline mitigation.
FAQs
Does prompt caching give Claude “memory”?
No. It doesn’t store conversational memory. It reuses computation for identical prompt prefixes to reduce cost and latency.
What breaks cache hits most often?
Changing tool definitions, reordering the prompt, or injecting dynamic data into the early (cached) portion of the prompt.
When should I use a 1-hour cache TTL instead of 5 minutes?
When your agent has long pauses between turns (human-in-the-loop approvals, background jobs, slow tool runs) and you still want reuse.
What metric should I monitor to confirm caching works?
Track cached-read tokens (e.g., cache_read_input_tokens). If it drops to zero unexpectedly, your stable prefix likely changed.
