The difference between a $100/day agent and a $10/day agent is often just prompt caching.
- Market Wisdom
Glossary
Before we dive deep, let’s understand the basic terms:
Prefix
The prefix is the initial part of your prompt - everything that comes before the user’s latest message.
Think of it like an address: “USA, California, San Francisco” - each level is a prefix of what follows.
In a prompt, the prefix includes the system prompt, tool definitions, and earlier conversation history - everything that stays the same from one request to the next.
Hash
A hash is a mathematical function that takes text of any length and returns a short, unique “fingerprint”.
"Hello World" → hash → "abc123"
"Hello World!" → hash → "xyz789" // One exclamation mark changed everything!The hash serves as an identifier key: if two prompts produce the same hash, they’re identical.
Cache Write vs Cache Read
Cache Write happens the first time you send a particular prompt - costs 1.25x but it’s a one-time investment.
Cache Read happens when you send a prompt with an identical prefix - costs only 0.1x, a tenth of the price!
The Problem
Every API call to Claude does the same thing: sends tokens, waits for response. But what happens inside isn’t trivial.
User sends: "What's the weather in Tel Aviv?"
API receives:
┌─────────────────────────────────────────────┐
│ System: "You are a helpful assistant..." │ 2,000 tokens
│ Tools: [get_weather, search, ...] │ 5,000 tokens
│ History: [previous 10 turns] │ 8,000 tokens
│ User: "What's the weather in Tel Aviv?" │ 10 tokens
└─────────────────────────────────────────────┘
Total: 15,010 tokens
The model needs to process all 15,010 tokens before it can start generating a response.
Without caching vs With caching
Without: Process 15,000+ tokens again.
With: Process only the new tokens!
Just like you don’t need to explain who you are every time you call a good friend - your agent doesn’t need to “read” the system prompt and tools from scratch on every request.
What is KV Cache?
To understand prompt caching, you need to understand what happens inside a Transformer.
Prefill - The Expensive Part
When the prompt enters, each token passes through every layer of the model. At each layer, two vectors are computed: Key and Value.
Key = “What do I represent?”
Value = “What’s my content?”
This is the expensive part. For a 100K token prompt and a model with 80 layers, you compute 16 million vectors.
Generation - The Cheap Part
After prefill, the model generates one token at a time. Each new token requires only one KV computation (its own), then attention over all existing KVs.
The Caching Insight
The key insight: KV pairs for a constant prefix don’t change between calls.
Call 1:
[System + Tools] → compute KV pairs → STORE IN CACHE
[History + User] → compute KV pairs → generate response
Call 2:
[System + Tools] → LOAD FROM CACHE (skip computation!)
[New History + User] → compute KV pairs → generate response
Prompt caching = storing the KV cache of a constant prefix and reusing it.
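To make the idea concrete, here is a toy sketch in Python/NumPy. It is not a real Transformer - `embed`, the projection matrices, and the hidden size are all made up. The only point is that K/V for the prefix depend solely on the prefix, so they can be computed once, stored under a hash of the prefix, and reused:

```python
import hashlib
import numpy as np

D = 4  # toy hidden size

def embed(token: str) -> np.ndarray:
    """Deterministic stand-in for a real embedding: seed a RNG from the token text."""
    seed = int(hashlib.sha256(token.encode()).hexdigest(), 16) % 2**32
    return np.random.default_rng(seed).normal(size=D)

W_K = np.random.default_rng(1).normal(size=(D, D))  # toy Key projection
W_V = np.random.default_rng(2).normal(size=(D, D))  # toy Value projection

kv_cache: dict[str, tuple[np.ndarray, np.ndarray]] = {}

def keys_values(tokens: list[str]) -> tuple[np.ndarray, np.ndarray]:
    X = np.stack([embed(t) for t in tokens])  # (n_tokens, D)
    return X @ W_K, X @ W_V

def prefill(prefix: list[str], suffix: list[str]):
    """Compute K/V for the whole prompt, reusing cached K/V for the prefix."""
    h = hashlib.sha256(" ".join(prefix).encode()).hexdigest()
    if h in kv_cache:
        K_pre, V_pre = kv_cache[h]           # cache read: skip the expensive part
    else:
        K_pre, V_pre = keys_values(prefix)   # cache write: compute once, store
        kv_cache[h] = (K_pre, V_pre)
    K_suf, V_suf = keys_values(suffix)       # only the new tokens get computed
    return np.vstack([K_pre, K_suf]), np.vstack([V_pre, V_suf])
```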
How The Cache Works
Prefix Hashing
The cache works on prefix matching. The key is a hash of all tokens up to the cache point:
"You are helpful" + "Tools: ..." → hash: abc123
"You are helpful" + "Tools: ..." + " extra" → hash: def456 (different!)Any change to the prefix, even one character, creates a different hash = cache miss.
Hierarchy Order
Claude requires a fixed order: tools → system → messages
Any change at a higher level breaks the cache for everything below:
| Change | tools cache | system cache | messages cache |
|---|---|---|---|
| Add tool | broken | broken | broken |
| Change system | valid | broken | broken |
| New message | valid | valid | broken |
In Simple Terms
It’s like a building. If you change the foundation (tools), all floors above collapse. If you only change the top floor (messages), the foundation and middle floors stay intact.
Using the API
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are Beauty Intel - a trend scout...",
            "cache_control": {"type": "ephemeral"}  # ← cache point
        }
    ],
    tools=[
        {
            "name": "search_trends",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"}  # ← cache all tools
        }
    ],
    messages=[...]
)
cache_control: {"type": "ephemeral"} means: “Store everything up to here in cache”.
What Returns in Response
# First call (cache write):
# cache_creation_input_tokens: 7000 ← written to cache
# cache_read_input_tokens: 0
# Second call (cache hit):
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 7000 ← read from cache!
TTL: 5 minutes by default. Every cache hit resets the clock.
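In the Python SDK these numbers live on `response.usage`, so a quick way to verify that your caching actually works is to log them (field names per the current Anthropic SDK; adjust if your version differs):

```python
usage = response.usage
print("regular input tokens:", usage.input_tokens)
print("written to cache:    ", usage.cache_creation_input_tokens)
print("read from cache:     ", usage.cache_read_input_tokens)

# A simple sanity check for "I expected a cache hit here":
if usage.cache_read_input_tokens == 0:
    print("no cache hit - prefix changed, TTL expired, or this is the first call")
```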
Real-World Patterns
Claude Code Incremental Conversation Caching
┌─────────────────────────────────────┐
│ System prompt (~3K tokens)          │ ← cached
│ CLAUDE.md contents                  │ ← cached
│ Tool definitions (50+ tools)        │ ← cached
├─────────────────────────────────────┤
│ Conversation history                │ ← incrementally cached
│ Current user message                │ ← not cached
└─────────────────────────────────────┘
Each turn adds a cache breakpoint at the end of history, so the next turn can read everything before it.
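A minimal sketch of that pattern, assuming each message's `content` is a list of blocks (the helper name and structure are illustrative, not Claude Code's actual implementation):

```python
def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """Keep a single cache breakpoint at the end of the existing history."""
    # drop breakpoints left over from earlier turns
    for message in history:
        for block in message["content"]:
            block.pop("cache_control", None)

    # mark the end of the history we already have as the cache point
    if history:
        history[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}

    # the new message is the only part processed at full price
    return history + [
        {"role": "user", "content": [{"type": "text", "text": new_user_message}]}
    ]
```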
Why This Works
In a conversation with Claude Code, most of the context stays constant. Only your latest message is new. Instead of processing 50K+ tokens each turn, it processes only the new message.
Cursor File Content Caching
Cursor caches file contents for autocomplete:
Request 1: Complete code in file.py
- Cache: file.py content (2000 tokens)
Request 2: Complete code in file.py (user typed more)
- Cache HIT on file.py content
- Only process: new cursor position + recent edits
This is what gives them super-fast autocomplete.
Why This Works
When you’re writing code, the file doesn’t change dramatically between keystrokes. Cursor stores the file content in cache, and each keystroke processes only the small change.
RAG Long Document Caching
The situation: You have a long document (contract, manual, book) and want to ask questions about it.
system = [
    {"type": "text", "text": "You analyze legal documents..."},
    {
        "type": "text",
        "text": entire_contract,  # 100K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
]

# Question 1: "What are the termination clauses?"
# Question 2: "What about liability limits?"
# Question 3: "Summarize payment terms"
# Each question pays 0.1x on the 100K contract tokens

Why This Works
Instead of reprocessing the entire document with every question, you write it to the cache once with a one-hour TTL. Every subsequent question pays only 10% on the document tokens. Over 10 questions that's roughly a 70% saving on the document: one 2x cache write plus nine 0.1x reads, versus ten full-price passes.
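A sketch of the question loop, reusing the `system` block from above (`client` and `entire_contract` are assumed to exist; at the time of writing, the 1-hour TTL may also require an `anthropic-beta` header depending on your API version):

```python
questions = [
    "What are the termination clauses?",
    "What about liability limits?",
    "Summarize payment terms",
]

for q in questions:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,  # identical prefix every call → cache hit after the first
        messages=[{"role": "user", "content": q}],
    )
    print(q, "| read from cache:", response.usage.cache_read_input_tokens)
```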
Token Economics
| Action | Cost |
|---|---|
| Cache write (5min TTL) | 1.25x base |
| Cache write (1hr TTL) | 2.00x base |
| Cache read | 0.10x base |
| Regular input | 1.00x base |
Break-Even Analysis
Assume 10K tokens cached, Sonnet 4.5 ($3/MTok):
Without caching:
Call 1: 10K × $3/MTok = $0.030
Call 2: 10K × $3/MTok = $0.030
Call 3: 10K × $3/MTok = $0.030
Total: $0.090
With caching:
Call 1: 10K × $3.75/MTok = $0.0375 (cache write)
Call 2: 10K × $0.30/MTok = $0.003 (cache read)
Call 3: 10K × $0.30/MTok = $0.003 (cache read)
Total: $0.0435
Savings: 52%
Break-even: the second call. The extra 0.25x you pay on the cache write is more than recovered by the 0.9x you save on the first cache read, so every call from the second onward is pure profit.
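The same arithmetic as a few lines of Python, so you can plug in your own token counts and call volume (the price here is the example's $3/MTok input rate, not a live price list):

```python
BASE = 3.00 / 1_000_000   # $ per input token (Sonnet 4.5 input rate in the example)
WRITE_MULT, READ_MULT = 1.25, 0.10
cached_tokens, calls = 10_000, 3

without = calls * cached_tokens * BASE
with_cache = cached_tokens * BASE * (WRITE_MULT + READ_MULT * (calls - 1))

print(f"without caching: ${without:.4f}")                  # $0.0900
print(f"with caching:    ${with_cache:.4f}")               # $0.0435
print(f"savings:         {1 - with_cache / without:.0%}")  # 52%
```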
Latency Savings
The prefill phase is O(n²) on token count. For 100K tokens:
Without caching: 5-15 seconds TTFT (time to first token)
With caching: under 1 second TTFT
This often matters more than the cost: there is a world of UX difference between an agent that responds immediately and one that keeps you waiting 10 seconds.
Common Mistakes
Below Minimum Threshold
Caching requires a minimum prompt length of 1,024-4,096 tokens, depending on the model. Shorter prompts are processed normally but never cached, even if you set cache_control.
Changes That Break Cache
Changing tool definitions, adding images, changing extended thinking settings - all break cache.
Concurrent Requests
A cache entry only becomes available after the first response has started. Fire 10 identical requests in parallel and you pay for 10 cache writes instead of 1 write and 9 reads. Warm the cache with a single request first, then fan out - see the sketch below.
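A minimal sketch of that mitigation, assuming you already have an async `send` coroutine that issues a request carrying the shared prefix:

```python
import asyncio
from typing import Awaitable, Callable

async def run_batch(prompts: list[str],
                    send: Callable[[str], Awaitable[object]]) -> list[object]:
    # first request writes the cache (1.25x on the shared prefix)...
    first = await send(prompts[0])
    # ...the rest run in parallel and read it (0.1x each)
    rest = await asyncio.gather(*(send(p) for p in prompts[1:]))
    return [first, *rest]
```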
JSON Key Ordering
In languages like Go or Swift, map/dictionary key order isn't deterministic, so the same payload can serialize to different JSON on every request. Different bytes mean a different prefix, which means a cache miss - always serialize with a stable key order.
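If you build any part of the prompt by serializing objects yourself, pin the key order. In Python, for example (the payload here is just an illustration):

```python
import json

payload = {"name": "search_trends", "description": "...", "input_schema": {}}

# sort_keys guarantees the same bytes - and therefore the same prefix -
# on every request, regardless of insertion order
stable = json.dumps(payload, sort_keys=True, separators=(",", ":"))
```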