The difference between a $100/day agent and a $10/day agent is often just prompt caching.
- Market Wisdom
Glossary
Before we dive deep, let’s understand the basic terms:
Prefix
The prefix is the initial part of your prompt - everything that comes before the user’s latest message.
Think of it like an address: “USA, California, San Francisco” - each level is a prefix of what follows.
In a prompt, the prefix includes the system prompt, tool definitions, and earlier conversation history - everything that stays the same from one request to the next.
Hash
A hash is a mathematical function that takes text of any length and returns a short, unique “fingerprint”.
"Hello World" → hash → "abc123"
"Hello World!" → hash → "xyz789" // One exclamation mark changed everything!The hash serves as an identifier key: if two prompts produce the same hash, they’re identical.
Cache Write vs Cache Read
Cache Write happens the first time you send a particular prompt - costs 1.25x but it’s a one-time investment.
Cache Read happens when you send a prompt with an identical prefix - costs only 0.1x, a tenth of the price!
The Problem
Every API call to Claude does the same thing: sends tokens, waits for response. But what happens inside isn’t trivial.
User sends: "What's the weather in Tel Aviv?"
API receives:
┌─────────────────────────────────────────────┐
│ System: "You are a helpful assistant..." │ 2,000 tokens
│ Tools: [get_weather, search, ...] │ 5,000 tokens
│ History: [previous 10 turns] │ 8,000 tokens
│ User: "What's the weather in Tel Aviv?" │ 10 tokens
└─────────────────────────────────────────────┘
Total: 15,010 tokens
The model needs to process all 15,010 tokens before it can start generating a response.
Without caching vs With caching
Without: Process 15,000+ tokens again.
With: Process only the new tokens!
Just like you don’t need to explain who you are every time you call a good friend - your agent doesn’t need to “read” the system prompt and tools from scratch on every request.
What is KV Cache?
To understand prompt caching, you need to understand what happens inside a Transformer.
Prefill - The Expensive Part
When the prompt enters, each token passes through every layer of the model. At each layer, two vectors are computed: Key and Value.
Key = “What do I represent?”
Value = “What’s my content?”
This is the expensive part. For a 100K token prompt and a model with 80 layers, you compute 16 million vectors.
Generation - The Cheap Part
After prefill, the model generates one token at a time. Each new token requires only one KV computation (its own), then attention over all existing KVs.
The Caching Insight
The key insight: KV pairs for a constant prefix don’t change between calls.
Call 1:
[System + Tools] → compute KV pairs → STORE IN CACHE
[History + User] → compute KV pairs → generate response
Call 2:
[System + Tools] → LOAD FROM CACHE (skip computation!)
[New History + User] → compute KV pairs → generate response
Prompt caching = storing the KV cache of a constant prefix and reusing it.
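To make the idea concrete, here is a toy sketch in Python/NumPy. It is not a real Transformer - `embed`, the projection matrices, and the hidden size are all made up. The only point is that K/V for the prefix depend solely on the prefix, so they can be computed once, stored under a hash of the prefix, and reused:

```python
import hashlib
import numpy as np

D = 4  # toy hidden size

def embed(token: str) -> np.ndarray:
    """Deterministic stand-in for a real embedding: seed a RNG from the token text."""
    seed = int(hashlib.sha256(token.encode()).hexdigest(), 16) % 2**32
    return np.random.default_rng(seed).normal(size=D)

W_K = np.random.default_rng(1).normal(size=(D, D))  # toy Key projection
W_V = np.random.default_rng(2).normal(size=(D, D))  # toy Value projection

kv_cache: dict[str, tuple[np.ndarray, np.ndarray]] = {}

def keys_values(tokens: list[str]) -> tuple[np.ndarray, np.ndarray]:
    X = np.stack([embed(t) for t in tokens])  # (n_tokens, D)
    return X @ W_K, X @ W_V

def prefill(prefix: list[str], suffix: list[str]):
    """Compute K/V for the whole prompt, reusing cached K/V for the prefix."""
    h = hashlib.sha256(" ".join(prefix).encode()).hexdigest()
    if h in kv_cache:
        K_pre, V_pre = kv_cache[h]           # cache read: skip the expensive part
    else:
        K_pre, V_pre = keys_values(prefix)   # cache write: compute once, store
        kv_cache[h] = (K_pre, V_pre)
    K_suf, V_suf = keys_values(suffix)       # only the new tokens get computed
    return np.vstack([K_pre, K_suf]), np.vstack([V_pre, V_suf])
```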
How The Cache Works
Prefix Hashing
The cache works on prefix matching. The key is a hash of all tokens up to the cache point:
"You are helpful" + "Tools: ..." → hash: abc123
"You are helpful" + "Tools: ..." + " extra" → hash: def456 (different!)Any change to the prefix, even one character, creates a different hash = cache miss.
Hierarchy Order
Claude requires a fixed order: tools → system → messages
Any change at a higher level breaks the cache for everything below:
| Change | tools cache | system cache | messages cache |
|---|---|---|---|
| Add tool | broken | broken | broken |
| Change system | valid | broken | broken |
| New message | valid | valid | broken |
In Simple Terms
It’s like a building. If you change the foundation (tools), all floors above collapse. If you only change the top floor (messages), the foundation and middle floors stay intact.
Using the API
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are Beauty Intel - a trend scout...",
            "cache_control": {"type": "ephemeral"}  # ← cache point
        }
    ],
    tools=[
        {
            "name": "search_trends",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"}  # ← cache all tools
        }
    ],
    messages=[...]
)
cache_control: {"type": "ephemeral"} means: “Store everything up to here in cache”.
What Returns in Response
# First call (cache write):
# cache_creation_input_tokens: 7000 ← written to cache
# cache_read_input_tokens: 0
# Second call (cache hit):
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 7000 ← read from cache!
TTL: 5 minutes by default. Every cache hit resets the clock.
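In the Python SDK these numbers live on `response.usage`, so a quick way to verify that your caching actually works is to log them (field names per the current Anthropic SDK; adjust if your version differs):

```python
usage = response.usage
print("regular input tokens:", usage.input_tokens)
print("written to cache:    ", usage.cache_creation_input_tokens)
print("read from cache:     ", usage.cache_read_input_tokens)

# A simple sanity check for "I expected a cache hit here":
if usage.cache_read_input_tokens == 0:
    print("no cache hit - prefix changed, TTL expired, or this is the first call")
```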
Real-World Patterns
Claude Code Incremental Conversation Caching
┌─────────────────────────────────────┐
│ System prompt (~3K tokens)          │ ← cached
│ CLAUDE.md contents                  │ ← cached
│ Tool definitions (50+ tools)        │ ← cached
├─────────────────────────────────────┤
│ Conversation history                │ ← incrementally cached
│ Current user message                │ ← not cached
└─────────────────────────────────────┘
Each turn adds a cache breakpoint at the end of history, so the next turn can read everything before it.
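A minimal sketch of that pattern, assuming each message's `content` is a list of blocks (the helper name and structure are illustrative, not Claude Code's actual implementation):

```python
def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """Keep a single cache breakpoint at the end of the existing history."""
    # drop breakpoints left over from earlier turns
    for message in history:
        for block in message["content"]:
            block.pop("cache_control", None)

    # mark the end of the history we already have as the cache point
    if history:
        history[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}

    # the new message is the only part processed at full price
    return history + [
        {"role": "user", "content": [{"type": "text", "text": new_user_message}]}
    ]
```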
Why This Works
In a conversation with Claude Code, most of the context stays constant. Only your latest message is new. Instead of processing 50K+ tokens each turn, it processes only the new message.
Cursor File Content Caching
Cursor caches file contents for autocomplete:
Request 1: Complete code in file.py
- Cache: file.py content (2000 tokens)
Request 2: Complete code in file.py (user typed more)
- Cache HIT on file.py content
- Only process: new cursor position + recent edits
This is what gives them super-fast autocomplete.
Why This Works
When you’re writing code, the file doesn’t change dramatically between keystrokes. Cursor stores the file content in cache, and each keystroke processes only the small change.
RAG Long Document Caching
The situation: You have a long document (contract, manual, book) and want to ask questions about it.
system = [
    {"type": "text", "text": "You analyze legal documents..."},
    {
        "type": "text",
        "text": entire_contract,  # 100K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
]

# Question 1: "What are the termination clauses?"
# Question 2: "What about liability limits?"
# Question 3: "Summarize payment terms"
# Each question pays 0.1x on the 100K contract tokens

Why This Works
Instead of reprocessing the entire document with every question, you write it to the cache once with a one-hour TTL. Every subsequent question pays only 10% on the document tokens. Over 10 questions that's roughly a 70% saving on the document: one 2x cache write plus nine 0.1x reads, versus ten full-price passes.
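A sketch of the question loop, reusing the `system` block from above (`client` and `entire_contract` are assumed to exist; at the time of writing, the 1-hour TTL may also require an `anthropic-beta` header depending on your API version):

```python
questions = [
    "What are the termination clauses?",
    "What about liability limits?",
    "Summarize payment terms",
]

for q in questions:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system,  # identical prefix every call → cache hit after the first
        messages=[{"role": "user", "content": q}],
    )
    print(q, "| read from cache:", response.usage.cache_read_input_tokens)
```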
Token Economics
| Action | Cost |
|---|---|
| Cache write (5min TTL) | 1.25x base |
| Cache write (1hr TTL) | 2.00x base |
| Cache read | 0.10x base |
| Regular input | 1.00x base |
Break-Even Analysis
Assume 10K tokens cached, Sonnet 4.5 ($3/MTok):
Without caching:
Call 1: 10K × $3/MTok = $0.030
Call 2: 10K × $3/MTok = $0.030
Call 3: 10K × $3/MTok = $0.030
Total: $0.090
With caching:
Call 1: 10K × $3.75/MTok = $0.0375 (cache write)
Call 2: 10K × $0.30/MTok = $0.003 (cache read)
Call 3: 10K × $0.30/MTok = $0.003 (cache read)
Total: $0.0435
Savings: 52%
Break-even: the second call. The extra 0.25x you pay on the cache write is more than recovered by the 0.9x you save on the first cache read, so every call from the second onward is pure profit.
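The same arithmetic as a few lines of Python, so you can plug in your own token counts and call volume (the price here is the example's $3/MTok input rate, not a live price list):

```python
BASE = 3.00 / 1_000_000   # $ per input token (Sonnet 4.5 input rate in the example)
WRITE_MULT, READ_MULT = 1.25, 0.10
cached_tokens, calls = 10_000, 3

without = calls * cached_tokens * BASE
with_cache = cached_tokens * BASE * (WRITE_MULT + READ_MULT * (calls - 1))

print(f"without caching: ${without:.4f}")                  # $0.0900
print(f"with caching:    ${with_cache:.4f}")               # $0.0435
print(f"savings:         {1 - with_cache / without:.0%}")  # 52%
```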
Latency Savings
The prefill phase is O(n²) on token count. For 100K tokens:
Without caching: 5-15 seconds TTFT (time to first token)
With caching: under 1 second TTFT
This often matters more than the cost: there is a world of UX difference between an agent that responds immediately and one that keeps you waiting 10 seconds.
Common Mistakes
Below Minimum Threshold
Caching requires a minimum prompt length of 1,024-4,096 tokens, depending on the model. Shorter prompts are processed normally but never cached, even if you set cache_control.
Changes That Break Cache
Changing tool definitions, adding images, changing extended thinking settings - all break cache.
Concurrent Requests
A cache entry only becomes available after the first response has started. Fire 10 identical requests in parallel and you pay for 10 cache writes instead of 1 write and 9 reads. Warm the cache with a single request first, then fan out - see the sketch below.
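A minimal sketch of that mitigation, assuming you already have an async `send` coroutine that issues a request carrying the shared prefix:

```python
import asyncio
from typing import Awaitable, Callable

async def run_batch(prompts: list[str],
                    send: Callable[[str], Awaitable[object]]) -> list[object]:
    # first request writes the cache (1.25x on the shared prefix)...
    first = await send(prompts[0])
    # ...the rest run in parallel and read it (0.1x each)
    rest = await asyncio.gather(*(send(p) for p in prompts[1:]))
    return [first, *rest]
```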
JSON Key Ordering
In languages like Go or Swift, map/dictionary key order isn't deterministic, so the same payload can serialize to different JSON on every request. Different bytes mean a different prefix, which means a cache miss - always serialize with a stable key order.
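If you build any part of the prompt by serializing objects yourself, pin the key order. In Python, for example (the payload here is just an illustration):

```python
import json

payload = {"name": "search_trends", "description": "...", "input_schema": {}}

# sort_keys guarantees the same bytes - and therefore the same prefix -
# on every request, regardless of insertion order
stable = json.dumps(payload, sort_keys=True, separators=(",", ":"))
```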