Prompt Caching
Reduce LLM costs by up to 90% by caching repeated context in prompts
Imagine you're reading a really long book to answer different questions. Prompt caching is like putting a bookmark in the book! Instead of reading the whole book from the beginning each time, you remember where you were and just read the new question.
When you send the same instructions or context to AI multiple times, caching remembers it so you don't have to pay to send it again. It's like reusing homework instead of redoing it every time!
Example: First time: "Here are our 100-page docs... [question]" costs $1 → Next time: "Same docs (from cache)... [new question]" costs $0.10! You save 90%!
When to Use Prompt Caching
- System instructions reused across all requests
- Large context (documentation, code repos)
- Few-shot examples repeated in prompts
- Conversation history in chatbots
- RAG systems with static knowledge bases
When NOT to Use Prompt Caching
- Content that changes every request (timestamps, user IDs)
- Very small prompts (<1,000 tokens)
- Single-use requests (batch processing)
- Highly personalized content per user
Prompt caching is a technique that can reduce your LLM costs by up to 90% by reusing computed context across multiple requests. If you're repeatedly sending the same system instructions, documentation, or context, you're paying for the same tokens over and over—unless you cache them.
What is Prompt Caching?
Prompt caching allows you to mark portions of your prompt as cacheable. The LLM provider stores the processed representation of that content, so subsequent requests that include the same content don't need to reprocess it from scratch.
Without caching:
- Request 1: Process 10,000 tokens of context + 50 tokens of query = $0.30
- Request 2: Process same 10,000 tokens + new 50 token query = $0.30
- Total: $0.60
With caching:
- Request 1: Process 10,000 tokens and write them to the cache (cache miss) = $0.30 (on Anthropic, cache writes carry a roughly 25% premium over standard input pricing, omitted here for simplicity)
- Request 2: Read 10,000 tokens from cache + process 50 new tokens = $0.03
- Total: $0.33 (45% savings, and the percentage improves with every additional request)
Why Prompt Caching Matters
Massive cost savings: 90% reduction on cached tokens after the first request
Faster responses: Cached content is processed faster, reducing latency
Scales naturally: The more users you have, the more savings you get
Minimal code changes: a few extra fields in your API calls are all it takes to enable caching
How It Works
Anthropic's Prompt Caching
Mark cacheable sections with cache control breakpoints:
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic()

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are an expert customer support agent...",
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: documentationContent, // large, repeated content
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [
    { role: "user", content: userQuestion }
  ]
})
The cached content persists for 5 minutes, and the timer resets each time the cache is read, so any request within that window reuses the cache.
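To confirm caching is actually happening, inspect the usage object on the response. The sketch below reads the cache-related token counts the Messages API reports alongside the normal input count:
// cache_creation_input_tokens: tokens written to the cache (first request in the window)
// cache_read_input_tokens: tokens served from the cache (subsequent requests)
const { input_tokens, cache_creation_input_tokens, cache_read_input_tokens } = response.usage
console.log(`uncached input: ${input_tokens}`)
console.log(`written to cache: ${cache_creation_input_tokens}`)
console.log(`read from cache: ${cache_read_input_tokens}`)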
OpenAI's Prompt Caching
Applied automatically to the repeated prefix of prompts of 1,024 tokens or more; no cache markers are required:
import OpenAI from "openai"

const openai = new OpenAI()

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: largeSystemPrompt // automatically cached when the prefix repeats
    },
    {
      role: "user",
      content: userQuestion
    }
  ]
})
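To see how much of the prompt was served from OpenAI's cache, check the cached-token counter in the usage object (the field may be absent on models without caching support, hence the optional chaining):
// prompt_tokens_details.cached_tokens reports prompt tokens that hit the cache
const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0
console.log(`prompt tokens: ${response.usage.prompt_tokens}, served from cache: ${cachedTokens}`)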
When to Use Prompt Caching
Perfect for:
System instructions: Reused across all user requests
system: [
{
text: "You are a coding assistant. Follow these rules...",
cache_control: { type: "ephemeral" }
}
]
Large context: Documentation, code repos, or knowledge bases
system: [
{
text: "systemInstructions",
cache_control: { type: "ephemeral" }
},
{
text: "fullAPIDocumentation", // 50k tokens
cache_control: { type: "ephemeral" }
}
]
Few-shot examples: Repeated examples teaching the model
system: [
{
text: "instructions"
},
{
text: "Example 1:\nInput: ...\nOutput: ...\n\n" +
"Example 2:\nInput: ...\nOutput: ...\n\n" +
"Example 3:\nInput: ...\nOutput: ...\n\n",
cache_control: { type: "ephemeral" }
}
]
Conversation history: In chatbots, cache the older messages
messages: [
{
role: "user",
content: "oldMessage1"
},
{
role: "assistant",
content: "oldResponse1"
},
// ... more history
{
  role: "assistant",
  content: [
    {
      type: "text",
      text: "oldResponse10",
      cache_control: { type: "ephemeral" } // caches the entire prefix up to here
    }
  ]
},
{
  role: "user",
  content: "newMessage" // Only this is new
}
]
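A small helper keeps this pattern consistent. The sketch below (a hypothetical helper, not part of either SDK) converts the last stored history message to block form, marks it as the cache breakpoint, and appends the new user turn:
function buildCachedMessages(history, newMessage) {
  const cached = history.map((msg, i) => {
    if (i !== history.length - 1) return msg
    // The last history message becomes the cache breakpoint, so it needs block form
    const text = typeof msg.content === "string" ? msg.content : msg.content[0].text
    return {
      role: msg.role,
      content: [
        { type: "text", text, cache_control: { type: "ephemeral" } }
      ]
    }
  })
  return [...cached, { role: "user", content: newMessage }]
}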
Optimization Strategies
Cache Hit Rate Maximization
Consistent ordering: Always structure your prompt the same way. Changing order breaks the cache.
// BAD: Inconsistent content breaks the cache
system: [
  { text: Math.random() < 0.5 ? instructionsA : instructionsB }
]
// GOOD: Always the same
system: [
{ text: standardInstructions }
]
Minimize variance: Extract dynamic parts (user name, date) from cacheable sections.
// BAD: User-specific, won't cache well
{ text: `Hello ${userName}, you are a support agent...` }
// GOOD: Generic, caches for all users
{ text: "You are a support agent..." }
// Include userName in the user message instead
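Putting both rules together, a full request might look like the sketch below, where userName and userQuestion are placeholders supplied by your application:
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a support agent...", // identical for every user, so it caches well
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [
    // Dynamic details go in the uncached user turn
    { role: "user", content: `Customer name: ${userName}\n\n${userQuestion}` }
  ]
})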
Hierarchical Caching
Use multiple cache breakpoints for different update frequencies:
system: [
{
text: "systemInstructions", // Never changes
cache_control: { type: "ephemeral" }
},
{
text: "productDocumentation", // Updates weekly
cache_control: { type: "ephemeral" }
},
{
text: "recentUpdates", // Updates daily
cache_control: { type: "ephemeral" }
}
]
Because caching is prefix-based, a change to the recent updates only invalidates the final cache segment; the earlier breakpoints still hit.
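One way to keep the layering consistent is a small helper that turns an ordered list of layers (most stable first) into system blocks with a breakpoint after each one. This is a sketch; keep in mind that Anthropic caps breakpoints at four per request:
function buildLayeredSystem(layers) {
  return layers.map((text) => ({
    type: "text",
    text,
    cache_control: { type: "ephemeral" } // breakpoint after each layer
  }))
}

const system = buildLayeredSystem([
  systemInstructions,   // never changes
  productDocumentation, // updates weekly
  recentUpdates         // updates daily
])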
Threshold for Caching
Only cache if content exceeds minimum size:
function shouldCache(content) {
  // Rough heuristic: ~4 characters per token for English text
  const tokenCount = Math.ceil(content.length / 4)
  // Cache only if the block is large enough and reused across requests
  return tokenCount > 1000
}
Small content isn't worth caching: cache writes carry a cost premium, and Anthropic won't cache blocks below a minimum size (1,024 tokens on most models) at all.
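In practice you can apply the breakpoint conditionally, as in this sketch:
// Add cache_control only when the content clears the threshold
const docBlock = {
  type: "text",
  text: documentationContent,
  ...(shouldCache(documentationContent) ? { cache_control: { type: "ephemeral" } } : {})
}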
Measuring Cache Performance
Track These Metrics
{
cacheHitRate: 0.85, // 85% of requests hit cache
avgCostSavings: "$0.15 per request",
cacheMisses: 150, // Out of 1000 requests
avgLatencyReduction: "200ms"
}
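A simple way to collect these numbers is to accumulate the usage fields from every response. The sketch below uses the field names Anthropic returns and a hypothetical stats object:
const stats = { requests: 0, cacheHits: 0, cachedTokens: 0, uncachedTokens: 0 }

function recordUsage(usage) {
  stats.requests += 1
  stats.uncachedTokens += usage.input_tokens
  if ((usage.cache_read_input_tokens ?? 0) > 0) {
    stats.cacheHits += 1
    stats.cachedTokens += usage.cache_read_input_tokens
  }
}

// After each call: recordUsage(response.usage)
// Hit rate: stats.cacheHits / stats.requests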
Calculate ROI
// Without caching
const costWithoutCaching = requests * tokensPerRequest * pricePerToken
// With caching
const costWithCaching = (
(cacheMisses * tokensPerRequest * pricePerToken) +
(cacheHits * tokensPerRequest * cachedTokenPrice)
)
const savings = costWithoutCaching - costWithCaching
const savingsPercent = (savings / costWithoutCaching) * 100
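Plugging in numbers consistent with the earlier example (10,000-token prompt at $0.00003 per token, cached reads at roughly a tenth of that, 1,000 requests at an 85% hit rate) gives a feel for the scale; the figures are illustrative only:
const requests = 1000
const cacheMisses = 150
const cacheHits = 850
const tokensPerRequest = 10000
const pricePerToken = 0.00003     // $0.30 per 10k tokens, as above
const cachedTokenPrice = 0.000003 // ~10% of the base input price

const costWithout = requests * tokensPerRequest * pricePerToken    // $300.00
const costWith =
  cacheMisses * tokensPerRequest * pricePerToken +                 // $45.00
  cacheHits * tokensPerRequest * cachedTokenPrice                  // $25.50
// costWith ≈ $70.50, roughly 76% cheaper than $300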
Common Pitfalls
Caching dynamic content: Don't cache content that changes per request.
// BAD
{
  text: `Current time: ${new Date()}`,
  cache_control: { type: "ephemeral" }
}
Too many cache blocks: Each cache breakpoint adds overhead, and Anthropic allows at most four per request. Use 2-4 well-placed breakpoints, not 20.
Cache expiration surprises: Anthropic's ephemeral cache expires after 5 minutes without use (each read refreshes the timer). Don't assume indefinite caching.
Ignoring cache usage in responses: Check usage.cache_read_input_tokens to verify that caching actually worked.
Advanced Patterns
Versioned Caching
Append version to cache content when updates happen:
{
  text: `v2.1.0\n${documentation}`,
  cache_control: { type: "ephemeral" }
}
When documentation updates, bump the version to bust cache.
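A small helper (hypothetical names) keeps the version and the content together so the bump can't be forgotten:
// docsVersion and documentation come from wherever your docs are stored
function versionedDocBlock(docsVersion, documentation) {
  return {
    type: "text",
    // Changing the version string changes the prefix, which busts the cache
    text: `${docsVersion}\n${documentation}`,
    cache_control: { type: "ephemeral" }
  }
}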
User-Scoped Caching
Cache user-specific context for personalized experiences. This pays off when the same user makes several requests within the cache window:
system: [
{
text: "sharedInstructions",
cache_control: { type: "ephemeral" }
},
{
text: `User preferences: ${JSON.stringify(userPrefs)}`,
cache_control: { type: "ephemeral" }
}
]
Cache Warming
Pre-warm cache for common queries:
// On app startup, send a minimal request to populate the cache
await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1,
  system: [
    { type: "text", text: documentation, cache_control: { type: "ephemeral" } }
  ],
  messages: [{ role: "user", content: "warmup" }]
})
Subsequent user requests immediately benefit from cached context.
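On low-traffic services the 5-minute window can lapse between user requests. One option, sketched below, is to repeat the warmup on a timer so the cache never expires; whether the extra requests are worth it depends on your traffic and pricing.
// Re-warm every 4 minutes so the 5-minute TTL never lapses
setInterval(async () => {
  await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1,
    system: [
      { type: "text", text: documentation, cache_control: { type: "ephemeral" } }
    ],
    messages: [{ role: "user", content: "warmup" }]
  })
}, 4 * 60 * 1000)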
Real-World Example
Customer Support Chatbot
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic()

async function handleSupportQuery(userMessage) {
const response = await anthropic.messages.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are a helpful customer support agent for Acme Inc.",
cache_control: { type: "ephemeral" }
},
{
type: "text",
text: await loadProductDocs(), // 30k tokens
cache_control: { type: "ephemeral" }
},
{
type: "text",
text: await loadFAQ(), // 10k tokens
cache_control: { type: "ephemeral" }
}
],
messages: [
{ role: "user", content: userMessage }
]
})
console.log(`Cache read: ${response.usage.cache_read_input_tokens} tokens`)
return response.content[0].text
}
First request: processes 40k tokens and costs ~$1.20.
Subsequent requests: read 40k tokens from cache and process only the user message, costing ~$0.12.
Savings: roughly 90% on every request after the first.
Best Practices
- Cache static content: System instructions, docs, examples
- Keep cache blocks large: Aim for 1k+ tokens per block
- Monitor hit rates: Track cache performance in production
- Version your cache: Bust cache intentionally when content updates
- Test without caching: Ensure your prompts work both ways
When NOT to Cache
- Content changes every request (timestamps, random values)
- Very small prompts (< 1000 tokens)
- Single-use requests (batch processing)
- Highly personalized content that won't benefit multiple users
Prompt caching is one of the easiest and highest-ROI optimizations for production AI applications. If you're sending repeated context, you should be caching it. The savings are immediate and dramatic.