Prompt Caching
Reduce LLM costs by up to 90% by caching repeated context in prompts
Imagine you're reading a really long book to answer different questions. Prompt caching is like putting a bookmark in the book! Instead of reading the whole book from the beginning each time, you remember where you were and just read the new question.
When you send the same instructions or context to AI multiple times, caching remembers it so you don't have to pay to send it again. It's like reusing homework instead of redoing it every time!
Example: First time: "Here are our 100-page docs... [question]" costs $1 → Next time: "Same docs (from cache)... [new question]" costs $0.10! You save 90%!
When to Use Prompt Caching
- System instructions reused across all requests
- Large context (documentation, code repos)
- Few-shot examples repeated in prompts
- Conversation history in chatbots
- RAG systems with static knowledge bases
When NOT to Use Prompt Caching
- Content that changes every request (timestamps, user IDs)
- Very small prompts (<1,000 tokens)
- Single-use requests (batch processing)
- Highly personalized content per user
Prompt caching is a technique that can reduce your LLM costs by up to 90% by reusing computed context across multiple requests. If you're repeatedly sending the same system instructions, documentation, or context, you're paying for the same tokens over and over—unless you cache them.
What is Prompt Caching?
Prompt caching allows you to mark portions of your prompt as cacheable. The LLM provider stores the processed representation of that content, so subsequent requests that include the same content don't need to reprocess it from scratch.
Without caching:
- Request 1: Process 10,000 tokens of context + 50 tokens of query = $0.30
- Request 2: Process same 10,000 tokens + new 50 token query = $0.30
- Total: $0.60
With caching:
- Request 1: Process 10,000 tokens and write them to the cache (cache miss) = $0.30 (on Anthropic, cache writes carry a roughly 25% premium over standard input pricing, omitted here for simplicity)
- Request 2: Read 10,000 tokens from cache + process 50 new tokens = $0.03
- Total: $0.33 (45% savings, and the percentage improves with every additional request)
Why Prompt Caching Matters
Massive cost savings: 90% reduction on cached tokens after the first request
Faster responses: Cached content is processed faster, reducing latency
Scales naturally: The more users you have, the more savings you get
Minimal code changes: a few extra fields in your API calls are all it takes to enable caching
How It Works
Anthropic's Prompt Caching
Mark cacheable sections with cache control breakpoints:
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic()

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are an expert customer support agent...",
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: documentationContent, // large, repeated content
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [
    { role: "user", content: userQuestion }
  ]
})
The cached content persists for 5 minutes, and the timer resets each time the cache is read, so any request within that window reuses the cache.
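To confirm caching is actually happening, inspect the usage object on the response. The sketch below reads the cache-related token counts the Messages API reports alongside the normal input count:
// cache_creation_input_tokens: tokens written to the cache (first request in the window)
// cache_read_input_tokens: tokens served from the cache (subsequent requests)
const { input_tokens, cache_creation_input_tokens, cache_read_input_tokens } = response.usage
console.log(`uncached input: ${input_tokens}`)
console.log(`written to cache: ${cache_creation_input_tokens}`)
console.log(`read from cache: ${cache_read_input_tokens}`)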
OpenAI's Prompt Caching
Applied automatically to the repeated prefix of prompts of 1,024 tokens or more; no cache markers are required:
import OpenAI from "openai"

const openai = new OpenAI()

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: largeSystemPrompt // automatically cached when the prefix repeats
    },
    {
      role: "user",
      content: userQuestion
    }
  ]
})
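To see how much of the prompt was served from OpenAI's cache, check the cached-token counter in the usage object (the field may be absent on models without caching support, hence the optional chaining):
// prompt_tokens_details.cached_tokens reports prompt tokens that hit the cache
const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0
console.log(`prompt tokens: ${response.usage.prompt_tokens}, served from cache: ${cachedTokens}`)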
When to Use Prompt Caching
Perfect for:
System instructions: Reused across all user requests
system: [
{
text: "You are a coding assistant. Follow these rules...",
cache_control: { type: "ephemeral" }
}
]
Large context: Documentation, code repos, or knowledge bases
system: [
{
text: "systemInstructions",
cache_control: { type: "ephemeral" }
},
{
text: "fullAPIDocumentation", // 50k tokens
cache_control: { type: "ephemeral" }
}
]
Few-shot examples: Repeated examples teaching the model
system: [
{
text: "instructions"
},
{
text: "Example 1:\nInput: ...\nOutput: ...\n\n" +
"Example 2:\nInput: ...\nOutput: ...\n\n" +
"Example 3:\nInput: ...\nOutput: ...\n\n",
cache_control: { type: "ephemeral" }
}
]
Conversation history: In chatbots, cache the older messages
messages: [
{
role: "user",
content: "oldMessage1"
},
{
role: "assistant",
content: "oldResponse1"
},
// ... more history
{
  role: "assistant",
  content: [
    {
      type: "text",
      text: "oldResponse10",
      cache_control: { type: "ephemeral" } // caches the entire prefix up to here
    }
  ]
},
{
  role: "user",
  content: "newMessage" // Only this is new
}
]
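A small helper keeps this pattern consistent. The sketch below (a hypothetical helper, not part of either SDK) converts the last stored history message to block form, marks it as the cache breakpoint, and appends the new user turn:
function buildCachedMessages(history, newMessage) {
  const cached = history.map((msg, i) => {
    if (i !== history.length - 1) return msg
    // The last history message becomes the cache breakpoint, so it needs block form
    const text = typeof msg.content === "string" ? msg.content : msg.content[0].text
    return {
      role: msg.role,
      content: [
        { type: "text", text, cache_control: { type: "ephemeral" } }
      ]
    }
  })
  return [...cached, { role: "user", content: newMessage }]
}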
Optimization Strategies
Cache Hit Rate Maximization
Consistent ordering: Always structure your prompt the same way. Changing order breaks the cache.
// BAD: Inconsistent content breaks the cache
system: [
  { text: Math.random() < 0.5 ? instructionsA : instructionsB }
]
// GOOD: Always the same
system: [
{ text: standardInstructions }
]
Minimize variance: Extract dynamic parts (user name, date) from cacheable sections.
// BAD: User-specific, won't cache well
{ text: `Hello ${userName}, you are a support agent...` }
// GOOD: Generic, caches for all users
{ text: "You are a support agent..." }
// Include userName in the user message instead
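Putting both rules together, a full request might look like the sketch below, where userName and userQuestion are placeholders supplied by your application:
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a support agent...", // identical for every user, so it caches well
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [
    // Dynamic details go in the uncached user turn
    { role: "user", content: `Customer name: ${userName}\n\n${userQuestion}` }
  ]
})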
Hierarchical Caching
Use multiple cache breakpoints for different update frequencies:
system: [
{
text: "systemInstructions", // Never changes
cache_control: { type: "ephemeral" }
},
{
text: "productDocumentation", // Updates weekly
cache_control: { type: "ephemeral" }
},
{
text: "recentUpdates", // Updates daily
cache_control: { type: "ephemeral" }
}
]
Because caching is prefix-based, a change to the recent updates only invalidates the final cache segment; the earlier breakpoints still hit.
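One way to keep the layering consistent is a small helper that turns an ordered list of layers (most stable first) into system blocks with a breakpoint after each one. This is a sketch; keep in mind that Anthropic caps breakpoints at four per request:
function buildLayeredSystem(layers) {
  return layers.map((text) => ({
    type: "text",
    text,
    cache_control: { type: "ephemeral" } // breakpoint after each layer
  }))
}

const system = buildLayeredSystem([
  systemInstructions,   // never changes
  productDocumentation, // updates weekly
  recentUpdates         // updates daily
])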
Threshold for Caching
Only cache if content exceeds minimum size:
function shouldCache(content) {
  // Rough heuristic: ~4 characters per token for English text
  const tokenCount = Math.ceil(content.length / 4)
  // Cache only if the block is large enough and reused across requests
  return tokenCount > 1000
}
Small content isn't worth caching: cache writes carry a cost premium, and Anthropic won't cache blocks below a minimum size (1,024 tokens on most models) at all.
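In practice you can apply the breakpoint conditionally, as in this sketch:
// Add cache_control only when the content clears the threshold
const docBlock = {
  type: "text",
  text: documentationContent,
  ...(shouldCache(documentationContent) ? { cache_control: { type: "ephemeral" } } : {})
}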
Measuring Cache Performance
Track These Metrics
{
cacheHitRate: 0.85, // 85% of requests hit cache
avgCostSavings: "$0.15 per request",
cacheMisses: 150, // Out of 1000 requests
avgLatencyReduction: "200ms"
}
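A simple way to collect these numbers is to accumulate the usage fields from every response. The sketch below uses the field names Anthropic returns and a hypothetical stats object:
const stats = { requests: 0, cacheHits: 0, cachedTokens: 0, uncachedTokens: 0 }

function recordUsage(usage) {
  stats.requests += 1
  stats.uncachedTokens += usage.input_tokens
  if ((usage.cache_read_input_tokens ?? 0) > 0) {
    stats.cacheHits += 1
    stats.cachedTokens += usage.cache_read_input_tokens
  }
}

// After each call: recordUsage(response.usage)
// Hit rate: stats.cacheHits / stats.requests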
Calculate ROI
// Without caching
const costWithoutCaching = requests * tokensPerRequest * pricePerToken
// With caching
const costWithCaching = (
(cacheMisses * tokensPerRequest * pricePerToken) +
(cacheHits * tokensPerRequest * cachedTokenPrice)
)
const savings = costWithoutCaching - costWithCaching
const savingsPercent = (savings / costWithoutCaching) * 100
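Plugging in numbers consistent with the earlier example (10,000-token prompt at $0.00003 per token, cached reads at roughly a tenth of that, 1,000 requests at an 85% hit rate) gives a feel for the scale; the figures are illustrative only:
const requests = 1000
const cacheMisses = 150
const cacheHits = 850
const tokensPerRequest = 10000
const pricePerToken = 0.00003     // $0.30 per 10k tokens, as above
const cachedTokenPrice = 0.000003 // ~10% of the base input price

const costWithout = requests * tokensPerRequest * pricePerToken    // $300.00
const costWith =
  cacheMisses * tokensPerRequest * pricePerToken +                 // $45.00
  cacheHits * tokensPerRequest * cachedTokenPrice                  // $25.50
// costWith ≈ $70.50, roughly 76% cheaper than $300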
Common Pitfalls
Caching dynamic content: Don't cache content that changes per request.
// BAD
{
  text: `Current time: ${new Date()}`,
  cache_control: { type: "ephemeral" }
}
Too many cache blocks: Each cache breakpoint adds overhead, and Anthropic allows at most four per request. Use 2-4 well-placed breakpoints, not 20.
Cache expiration surprises: Anthropic's ephemeral cache expires after 5 minutes without use (each read refreshes the timer). Don't assume indefinite caching.
Ignoring cache usage in responses: Check usage.cache_read_input_tokens to verify that caching actually worked.
Advanced Patterns
Versioned Caching
Append version to cache content when updates happen:
{
  text: `v2.1.0\n${documentation}`,
  cache_control: { type: "ephemeral" }
}
When documentation updates, bump the version to bust cache.
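A small helper (hypothetical names) keeps the version and the content together so the bump can't be forgotten:
// docsVersion and documentation come from wherever your docs are stored
function versionedDocBlock(docsVersion, documentation) {
  return {
    type: "text",
    // Changing the version string changes the prefix, which busts the cache
    text: `${docsVersion}\n${documentation}`,
    cache_control: { type: "ephemeral" }
  }
}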
User-Scoped Caching
Cache user-specific context for personalized experiences. This pays off when the same user makes several requests within the cache window:
system: [
{
text: "sharedInstructions",
cache_control: { type: "ephemeral" }
},
{
text: `User preferences: ${JSON.stringify(userPrefs)}`,
cache_control: { type: "ephemeral" }
}
]
Cache Warming
Pre-warm cache for common queries:
// On app startup, send a minimal request to populate the cache
await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1,
  system: [
    { type: "text", text: documentation, cache_control: { type: "ephemeral" } }
  ],
  messages: [{ role: "user", content: "warmup" }]
})
Subsequent user requests immediately benefit from cached context.
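On low-traffic services the 5-minute window can lapse between user requests. One option, sketched below, is to repeat the warmup on a timer so the cache never expires; whether the extra requests are worth it depends on your traffic and pricing.
// Re-warm every 4 minutes so the 5-minute TTL never lapses
setInterval(async () => {
  await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1,
    system: [
      { type: "text", text: documentation, cache_control: { type: "ephemeral" } }
    ],
    messages: [{ role: "user", content: "warmup" }]
  })
}, 4 * 60 * 1000)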
Real-World Example
Customer Support Chatbot
import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic()

async function handleSupportQuery(userMessage) {
const response = await anthropic.messages.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are a helpful customer support agent for Acme Inc.",
cache_control: { type: "ephemeral" }
},
{
type: "text",
text: await loadProductDocs(), // 30k tokens
cache_control: { type: "ephemeral" }
},
{
type: "text",
text: await loadFAQ(), // 10k tokens
cache_control: { type: "ephemeral" }
}
],
messages: [
{ role: "user", content: userMessage }
]
})
console.log(`Cache read: ${response.usage.cache_read_input_tokens} tokens`)
return response.content[0].text
}
First request: processes 40k tokens and costs ~$1.20.
Subsequent requests: read 40k tokens from cache and process only the user message, costing ~$0.12.
Savings: roughly 90% on every request after the first.
Best Practices
- Cache static content: System instructions, docs, examples
- Keep cache blocks large: Aim for 1k+ tokens per block
- Monitor hit rates: Track cache performance in production
- Version your cache: Bust cache intentionally when content updates
- Test without caching: Ensure your prompts work both ways
When NOT to Cache
- Content changes every request (timestamps, random values)
- Very small prompts (< 1000 tokens)
- Single-use requests (batch processing)
- Highly personalized content that won't benefit multiple users
Prompt caching is one of the easiest and highest-ROI optimizations for production AI applications. If you're sending repeated context, you should be caching it. The savings are immediate and dramatic.