Prompt Caching Explained: Real Benchmarks, Real Savings Across Claude, GPT, and Gemini

Most AI applications have a silent money leak. Every API call re-sends the same system prompt - or the same retrieved document, or the same 50-message conversation history - from scratch. The provider tokenizes it, loads it into memory, and processes it again. You pay full price every time.

Prompt caching eliminates that. You pay once to warm the cache, then a fraction of the cost on every subsequent call that shares that prefix. On a high-volume workload, the savings compound fast. 50% off is not unusual, and it often comes from a single annotation change.

This post walks through how prompt caching actually works, how each major provider implements it, and what the numbers look like against a real benchmark. All data comes from running identical tests across claude-haiku-4-5, gpt-5.4-mini, and gemini-2.5-flash.

What Is Prompt Caching?

Every time an LLM processes your input, it builds a KV (key-value) cache - a set of attention tensors computed from your tokens. Normally, that cache is thrown away when the response completes. Prompt caching keeps it alive between requests.

When your next call starts with the same prefix, the provider skips recomputing those tensors. The cached tokens are served from memory at a fraction of the compute cost - typically 10% of standard input pricing. Providers call these “cache hits,” and they show up as a distinct token category in API responses.

The prerequisite is a stable prefix. The portion you want cached - your system prompt, a retrieved document, a conversation history - must appear at the start of every call without modification. Dynamic content (the user’s question, tool call results) can follow after the cached portion; those still run at standard rates.
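A minimal sketch of the stable-prefix rule, assuming a chat-style message list. The helper names (`build_messages`, `serialize`, `shared_prefix_len`) are hypothetical, and the comparison is character-based purely for illustration; providers match on tokens, not characters.

```python
import os

def build_messages(static_context: str, question: str) -> list[dict]:
    """Put the unchanging context first and the per-call question last."""
    return [
        {"role": "system", "content": static_context},
        {"role": "user", "content": question},
    ]

def serialize(messages: list[dict]) -> str:
    """Flatten messages into one string so two prompts can be compared."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two serialized prompts."""
    return len(os.path.commonprefix([a, b]))

context = "You are a legal analyst.\n\n" + "CONTRACT TEXT " * 50
p1 = serialize(build_messages(context, "What are the termination clauses?"))
p2 = serialize(build_messages(context, "Summarize the liability provisions."))

# Everything before the first differing character is a candidate for caching.
stable = shared_prefix_len(p1, p2)
print(f"{stable} of {len(p1)} characters are shared")
```

Moving the question ahead of the context would shrink the shared prefix to almost nothing, which is why dynamic content belongs at the end.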

What can be cached

System prompts and personas
Long documents loaded via RAG
Repeated tool schemas in agentic pipelines
Conversation history up to a checkpoint

Minimum token thresholds

Anthropic Claude: varies by model - claude-haiku-4-5 requires 4,096 tokens; claude-sonnet-4-6 and claude-opus-4-8 require 1,024 tokens
OpenAI: 1,024 tokens (automatic, no configuration needed)
Google Gemini 2.5 Flash implicit: 2,048 tokens
Google Gemini 2.5 Flash explicit: 2,048 tokens
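
As a quick illustration, the thresholds above can be turned into a pre-flight check. This is only a sketch: the dictionary keys and the `is_cacheable` helper are made-up names, and the numbers are copied from the list above rather than fetched from any provider.

```python
# Minimum cacheable prefix sizes in tokens, copied from the list above.
MIN_CACHE_TOKENS = {
    "claude-haiku-4-5": 4096,
    "claude-sonnet-4-6": 1024,
    "claude-opus-4-8": 1024,
    "openai": 1024,
    "gemini-2.5-flash": 2048,
}

def is_cacheable(model_key: str, prefix_tokens: int) -> bool:
    """True when the stable prefix is long enough to be cached at all."""
    return prefix_tokens >= MIN_CACHE_TOKENS[model_key]

print(is_cacheable("claude-haiku-4-5", 4500))
print(is_cacheable("claude-haiku-4-5", 4000))
```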

How Each Provider Implements It

The mechanics differ significantly across providers. Here is the implementation pattern for each.

Claude - Explicit Annotation

Anthropic requires you to mark the content you want cached. You add a cache_control field to any message or system block.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": contract_text,           # the 4,500-token document
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": question}],
)

# Usage breakdown in the response
u = response.usage
cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0
cache_read  = getattr(u, "cache_read_input_tokens", 0) or 0
uncached    = u.input_tokens

On the first call, cache_creation_input_tokens shows the tokens written to cache. On all subsequent calls that share the same prefix, cache_read_input_tokens shows the tokens served from cache at 90% off standard pricing.

This explicitness is Claude’s main advantage: you always know exactly what was cached, how many tokens it consumed, and whether the cache was hit or missed.
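
A small sketch of how those three usage counters might be turned into a per-call label. The `classify_call` helper and its return shape are hypothetical; only the three token categories come from the API response described above.

```python
def classify_call(cache_write: int, cache_read: int, uncached: int) -> dict:
    """Label one call as a cache hit, a cache write, or a plain miss."""
    total = cache_write + cache_read + uncached
    if cache_read > 0:
        status = "hit"
    elif cache_write > 0:
        status = "write"
    else:
        status = "miss"
    return {
        "status": status,
        "total_input": total,
        "cached_share": cache_read / total if total else 0.0,
    }

# Token counts in the shape of a first call and a follow-up call.
first = classify_call(cache_write=6224, cache_read=0, uncached=16)
later = classify_call(cache_write=0, cache_read=6224, uncached=16)
print(first["status"], later["status"], f"{later['cached_share']:.1%}")
```

Logging the status per call makes it obvious when a supposedly warm cache is being rewritten instead of read.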

OpenAI - Automatic Prefix Caching

OpenAI does not require any code changes. Prompts longer than 1,024 tokens are automatically cached when you send the same prefix across multiple requests. The only requirement is that your system prompt remains byte-identical across calls.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a legal analyst. Analyze the following service agreement "
    "and answer questions accurately.\n\n" + contract_text
)

response = client.chat.completions.create(
    model="gpt-5.4-mini",
    max_completion_tokens=512,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": question},
    ],
)

# Cache hits show up in prompt_tokens_details
u = response.usage
cached_tokens   = u.prompt_tokens_details.cached_tokens or 0
uncached_tokens = u.prompt_tokens - cached_tokens

The first call gets a full-price cache miss. Starting from the second call, cached_tokens reflects tokens served at the discounted rate. No annotation, no setup - it just works when the prefix is stable.

The trade-off: you cannot force a cache miss, cannot inspect whether a particular call contributed to the cache, and cannot predict the exact cache boundary.

Google Gemini - Two Modes

Gemini offers both automatic and explicit caching.

Implicit caching is always on - Gemini caches frequently-seen prompt prefixes automatically at no additional cost. The pricing for implicitly cached tokens is zero:

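from google import genai

client = genai.Client()  # reads GEMINI_API_KEY (or GOOGLE_API_KEY) from the environment

response = client.models.generate_content(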
    model="gemini-2.5-flash",
    contents=question,
    config=genai.types.GenerateContentConfig(
        system_instruction="You are a legal analyst...",
    ),
)

meta = response.usage_metadata
cached_tokens = meta.cached_content_token_count or 0   # free if implicit

Explicit caching lets you pre-upload a document as a CachedContent object with a TTL. The content must meet the explicit-caching minimum listed earlier (2,048 tokens for Gemini 2.5 Flash); in exchange you get deterministic cache hits and a lower per-token cost on reads than standard input:

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=genai.types.CreateCachedContentConfig(
        system_instruction="You are a legal analyst...",
        contents=[large_document],
        ttl="3600s",
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=question,
    config=genai.types.GenerateContentConfig(
        cached_content=cache.name,
    ),
)

After you are done, call client.caches.delete(name=cache.name) to avoid storage charges.

The Benchmark

To produce comparable numbers, we ran the same workload across all three providers: a 4,500-token Master Service Agreement (an MSA between two enterprise companies), analyzed with five legal questions.

Questions

What are the termination clauses in this agreement?
Summarize the liability and indemnification provisions.
Are there any auto-renewal terms? What are the notice periods?
What are the payment terms and late payment penalties?
Identify any non-compete or non-solicitation clauses.

Each question was sent as a separate API call with the full contract in the system prompt. The “no cache” run used a random UUID injected into each prompt to prevent any implicit caching from activating. The “with cache” run used the implementations above.
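
The cache-busting step can be sketched as follows. Placing the UUID at the very start of the prompt is an assumption made here (the run description only says it was injected into each prompt), and `bust_cache` is a hypothetical helper name.

```python
import os
import uuid

def bust_cache(system_prompt: str) -> str:
    """Prepend a unique marker so no two calls share a meaningful prefix."""
    return f"[run-id: {uuid.uuid4()}]\n{system_prompt}"

contract = "MASTER SERVICE AGREEMENT (placeholder text)"
a, b = bust_cache(contract), bust_cache(contract)

# Only the literal "[run-id: " (plus any coincidental hex digits) is shared.
shared = os.path.commonprefix([a, b])
print(len(shared), "shared leading characters out of", len(a))
```

A marker appended at the end would leave the long shared prefix intact and would not prevent caching.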

Token Results

Here are the raw numbers from the benchmark runs.

Claude claude-haiku-4-5

Question	Input Tokens	Cache Write	Cache Read	Uncached	Output	Latency
Q1 (first call)	6,240	6,224	0	16	512	5.83s
Q2	6,240	0	6,224	16	512	5.72s
Q3	6,243	0	6,224	19	395	4.30s
Q4	6,239	0	6,224	15	352	3.97s
Q5	6,243	0	6,224	19	434	7.29s
Total	31,205	6,224	24,896	85	2,205	5.42s avg

Baseline without cache - 31,205 input tokens, 2,398 output tokens, 6.30s average.

The cache writes 6,224 tokens once on Q1 and reads them back for Q2–Q5. Uncached tokens per question are just the question text itself - a negligible 15–19 tokens.

OpenAI gpt-5.4-mini

Question	Cached Tokens	Uncached	Output	Latency
Q1 (cache miss)	0	5,627	512	3.64s
Q2	5,376	252	512	3.88s
Q3	5,376	257	368	3.07s
Q4	5,376	252	323	3.69s
Q5	5,376	254	275	2.19s
Total	21,504	6,642	1,990	3.29s avg

Baseline without cache - 28,284 input tokens, 1,995 output tokens, 3.23s average.

Notice that OpenAI caches 5,376 tokens (a multiple of 128 - that is the cache granularity boundary) rather than the full prompt length of ~5,627. The prefix is cached in fixed-size blocks.
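
The block alignment can be checked directly against the table. This is a small sketch; the 128-token block size is the figure quoted in this post, and `CACHE_BLOCK` is just an illustrative name.

```python
CACHE_BLOCK = 128  # block size quoted above

# (cached_tokens, uncached_tokens) for Q1-Q5, copied from the table
rows = [(0, 5627), (5376, 252), (5376, 257), (5376, 252), (5376, 254)]

report = []
for cached, uncached in rows:
    blocks, remainder = divmod(cached, CACHE_BLOCK)
    share = cached / (cached + uncached)
    report.append((blocks, remainder, round(share, 3)))

for q, (blocks, remainder, share) in enumerate(report, start=1):
    print(f"Q{q}: {blocks} blocks, remainder {remainder}, {share:.1%} of prompt cached")
```

A non-zero remainder on any row would suggest the reported count is not block-aligned and the granularity assumption needs revisiting.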

Google Gemini gemini-2.5-flash (Implicit)

Question	Cached Tokens	Uncached	Output	Latency
Q1	0	5,785	1,048	12.42s
Q2	0	5,785	810	11.98s
Q3	0	5,790	439	8.67s
Q4	5,097 (free)	689	326	5.76s
Q5	5,097 (free)	693	269	5.63s
Total	10,194	18,742	2,892	8.89s avg

Baseline without cache - 28,936 input tokens, 3,192 output tokens, 9.31s average.

Gemini’s implicit caching kicks in after Q3 - it needs a few repetitions of the same prefix to warm the cache. Tokens served from the implicit cache cost exactly zero. The latency drop on Q4 and Q5 is also visible: from 8–12s down to ~5.7s.

One caveat: even in the “no cache” baseline run, which was designed to prevent caching, Gemini still reported cached tokens on some questions because the previous test had already warmed the implicit cache. Implicit caching carried over from a prior session cannot be fully suppressed.

The Cost Math

Pricing used (approximate as of June 2026 - verify on official provider pricing pages before making budgeting decisions)

Provider	Model	Input $/1M	Cache Write $/1M	Cache Read $/1M	Output $/1M
Anthropic	claude-haiku-4-5	$1.00	$1.25	$0.10	$5.00
OpenAI	gpt-5.4-mini	$2.50	-	$0.25	$15.00
Google	gemini-2.5-flash	$0.30	-	$0.03 (or free*)	$2.50

*Implicit cache hits on Gemini are free. Explicit cache reads are $0.03/1M tokens, with storage at $1.00/1M tokens/hour.

Session-Level Costs (5 questions on one document)

Claude claude-haiku-4-5

Without cache

Input: 31,205 × $1.00/M = $0.0312
Output: 2,398 × $5.00/M = $0.0120
Total: $0.0432

With cache

Cache write: 6,224 × $1.25/M = $0.0078
Cache reads: 24,896 × $0.10/M = $0.0025
Uncached input: 85 × $1.00/M = $0.0001
Output: 2,205 × $5.00/M = $0.0110
Total: $0.0214

→ 50.5% savings. One cache_control field.

OpenAI gpt-5.4-mini

Without cache

Input: 28,284 × $2.50/M = $0.0707
Output: 1,995 × $15.00/M = $0.0299
Total: $0.1006

With cache

Cached input: 21,504 × $0.25/M = $0.0054
Uncached input: 6,642 × $2.50/M = $0.0166
Output: 1,990 × $15.00/M = $0.0299
Total: $0.0518

→ 48.5% savings. Zero code changes.

Google gemini-2.5-flash (implicit)

Without cache

Input: 28,936 × $0.30/M = $0.0087
Output: 3,192 × $2.50/M = $0.0080
Total: $0.0167

With implicit cache

Uncached input: 18,742 × $0.30/M = $0.0056
Cached input: 10,194 × free = $0.0000
Output: 2,892 × $2.50/M = $0.0072
Total: $0.0129

→ 22.9% savings. Completely automatic, and cached tokens are free.
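
The session arithmetic above can be reproduced in a few lines. This is a sketch under the pricing table's assumptions; the `Prices` class and `cost` helper are illustrative names, and because it works from unrounded token counts, its results can differ in the last digit from totals built by summing hand-rounded line items.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prices:
    """USD per 1M tokens."""
    inp: float
    cache_write: float
    cache_read: float
    out: float

def cost(p: Prices, *, uncached=0, cache_write=0, cache_read=0, output=0) -> float:
    """Cost in USD for one session's token totals."""
    micro_usd = (uncached * p.inp + cache_write * p.cache_write
                 + cache_read * p.cache_read + output * p.out)
    return micro_usd / 1_000_000

HAIKU = Prices(1.00, 1.25, 0.10, 5.00)
GPT_MINI = Prices(2.50, 2.50, 0.25, 15.00)  # no separate write price listed
GEMINI = Prices(0.30, 0.30, 0.00, 2.50)     # implicit hits treated as free

# name -> (cost without cache, cost with cache), token totals from the tables
sessions = {
    "claude": (cost(HAIKU, uncached=31205, output=2398),
               cost(HAIKU, uncached=85, cache_write=6224,
                    cache_read=24896, output=2205)),
    "openai": (cost(GPT_MINI, uncached=28284, output=1995),
               cost(GPT_MINI, uncached=6642, cache_read=21504, output=1990)),
    "gemini": (cost(GEMINI, uncached=28936, output=3192),
               cost(GEMINI, uncached=18742, cache_read=10194, output=2892)),
}

for name, (base, cached) in sessions.items():
    print(f"{name}: ${base:.4f} -> ${cached:.4f} ({1 - cached / base:.1%} saved)")
```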

At Production Scale

To understand real-world impact, model a production workload: 1,000 queries per day, all sharing the same 6,220-token system context.

TTL and 24-hour storage costs

Before looking at the numbers, it is worth understanding how each provider handles cache lifetime - this directly affects how many cache writes you pay for across a full day.

Provider	Default TTL	24-hr option	Storage fee	Write cost multiplier
Anthropic Claude	5 min (or 1 hr opt-in)	Not supported (max 1 hr)	None	1.25x (5 min) / 2x (1 hr)
OpenAI	Up to 1 hr inactive	Yes, on GPT-5.5/5.4 - free	None	Built into base price
Gemini implicit	Automatic	No explicit control	None	N/A
Gemini explicit	1 hr (default)	Yes, configurable	$1.00/M tokens/hr	N/A

Claude - Each cache read refreshes the TTL timer at no extra cost. For 1,000 evenly-spread queries per day (one every ~86 seconds), the 5-minute default TTL stays warm continuously - effectively one write per day. For overnight gaps or bursty patterns, you would pay one additional write per cold start. Using the 1-hour TTL costs 2x per write instead of 1.25x but reduces cold-start risk.
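
To see how traffic shape drives the number of paid writes, here is a small simulation. It assumes, as described above, that every read resets the TTL; `count_cache_writes` is a hypothetical helper, and treating a request that lands exactly at expiry as a miss is an arbitrary choice.

```python
def count_cache_writes(arrivals: list[float], ttl_s: float) -> int:
    """Count cold starts for a list of request times given in seconds."""
    writes = 0
    expires_at = float("-inf")
    for t in sorted(arrivals):
        if t >= expires_at:      # cache has lapsed (or never existed)
            writes += 1
        expires_at = t + ttl_s   # both writes and reads refresh the timer
    return writes

# 1,000 evenly spread queries per day: one every 86.4 seconds.
even = [i * 86.4 for i in range(1000)]
# A burst, a quiet stretch of nearly 15 minutes, then another burst.
bursty = [0, 60, 120, 1000, 1060]

print(count_cache_writes(even, ttl_s=300))
print(count_cache_writes(bursty, ttl_s=300), count_cache_writes(bursty, ttl_s=3600))
```

Feeding real request timestamps into a function like this is a cheap way to decide whether the 1-hour TTL is worth its higher write price.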

OpenAI - With 24-hour retention on newer models there are no additional fees. The cache persists across overnight gaps with zero change to the cost formula.

Gemini explicit - Storing 6,224 tokens for 24 hours adds a storage charge:

6,224 tokens / 1,000,000 × $1.00/hr × 24 hrs = $0.149/day in storage

This storage cost applies on top of the per-query cost. It is negligible for very large contexts (100K+ tokens where explicit cache is necessary) but meaningful for small ones where implicit caching is free.
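
The storage formula generalizes to any context size and retention window. The sketch below also derives a break-even query rate against running with no cache at all, using the Gemini prices from the table above; both helper names are hypothetical, and the break-even figure is simple arithmetic on those list prices rather than a measured result.

```python
def explicit_storage_cost(tokens: int, hours: float,
                          usd_per_m_token_hour: float = 1.00) -> float:
    """Storage charge in USD for keeping `tokens` cached for `hours`."""
    return tokens / 1_000_000 * usd_per_m_token_hour * hours

def breakeven_queries_per_hour(input_price: float = 0.30,
                               read_price: float = 0.03,
                               storage_price: float = 1.00) -> float:
    """Queries per hour at which read savings equal the storage fee.

    Both sides scale with the cached token count, so it cancels out.
    """
    return storage_price / (input_price - read_price)

print(f"${explicit_storage_cost(6224, 24):.3f} per day")
print(f"${explicit_storage_cost(6224, 1):.3f} per hour")
print(f"{breakeven_queries_per_hour():.1f} queries/hour to break even vs no cache")
```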

Cost comparison: default TTL vs 24-hour window

Per-query cost with a warm cache (default TTL, cache stays warm via query frequency)

Claude 5-min TTL: (6,224 × $0.10 + 17 × $1.00 + 441 × $5.00) / 1M = $0.00284
OpenAI auto: (5,376 × $0.25 + 254 × $2.50 + 399 × $15.00) / 1M = $0.00796
Gemini implicit: (5,097 × $0.00 + 691 × $0.30 + 638 × $2.50) / 1M = $0.00180
Gemini explicit 24-hr: (5,097 × $0.03 + 691 × $0.30 + 638 × $2.50) / 1M = $0.00196

Per-query cost without cache

Claude: (6,241 × $1.00 + 480 × $5.00) / 1M = $0.00864
OpenAI: (5,657 × $2.50 + 399 × $15.00) / 1M = $0.02013
Gemini: (5,787 × $0.30 + 638 × $2.50) / 1M = $0.00333

1-hour window (42 queries):

Provider	Mode	Without Cache	With Cache	Storage (1 hr)	Total (cached)	Savings
Claude Haiku	1-hr TTL	$0.363	$0.129	$0.00	$0.129	$0.234
OpenAI mini	auto	$0.845	$0.347	$0.00	$0.347	$0.498
Gemini Flash	implicit	$0.140	$0.077	$0.00	$0.077	$0.063
Gemini Flash	explicit	$0.140	$0.082	$0.006	$0.088	$0.052

Claude Haiku saves the most at 64.5%, followed by OpenAI mini at 58.9%. Gemini’s implicit cache delivers 45.0% savings with zero configuration overhead; explicit cache drops to 37.1% once the $0.006/hr storage cost is factored in.

24-hour window (1,000 queries):

Provider	Mode	No Cache/day	With Cache/day	Storage/day	Total/day	Annual Savings
Claude Haiku	5-min TTL	$8.64	$2.85	$0	$2.85	~$2,114
Claude Haiku	1-hr TTL	$8.64	$2.86	$0	$2.86	~$2,110
OpenAI mini	auto (up to 24 hr)	$20.13	$7.96	$0	$7.96	~$4,442
Gemini Flash	implicit (free)	$3.33	$1.80	$0	$1.80	~$559
Gemini Flash	explicit 24-hr TTL	$3.33	$1.96	$0.15	$2.11	~$445

A few things stand out from this comparison. Claude’s 1-hour TTL adds only a fraction of a cent per day over the 5-minute default for high-frequency workloads, because the write cost difference ($1.25/M vs $2.00/M) only matters on the one cold-start write per day. OpenAI’s 24-hour retention costs nothing extra. And for Gemini, the implicit cache (free) actually beats the explicit 24-hour cache for this context size - the $0.15/day storage eats into the read savings. Explicit cache becomes cost-effective on Gemini only when the context is large enough (typically above 50K tokens) that the storage cost is justified by the reliable cache hit rate.
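
As a worked example, the Claude 5-minute row can be rebuilt from the per-query token counts, assuming one cache write followed by 999 warm reads per day. The function and variable names are illustrative, and small differences from the table's annual figure come from rounding.

```python
def per_call(uncached, cache_write, cache_read, output, prices):
    """Cost of one call in USD; prices are USD per 1M tokens."""
    inp, write, read, out = prices
    return (uncached * inp + cache_write * write
            + cache_read * read + output * out) / 1e6

HAIKU = (1.00, 1.25, 0.10, 5.00)  # input, 5-min cache write, cache read, output

no_cache = per_call(6241, 0, 0, 480, HAIKU)
cold = per_call(17, 6224, 0, 441, HAIKU)   # first call of the day writes the cache
warm = per_call(17, 0, 6224, 441, HAIKU)   # every later call reads it

QUERIES_PER_DAY = 1000
daily_no_cache = QUERIES_PER_DAY * no_cache
daily_cached = cold + (QUERIES_PER_DAY - 1) * warm
annual_savings = (daily_no_cache - daily_cached) * 365

print(f"${daily_no_cache:.2f}/day -> ${daily_cached:.2f}/day, "
      f"~${annual_savings:,.0f}/year saved")
```

Raising the number of cold starts per day only changes how many calls are billed at the `cold` rate, which is why the TTL choice barely moves the daily total at this query frequency.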

These are per-workload numbers. A production system typically has several distinct document contexts - a different legal template per client, a different codebase per repository, a different persona per product line. Each one is a separate cache entry, and the savings stack.

Choosing Your Strategy

No single approach fits every case. Here is a practical decision guide:

Use Claude’s explicit caching when

You need full observability into cache hits and misses (the API returns exact token counts per category)
You are managing costs tightly and want to predict billing before a call, not discover it after
You have multiple cacheable segments (Claude supports up to 4 cache breakpoints per request)
You need the cache to persist up to 1 hour and want guaranteed behavior
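
To illustrate multiple cacheable segments, here is a sketch that builds a request body with two breakpoints and checks it against the 4-breakpoint limit. The `mark` and `count_breakpoints` helpers are hypothetical; the `cache_control` field is the same one used in the Claude example earlier, and the placeholder texts stand in for real content.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above

def mark(block: dict) -> dict:
    """Return a copy of a content block flagged as a cache breakpoint."""
    return {**block, "cache_control": {"type": "ephemeral"}}

def count_breakpoints(request: dict) -> int:
    """Count cache_control markers across tools, system, and messages."""
    blocks = list(request.get("tools", [])) + list(request.get("system", []))
    for msg in request.get("messages", []):
        if isinstance(msg["content"], list):
            blocks += msg["content"]
    return sum(1 for b in blocks if "cache_control" in b)

request = {
    "model": "claude-haiku-4-5",
    "max_tokens": 512,
    "system": [
        mark({"type": "text", "text": "<shared persona and instructions>"}),
        mark({"type": "text", "text": "<contract text>"}),
    ],
    "messages": [{"role": "user", "content": "What are the termination clauses?"}],
}

if count_breakpoints(request) > MAX_BREAKPOINTS:
    raise ValueError("too many cache breakpoints in one request")
# The dict can then be passed along as client.messages.create(**request).
```

Separate breakpoints let the persona stay cached even when the document segment changes between clients.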

Use OpenAI’s automatic caching when

You want zero implementation overhead - no code changes, no annotations
You are already on OpenAI and your prompts exceed 1,024 tokens with a stable prefix
You can accept less transparency in exchange for simplicity
Note: OpenAI caches in 128-token blocks, so very short prompts or frequently-changing prefixes may see inconsistent savings

Use Gemini’s implicit caching when

Cost is your primary concern - implicit cache hits are genuinely free
You are using Gemini 2.5 Flash for tasks with repetitive context (customer support, document review)
You want savings with no configuration at all

Use Gemini’s explicit caching when

Your context is large (typically 50K+ tokens) where the $1.00/M tokens/hr storage cost is justified by the reliable cache hit rate
You need deterministic cache behavior across sessions or beyond the implicit warm-up period (implicit only activates after a few repeated calls)
You can manage a CachedContent object and its TTL, and need cache to persist even during low-traffic periods

Where Caching Gets Complicated in Agentic Workflows

Single-turn Q&A is the easy case. Agentic workflows surface three problems that simple prompt caching does not solve.

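Problem 1: History edits break the cache prefix. Every turn in a multi-step agent appends new content to the conversation. Appending by itself leaves the earlier tokens untouched, so the existing prefix can still be read from cache, but the cacheable region keeps moving forward and has to be re-marked or re-warmed as it grows. Anything that changes earlier turns - inserting a message mid-history, trimming, reordering, or summarizing - alters the prefix and invalidates the cache from that point on, even if your system prompt is identical. You need infrastructure that checkpoints history in a way that caches the stable portion while correctly billing the dynamic portion.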
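
One way to checkpoint a growing history, assuming Anthropic-style messages whose content is a list of blocks, is to keep a single marker on the newest turn and move it forward on every call. The `with_rolling_breakpoint` helper below is a hypothetical sketch of that idea, not a prescribed pattern.

```python
import copy

def with_rolling_breakpoint(messages: list[dict]) -> list[dict]:
    """Copy the history, clear old markers, and tag the final block."""
    out = copy.deepcopy(messages)
    for msg in out:
        # Normalize plain-string content into a list of text blocks.
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        for block in msg["content"]:
            block.pop("cache_control", None)
    if out:
        out[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out

history = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"},
]
turn_1 = with_rolling_breakpoint(history)

# Next turn: append the new message, then move the marker to the end again.
turn_2 = with_rolling_breakpoint(turn_1 + [{"role": "assistant", "content": "A2"}])
```

Because the earlier turns are passed through unmodified, their text stays byte-identical between calls, which is the property the cache depends on.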

Problem 2: Context fan-out in parallel agents. When a coordinator dispatches work to five sub-agents in parallel, each sub-agent may receive an identical copy of the shared context. Without cache sharing, you pay 5× for the same tokens. With it, you pay 1×. But sharing requires your runtime to understand which calls share a logical context, not just an identical prefix string.

Problem 3: Cache efficiency is invisible. You do not know what percentage of your runs are hitting the cache versus writing it. Across hundreds of concurrent runs, cache efficiency is the difference between the cost you expected and the bill you receive. Without per-run telemetry, you are flying blind.

These are solvable problems, but they require the execution layer to participate in cache management - not just the application code.

How AGNT5 Handles This

AGNT5’s durable runtime is designed around the assumption that production workloads are multi-step, concurrent, and expensive - and that token spend needs to be observable down to the individual function call.

Stable context, automatically. AGNT5 workflows separate durable state (what the workflow knows) from transient invocation inputs (what this particular step needs). The system prompt and shared context are attached once at the worker level; each step invocation sends only the delta. This means the stable prefix is always the same across calls, and prompt caching activates reliably without application-level management.

Per-step token attribution. Every LLM call made through AGNT5’s SDK records tokens_in, tokens_out, cache_read_tokens, cache_write_tokens, and cost_usd in the run journal. You can query cache efficiency across runs, across time, across models - directly from the AGNT5 Studio or via SQL over the archived Parquet files.

SELECT
    step_name,
    SUM(cache_read_tokens)  AS cached_tokens,
    SUM(input_tokens)       AS uncached_tokens,
    ROUND(100.0 * SUM(cache_read_tokens) /
          NULLIF(SUM(cache_read_tokens) + SUM(input_tokens), 0), 1) AS cache_hit_pct,
    SUM(cost_usd)           AS spend_usd
FROM journal_entries
WHERE run_id LIKE 'proj_%'
  AND completed_at_ms >= $since
GROUP BY step_name
ORDER BY spend_usd DESC;

Provider-agnostic. The same AGNT5 workflow definition works against Claude, OpenAI, or Gemini. Switching providers to compare cost does not require rewriting caching logic. The runtime adapts to each provider’s cache API transparently.

If you are building workflows that process repetitive context at scale - document review, code analysis, customer support, research pipelines - AGNT5 gives you prompt caching that is correct by default, observable always, and not tied to a single provider.

Try AGNT5 →

Summary

Prompt caching is one of the highest-leverage cost optimizations available to AI developers, and all three major providers support it. The implementation approaches are quite different.

Claude requires explicit annotation, gives you full visibility, and is the right choice when you need predictable, observable cache behavior.
OpenAI requires nothing - keep your prefix stable, and caching is automatic with ~75–90% savings on cached tokens.
Gemini offers two modes: zero-cost implicit caching that activates automatically, and explicit caching for large contexts (typically 50K+ tokens) that need deterministic hits.

Across our benchmark - a 4,500-token legal contract analyzed five ways - prompt caching cut session costs by 23–51% depending on the provider. At 1,000 queries per day, that translates to $559–$4,442 saved annually on a single document context. The savings scale linearly with volume.

The threshold for action is low. For Claude, it is one annotation. For OpenAI, it is nothing. For Gemini implicit, it is also nothing. The question is not whether to enable prompt caching - it is whether your execution layer is giving you enough visibility to know if it is working.

Pricing data as of June 2026. All numbers from third-party pricing aggregators - verify directly on Anthropic pricing, OpenAI pricing, and Google AI pricing before making budget decisions. Production-scale calculations assume 1,000 queries/day with a 6,220-token shared context, one cache write per day, and 999 cache reads.
