We ran identical legal document analysis tasks against Claude, OpenAI, and Gemini - with and without prompt caching. Here are the exact token counts, latency numbers, and cost math at every scale.
Most AI applications have a silent money leak. Every API call re-sends the same system prompt - or the same retrieved document, or the same 50-message conversation history - from scratch. The provider tokenizes it, loads it into memory, and processes it again. You pay full price every time.
Prompt caching eliminates that. You pay once to warm the cache, then a fraction of the cost on every subsequent call that shares that prefix. On a high-volume workload, the savings compound fast. 50% off is not unusual, and it often comes from a single annotation change.
This post walks through how prompt caching actually works, how each major provider implements it, and what the numbers look like against a real benchmark. All data comes from running identical tests across claude-haiku-4-5, gpt-5.4-mini, and gemini-2.5-flash.
What Is Prompt Caching?
Every time an LLM processes your input, it builds a KV (key-value) cache - a set of attention tensors computed from your tokens. Normally, that cache is thrown away when the response completes. Prompt caching keeps it alive between requests.
When your next call starts with the same prefix, the provider skips recomputing those tensors. The cached tokens are served from memory at a fraction of the compute cost - typically 10% of standard input pricing. Providers call these “cache hits,” and they show up as a distinct token category in API responses.
The prerequisite is a stable prefix. The portion you want cached - your system prompt, a retrieved document, a conversation history - must appear at the start of every call without modification. Dynamic content (the user’s question, tool call results) can follow after the cached portion; those still run at standard rates.
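The ordering rule can be sketched in a few lines. This is a minimal illustration, not provider code: `build_messages` and `prefix_fingerprint` are hypothetical helpers, and the SHA-256 fingerprint is just one convenient way to confirm that the cacheable portion is byte-identical across calls.

```python
import hashlib


def build_messages(stable_context: str, question: str) -> list[dict]:
    """Put the cacheable content first and the per-call content last."""
    return [
        {"role": "system", "content": stable_context},  # identical on every call
        {"role": "user", "content": question},          # changes per call
    ]


def prefix_fingerprint(messages: list[dict]) -> str:
    """Hash the system content so drift in the cached prefix is easy to spot."""
    system_text = "".join(m["content"] for m in messages if m["role"] == "system")
    return hashlib.sha256(system_text.encode("utf-8")).hexdigest()


contract = "MASTER SERVICE AGREEMENT ..."  # placeholder for the real document
call_a = build_messages(contract, "What are the termination clauses?")
call_b = build_messages(contract, "What are the payment terms?")
# A timestamp placed ahead of the document makes every prefix unique.
call_c = build_messages("Today is 2026-06-01.\n" + contract, "What are the payment terms?")

same_prefix = prefix_fingerprint(call_a) == prefix_fingerprint(call_b)
drifted = prefix_fingerprint(call_a) != prefix_fingerprint(call_c)
```

Logging the fingerprint next to each request makes it easy to tell a genuinely cold cache from a prefix that changed by accident.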
What can be cached
- System prompts and personas
- Long documents loaded via RAG
- Repeated tool schemas in agentic pipelines
- Conversation history up to a checkpoint
Minimum token thresholds
- Anthropic Claude: varies by model - claude-haiku-4-5 requires 4,096 tokens; claude-sonnet-4-6 and claude-opus-4-8 require 1,024 tokens
- OpenAI: 1,024 tokens (automatic, no configuration needed)
- Google Gemini 2.5 Flash implicit: 2,048 tokens
- Google Gemini 2.5 Flash explicit: 2,048 tokens
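As a pre-flight check, these thresholds can be encoded in a lookup table. This is a minimal sketch: `MIN_CACHE_TOKENS` and `is_cacheable` are hypothetical names, and the values are simply the ones listed above, which providers may change.

```python
# Minimum cacheable prefix sizes, copied from the list above (assumed current).
MIN_CACHE_TOKENS = {
    "claude-haiku-4-5": 4096,
    "claude-sonnet-4-6": 1024,
    "openai": 1024,
    "gemini-2.5-flash": 2048,
}


def is_cacheable(target: str, prefix_tokens: int) -> bool:
    """Return True when the stable prefix meets the target's minimum size."""
    return prefix_tokens >= MIN_CACHE_TOKENS[target]


short_prompt_ok = is_cacheable("claude-haiku-4-5", 3000)  # below the 4,096 minimum
benchmark_ok = is_cacheable("claude-haiku-4-5", 6224)     # size of the cached prefix in our benchmark
```

A prefix below the minimum is simply processed at the standard rate, so a check like this is mostly useful for explaining why an expected cache hit never shows up.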
How Each Provider Implements It
The mechanics differ significantly across providers. Here is the implementation pattern for each.
Claude - Explicit Annotation
Anthropic requires you to mark the content you want cached. You add a cache_control field to any message or system block.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": contract_text,  # the 4,500-token document
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": question}],
)

# Usage breakdown in the response
u = response.usage
cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(u, "cache_read_input_tokens", 0) or 0
uncached = u.input_tokens  # tokens after the last cache breakpoint

On the first call, cache_creation_input_tokens shows the tokens written to cache. On all subsequent calls that share the same prefix, cache_read_input_tokens shows the tokens served from cache at 90% off standard pricing.
This explicitness is Claude’s main advantage: you always know exactly what was cached, how many tokens it consumed, and whether the cache was hit or missed.
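Those three counters are enough to label every call. The sketch below assumes two hypothetical helpers, `classify_call` and `cached_fraction`, that take the values read from `response.usage` as plain integers; the sample numbers are Q1 and Q2 from our benchmark.

```python
def classify_call(cache_write: int, cache_read: int) -> str:
    """Label one response using the usage counters described above."""
    if cache_read > 0:
        return "hit"
    if cache_write > 0:
        return "write"
    return "miss"  # e.g. prefix below the minimum size, or no cache_control set


def cached_fraction(cache_write: int, cache_read: int, uncached: int) -> float:
    """Share of all input tokens that were served from the cache."""
    total = cache_write + cache_read + uncached
    return cache_read / total if total else 0.0


q1_label = classify_call(cache_write=6224, cache_read=0)
q2_label = classify_call(cache_write=0, cache_read=6224)
q2_fraction = cached_fraction(cache_write=0, cache_read=6224, uncached=16)
```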
OpenAI - Automatic Prefix Caching
OpenAI does not require any code changes. Prompts of 1,024 tokens or more are automatically cached when you send the same prefix across multiple requests. The only requirement is that the prompt prefix (here, the system prompt) remains byte-identical across calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a legal analyst. Analyze the following service agreement "
    "and answer questions accurately.\n\n" + contract_text
)

response = client.chat.completions.create(
    model="gpt-5.4-mini",
    max_completion_tokens=512,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
)

# Cache hits show up in prompt_tokens_details
u = response.usage
details = u.prompt_tokens_details  # may be missing on some responses
cached_tokens = (details.cached_tokens if details else 0) or 0
uncached_tokens = u.prompt_tokens - cached_tokens

The first call gets a full-price cache miss. Starting from the second call, cached_tokens reflects tokens served at the discounted rate. No annotation, no setup - it just works when the prefix is stable.
The trade-off: you cannot force a cache miss, cannot inspect whether a particular call contributed to the cache, and cannot predict the exact cache boundary.
Google Gemini - Two Modes
Gemini offers both automatic and explicit caching.
Implicit caching is always on - Gemini caches frequently-seen prompt prefixes automatically at no additional cost. The pricing for implicitly cached tokens is zero:
from google import genai

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=question,
    config=genai.types.GenerateContentConfig(
        system_instruction="You are a legal analyst...",
    ),
)

meta = response.usage_metadata
cached_tokens = meta.cached_content_token_count or 0  # free if implicit

Explicit caching lets you pre-upload a document as a CachedContent object with a TTL. The content must meet the explicit-cache minimum listed above (2,048 tokens for Gemini 2.5 Flash), but in return you get deterministic cache hits and a lower per-token cost on reads:
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=genai.types.CreateCachedContentConfig(
        system_instruction="You are a legal analyst...",
        contents=[large_document],
        ttl="3600s",
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=question,
    config=genai.types.GenerateContentConfig(
        cached_content=cache.name,
    ),
)

After you are done, call client.caches.delete(name=cache.name) to avoid storage charges.
The Benchmark
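To produce comparable numbers, we ran the same workload across all three providers: a nominally 4,500-token Master Service Agreement (an MSA between two enterprise companies), analyzed with five legal questions. Each provider’s tokenizer counts the same text differently, which is why the token totals in the tables below vary by provider.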
Questions
- What are the termination clauses in this agreement?
- Summarize the liability and indemnification provisions.
- Are there any auto-renewal terms? What are the notice periods?
- What are the payment terms and late payment penalties?
- Identify any non-compete or non-solicitation clauses.
Each question was sent as a separate API call with the full contract in the system prompt. The “no cache” run used a random UUID injected into each prompt to prevent any implicit caching from activating. The “with cache” run used the implementations above.
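The cache-busting trick looks roughly like this. It is a sketch under one assumption worth stating loudly: the unique ID goes at the very start of the prompt, which is the placement that changes the prefix from the first token. `build_system_prompt` is a hypothetical helper, not the benchmark harness itself.

```python
import uuid


def build_system_prompt(contract_text: str, bust_cache: bool) -> str:
    """Return the system prompt; a leading UUID makes every prefix unique."""
    header = "You are a legal analyst.\n\n"
    if bust_cache:
        # Assumption: the ID is placed first so that no prefix is ever shared.
        header = f"[run-id: {uuid.uuid4()}]\n" + header
    return header + contract_text


no_cache_a = build_system_prompt("CONTRACT TEXT", bust_cache=True)
no_cache_b = build_system_prompt("CONTRACT TEXT", bust_cache=True)
cached_a = build_system_prompt("CONTRACT TEXT", bust_cache=False)
cached_b = build_system_prompt("CONTRACT TEXT", bust_cache=False)
```

An ID appended after the document would leave the shared prefix intact, so automatic caches could still match it.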
Token Results
Here are the raw numbers from the benchmark runs.
Claude claude-haiku-4-5
| Question | Input Tokens | Cache Write | Cache Read | Uncached | Output | Latency |
|---|---|---|---|---|---|---|
| Q1 (first call) | 6,240 | 6,224 | 0 | 16 | 512 | 5.83s |
| Q2 | 6,240 | 0 | 6,224 | 16 | 512 | 5.72s |
| Q3 | 6,243 | 0 | 6,224 | 19 | 395 | 4.30s |
| Q4 | 6,239 | 0 | 6,224 | 15 | 352 | 3.97s |
| Q5 | 6,243 | 0 | 6,224 | 19 | 434 | 7.29s |
| Total | 31,205 | 6,224 | 24,896 | 85 | 2,205 | 5.42s avg |
Baseline without cache - 31,205 input tokens, 2,398 output tokens, 6.30s average.
The cache writes 6,224 tokens once on Q1 and reads them back for Q2–Q5. Uncached tokens per question are just the question text itself - a negligible 15–19 tokens.
OpenAI gpt-5.4-mini
| Question | Cached Tokens | Uncached | Output | Latency |
|---|---|---|---|---|
| Q1 (cache miss) | 0 | 5,627 | 512 | 3.64s |
| Q2 | 5,376 | 252 | 512 | 3.88s |
| Q3 | 5,376 | 257 | 368 | 3.07s |
| Q4 | 5,376 | 252 | 323 | 3.69s |
| Q5 | 5,376 | 254 | 275 | 2.19s |
| Total | 21,504 | 6,642 | 1,990 | 3.29s avg |
Baseline without cache - 28,284 input tokens, 1,995 output tokens, 3.23s average.
Notice that OpenAI caches 5,376 tokens (a multiple of 128 - that is the cache granularity boundary) rather than the full prompt length of ~5,627. The prefix is cached in fixed-size blocks.
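The block arithmetic can be sketched as follows. Both constants come from the text above; the 5,400-token shared prefix in the example is a hypothetical value chosen for illustration, since the exact length of the shared portion is not something the API reports.

```python
CACHE_MIN_TOKENS = 1024   # minimum prompt size for caching, as stated above
CACHE_BLOCK_TOKENS = 128  # granularity observed in the benchmark


def expected_cached_tokens(shared_prefix_tokens: int) -> int:
    """Round the shared prefix down to a whole number of 128-token blocks."""
    if shared_prefix_tokens < CACHE_MIN_TOKENS:
        return 0
    return (shared_prefix_tokens // CACHE_BLOCK_TOKENS) * CACHE_BLOCK_TOKENS


cached = expected_cached_tokens(5400)      # hypothetical shared-prefix length
too_short = expected_cached_tokens(1000)   # under the minimum, nothing is cached
```

Any shared prefix between 5,376 and 5,503 tokens would produce the 5,376 cached tokens seen in the table.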
Google Gemini gemini-2.5-flash (Implicit)
| Question | Cached Tokens | Uncached | Output | Latency |
|---|---|---|---|---|
| Q1 | 0 | 5,785 | 1,048 | 12.42s |
| Q2 | 0 | 5,785 | 810 | 11.98s |
| Q3 | 0 | 5,790 | 439 | 8.67s |
| Q4 | 5,097 (free) | 689 | 326 | 5.76s |
| Q5 | 5,097 (free) | 693 | 269 | 5.63s |
| Total | 10,194 | 18,742 | 2,892 | 8.89s avg |
Baseline without cache - 28,936 input tokens, 3,192 output tokens, 9.31s average.
Gemini’s implicit caching kicks in after Q3 - it needs a few repetitions of the same prefix to warm the cache. Tokens served from the implicit cache cost exactly zero. The latency drop on Q4 and Q5 is also visible: from 8–12s down to ~5.7s.
One observation worth noting: even in the “no cache” baseline run designed to prevent caching, Gemini still reported cached tokens on some questions because the previous test had warmed the implicit cache. Implicit caching cannot be switched off, so hits carried over from an earlier session can leak into a baseline run.
The Cost Math
Pricing used (approximate as of June 2026 - verify on official provider pricing pages before making budgeting decisions)
| Provider | Model | Input $/1M | Cache Write $/1M | Cache Read $/1M | Output $/1M |
|---|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | $1.00 | $1.25 | $0.10 | $5.00 |
| OpenAI | gpt-5.4-mini | $2.50 | - | $0.25 | $15.00 |
| Google | gemini-2.5-flash | $0.30 | - | $0.03 (or free*) | $2.50 |
*Implicit cache hits on Gemini are free. Explicit cache reads are $0.03/1M tokens, with storage at $1.00/1M tokens/hour.
Session-Level Costs (5 questions on one document)
Claude claude-haiku-4-5
Without cache
- Input: 31,205 × $1.00/M = $0.0312
- Output: 2,398 × $5.00/M = $0.0120
- Total: $0.0432
With cache
- Cache write: 6,224 × $1.25/M = $0.0078
- Cache reads: 24,896 × $0.10/M = $0.0025
- Uncached input: 85 × $1.00/M = $0.0001
- Output: 2,205 × $5.00/M = $0.0110
- Total: $0.0214
→ 50.5% savings. One cache_control field.
OpenAI gpt-5.4-mini
Without cache
- Input: 28,284 × $2.50/M = $0.0707
- Output: 1,995 × $15.00/M = $0.0299
- Total: $0.1006
With cache
- Cached input: 21,504 × $0.25/M = $0.0054
- Uncached input: 6,642 × $2.50/M = $0.0166
- Output: 1,990 × $15.00/M = $0.0299
- Total: $0.0519
→ 48.4% savings. Zero code changes.
Google gemini-2.5-flash (implicit)
Without cache
- Input: 28,936 × $0.30/M = $0.0087
- Output: 3,192 × $2.50/M = $0.0080
- Total: $0.0167
With implicit cache
- Uncached input: 18,742 × $0.30/M = $0.0056
- Cached input: 10,194 × free = $0.0000
- Output: 2,892 × $2.50/M = $0.0072
- Total: $0.0128
→ 23.4% savings. Completely automatic, cached tokens are free.
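The session arithmetic above is easy to reproduce. The sketch below uses the token counts and prices from this post; `session_cost` and `savings_pct` are hypothetical helper names. Because it works from unrounded line items, its percentages can differ by a few tenths of a point from figures computed on rounded totals.

```python
def session_cost(line_items: list[tuple[int, float]]) -> float:
    """Sum token_count * dollars_per_million over every billing category."""
    return sum(tokens * rate for tokens, rate in line_items) / 1_000_000


def savings_pct(without_cache: float, with_cache: float) -> float:
    """Percentage saved relative to the uncached session."""
    return 100.0 * (1.0 - with_cache / without_cache)


# (tokens, $ per 1M tokens) pairs taken from the tables above
claude_plain = session_cost([(31_205, 1.00), (2_398, 5.00)])
claude_cached = session_cost([(6_224, 1.25), (24_896, 0.10), (85, 1.00), (2_205, 5.00)])

openai_plain = session_cost([(28_284, 2.50), (1_995, 15.00)])
openai_cached = session_cost([(21_504, 0.25), (6_642, 2.50), (1_990, 15.00)])

gemini_plain = session_cost([(28_936, 0.30), (3_192, 2.50)])
gemini_cached = session_cost([(18_742, 0.30), (10_194, 0.00), (2_892, 2.50)])

for name, plain, cached in [
    ("claude", claude_plain, claude_cached),
    ("openai", openai_plain, openai_cached),
    ("gemini", gemini_plain, gemini_cached),
]:
    print(f"{name}: ${plain:.4f} -> ${cached:.4f} ({savings_pct(plain, cached):.1f}% saved)")
```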
At Production Scale
To understand real-world impact, model a production workload: 1,000 queries per day, all sharing the same 6,220-token system context.
TTL and 24-hour storage costs
Before looking at the numbers, it is worth understanding how each provider handles cache lifetime - this directly affects how many cache writes you pay for across a full day.
| Provider | Default TTL | 24-hr option | Storage fee | Write cost multiplier |
|---|---|---|---|---|
| Anthropic Claude | 5 min (or 1 hr opt-in) | Not supported (max 1 hr) | None | 1.25x (5 min) / 2x (1 hr) |
| OpenAI | Up to 1 hr inactive | Yes, on GPT-5.5/5.4 - free | None | Built into base price |
| Gemini implicit | Automatic | No explicit control | None | N/A |
| Gemini explicit | 1 hr (default) | Yes, configurable | $1.00/M tokens/hr | N/A |
Claude - Each cache read refreshes the TTL timer at no extra cost. For 1,000 evenly-spread queries per day (one every ~86 seconds), the 5-minute default TTL stays warm continuously - effectively one write per day. For overnight gaps or bursty patterns, you would pay one additional write per cold start. Using the 1-hour TTL costs 2x per write instead of 1.25x but reduces cold-start risk.
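How many writes a traffic pattern costs can be estimated with a small simulation. This sketch assumes the behavior described above (every read or write resets the TTL timer); `count_cache_writes` is a hypothetical helper and the traffic shapes are illustrative.

```python
def count_cache_writes(query_times_s: list[float], ttl_s: float) -> int:
    """Count cold starts when every cache read or write resets the TTL timer."""
    writes = 0
    expires_at = None
    for t in sorted(query_times_s):
        if expires_at is None or t >= expires_at:
            writes += 1  # cache was cold, so this call pays the write price
        expires_at = t + ttl_s  # the timer restarts on every call
    return writes


# 1,000 queries spread evenly over 24 hours (one every 86.4 seconds)
even_day = [i * 86.4 for i in range(1000)]
writes_even = count_cache_writes(even_day, ttl_s=300)

# Two days of 1,000 queries each, all sent between 09:00 and 17:00
office_hours = [d * 86_400 + 9 * 3600 + i * 28.8 for d in range(2) for i in range(1000)]
writes_gap = count_cache_writes(office_hours, ttl_s=300)

# One query every 10 minutes never finds a warm 5-minute cache
sparse = [i * 600 for i in range(6)]
writes_sparse = count_cache_writes(sparse, ttl_s=300)
writes_sparse_1h = count_cache_writes(sparse, ttl_s=3600)
```

The sparse case is where the 1-hour TTL earns its higher write price: six writes collapse into one.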
OpenAI - With 24-hour retention on newer models there are no additional fees. The cache persists across overnight gaps with zero change to the cost formula.
Gemini explicit - 24-hour storage cost. Storing 6,224 tokens for 24 hours:
6,224 tokens / 1,000,000 × $1.00/hr × 24 hrs = $0.149/day in storage
This storage cost applies on top of the per-query cost. It is easy to justify for very large contexts (100K+ tokens), where explicit caching is what guarantees a hit, but meaningful for small ones where implicit caching is free.
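The storage formula, plus a break-even check, can be sketched as below. The rates are the ones in the pricing table above; `storage_cost_usd` and `breakeven_queries_per_hour` are hypothetical helpers. The break-even compares explicit caching against sending the context uncached. It does not model implicit caching, whose hit rate is not under your control.

```python
def storage_cost_usd(tokens: int, hours: float, rate_per_m_hr: float = 1.00) -> float:
    """Storage fee for keeping `tokens` in an explicit cache for `hours`."""
    return tokens / 1_000_000 * rate_per_m_hr * hours


def breakeven_queries_per_hour(
    input_rate: float = 0.30,    # $/1M uncached input tokens
    cached_rate: float = 0.03,   # $/1M explicit cache reads
    storage_rate: float = 1.00,  # $/1M tokens per hour
) -> float:
    """Queries per hour at which read savings equal the hourly storage fee."""
    return storage_rate / (input_rate - cached_rate)


daily_storage = storage_cost_usd(6_224, hours=24)
breakeven = breakeven_queries_per_hour()
```

Context size cancels out of this particular comparison: under these prices the break-even is a little under four queries per hour whether the cached context is 6K or 600K tokens.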
Cost comparison: default TTL vs 24-hour window
Per-query cost with a warm cache (default TTL, cache stays warm via query frequency)
- Claude 5-min TTL: (6,224 × $0.10 + 17 × $1.00 + 441 × $5.00) / 1M = $0.00284
- OpenAI auto: (5,376 × $0.25 + 254 × $2.50 + 399 × $15.00) / 1M = $0.00796
- Gemini implicit: (5,097 × $0.00 + 691 × $0.30 + 638 × $2.50) / 1M = $0.00180
- Gemini explicit 24-hr: (5,097 × $0.03 + 691 × $0.30 + 638 × $2.50) / 1M = $0.00196
Per-query cost without cache
- Claude: (6,241 × $1.00 + 480 × $5.00) / 1M = $0.00864
- OpenAI: (5,657 × $2.50 + 399 × $15.00) / 1M = $0.02013
- Gemini: (5,787 × $0.30 + 638 × $2.50) / 1M = $0.00333
1-hour window (42 queries):
| Provider | Mode | Without Cache | With Cache | Storage (1 hr) | Total (cached) | Savings |
|---|---|---|---|---|---|---|
| Claude Haiku | 1-hr TTL | $0.363 | $0.129 | $0.00 | $0.129 | $0.234 |
| OpenAI mini | auto | $0.845 | $0.347 | $0.00 | $0.347 | $0.498 |
| Gemini Flash | implicit | $0.140 | $0.077 | $0.00 | $0.077 | $0.063 |
| Gemini Flash | explicit | $0.140 | $0.082 | $0.006 | $0.088 | $0.052 |
In percentage terms, Claude Haiku saves the most at 64.5%, followed by OpenAI mini at 58.9%; in absolute dollars, OpenAI mini saves the most ($0.498 per hour). Gemini’s implicit cache delivers 45.0% savings with zero configuration overhead; explicit cache drops to 37.1% once the $0.006/hr storage cost is factored in.
24-hour window (1,000 queries):
| Provider | Mode | No Cache/day | With Cache/day | Storage/day | Total/day | Annual Savings |
|---|---|---|---|---|---|---|
| Claude Haiku | 5-min TTL | $8.64 | $2.85 | $0 | $2.85 | ~$2,114 |
| Claude Haiku | 1-hr TTL | $8.64 | $2.86 | $0 | $2.86 | ~$2,110 |
| OpenAI mini | auto (up to 24 hr) | $20.13 | $7.96 | $0 | $7.96 | ~$4,442 |
| Gemini Flash | implicit (free) | $3.33 | $1.80 | $0 | $1.80 | ~$559 |
| Gemini Flash | explicit 24-hr TTL | $3.33 | $1.96 | $0.15 | $2.11 | ~$445 |
A few things stand out from this comparison. Claude’s 1-hour TTL adds only a fraction of a cent per day over the 5-minute default for high-frequency workloads, because the write cost difference ($1.25/M vs $2.00/M) only matters on the one cold-start write per day. OpenAI’s 24-hour retention costs nothing extra. And for Gemini, the implicit cache (free) actually beats the explicit 24-hour cache for this context size - the $0.15/day storage eats into the read savings. Explicit cache becomes cost-effective on Gemini only when the context is large enough (typically above 50K tokens) that the storage cost is justified by the reliable cache hit rate.
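The daily and annual figures follow from the per-query costs plus the stated assumption of one cache write and 999 reads per day. Here is a minimal sketch for the Claude 5-minute row; `RATES` and `cost` are hypothetical names, and the token counts are the per-query averages used earlier in this section.

```python
# $ per 1M tokens for claude-haiku-4-5, from the pricing table above
RATES = {"input": 1.00, "cache_write": 1.25, "cache_read": 0.10, "output": 5.00}


def cost(tokens: dict[str, int]) -> float:
    """Dollar cost of one call given its token count per billing category."""
    return sum(RATES[kind] * n for kind, n in tokens.items()) / 1_000_000


warm_query = cost({"cache_read": 6_224, "input": 17, "output": 441})
cold_query = cost({"cache_write": 6_224, "input": 17, "output": 441})
uncached_query = cost({"input": 6_241, "output": 480})

queries_per_day = 1_000
cached_day = cold_query + (queries_per_day - 1) * warm_query  # 1 write + 999 reads
uncached_day = queries_per_day * uncached_query
annual_savings = (uncached_day - cached_day) * 365

print(f"cached: ${cached_day:.2f}/day, uncached: ${uncached_day:.2f}/day")
print(f"annual savings: ~${annual_savings:,.0f}")
```

Swapping in another provider’s rates and token counts gives the other rows; the result lands within a dollar or two of the table’s annual figure because the table rounds intermediate values.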
These are per-workload numbers. A production system typically has several distinct document contexts - a different legal template per client, a different codebase per repository, a different persona per product line. Each one is a separate cache entry, and the savings stack.
Choosing Your Strategy
No single approach fits every case. Here is a practical decision guide:
Use Claude’s explicit caching when
- You need full observability into cache hits and misses (the API returns exact token counts per category)
- You are managing costs tightly and want to predict billing before a call, not discover it after
- You have multiple cacheable segments (Claude supports up to 4 cache breakpoints per request)
- You need the cache to persist up to 1 hour and want guaranteed behavior
Use OpenAI’s automatic caching when
- You want zero implementation overhead - no code changes, no annotations
- You are already on OpenAI and your prompts exceed 1,024 tokens with a stable prefix
- You can accept less transparency in exchange for simplicity
- Note: OpenAI caches in 128-token blocks, so very short prompts or frequently-changing prefixes may see inconsistent savings
Use Gemini’s implicit caching when
- Cost is your primary concern - implicit cache hits are genuinely free
- You are using Gemini 2.5 Flash for tasks with repetitive context (customer support, document review)
- You want savings with no configuration at all
Use Gemini’s explicit caching when
- Your context is large (typically 50K+ tokens) where the $1.00/M tokens/hr storage cost is justified by the reliable cache hit rate
- You need deterministic cache behavior across sessions or beyond the implicit warm-up period (implicit only activates after a few repeated calls)
- You can manage a CachedContent object and its TTL, and need the cache to persist even during low-traffic periods
Where Caching Gets Complicated in Agentic Workflows
Single-turn Q&A is the easy case. Agentic workflows surface three problems that simple prompt caching does not solve.
Problem 1: History edits break the cache prefix. Every turn in a multi-step agent appends new content to the conversation. A pure append leaves the earlier prefix intact, but anything that changes content ahead of the cached portion - trimming old turns, summarizing history, injecting a timestamp, reordering tool results - changes the prefix, even if your system prompt is identical. A cache built on the first three turns is then invalidated on the fourth. You need infrastructure that checkpoints history in a way that caches the stable portion while correctly billing the dynamic portion.
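One way to see which history edits actually break the prefix is to compare consecutive requests message by message. This is a minimal sketch with a hypothetical `shared_prefix_len` helper and made-up messages: a pure append keeps every earlier message in the shared prefix, while trimming an old turn shifts everything after it.

```python
def shared_prefix_len(prev: list[dict], curr: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


system = {"role": "system", "content": "You are a legal analyst. <contract>"}
user_1 = {"role": "user", "content": "List the termination clauses."}
reply_1 = {"role": "assistant", "content": "Section 12 covers termination."}
tool_out = {"role": "user", "content": "Tool result: clause 12.3 text"}

turn_3 = [system, user_1, reply_1]
turn_4_append = turn_3 + [tool_out]            # history only grows
turn_4_trimmed = [system, reply_1, tool_out]   # oldest user turn dropped

kept_after_append = shared_prefix_len(turn_3, turn_4_append)
kept_after_trim = shared_prefix_len(turn_3, turn_4_trimmed)
```

Only the messages counted here can be served from cache on the next call; everything after the first difference is billed at the standard rate.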
Problem 2: Context fan-out in parallel agents. When a coordinator dispatches work to five sub-agents in parallel, each sub-agent may receive an identical copy of the shared context. Without cache sharing, you pay full price 5× for the same tokens. With it, you pay for one cache write and four discounted reads. But sharing requires your runtime to understand which calls share a logical context, not just an identical prefix string.
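Even with identical prefixes, requests fired at the same instant may all miss, because a cache entry typically becomes readable only after a first request has populated it. A common workaround is to run one call first and fan out the rest. The sketch below simulates that with a stand-in `fake_llm_call`; no provider SDK is involved and all names are hypothetical.

```python
import asyncio

warm_prefixes: set[str] = set()  # stand-in for the provider-side cache


async def fake_llm_call(shared_context: str, task: str) -> str:
    """Simulated provider call that reports whether the prefix was cached."""
    status = "read" if shared_context in warm_prefixes else "write"
    await asyncio.sleep(0)             # yield control, as a network call would
    warm_prefixes.add(shared_context)  # the entry exists once the call completes
    return status


async def naive_fan_out(context: str, tasks: list[str]) -> list[str]:
    """Dispatch everything at once; every call sees a cold cache."""
    return list(await asyncio.gather(*(fake_llm_call(context, t) for t in tasks)))


async def warm_then_fan_out(context: str, tasks: list[str]) -> list[str]:
    """Run the first task alone to warm the cache, then the rest in parallel."""
    first = await fake_llm_call(context, tasks[0])
    rest = await asyncio.gather(*(fake_llm_call(context, t) for t in tasks[1:]))
    return [first, *rest]


tasks = ["clause-1", "clause-2", "clause-3", "clause-4", "clause-5"]
naive = asyncio.run(naive_fan_out("contract-A", tasks))
warmed = asyncio.run(warm_then_fan_out("contract-B", tasks))
```

The trade-off is latency: the warm-up call adds one round trip before the parallel work starts.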
Problem 3: Cache efficiency is invisible. You do not know what percentage of your runs are hitting the cache versus writing it. Across hundreds of concurrent runs, cache efficiency is the difference between the cost you expected and the bill you receive. Without per-run telemetry, you are flying blind.
These are solvable problems, but they require the execution layer to participate in cache management - not just the application code.
How AGNT5 Handles This
AGNT5’s durable runtime is designed around the assumption that production workloads are multi-step, concurrent, and expensive - and that token spend needs to be observable down to the individual function call.
Stable context, automatically. AGNT5 workflows separate durable state (what the workflow knows) from transient invocation inputs (what this particular step needs). The system prompt and shared context are attached once at the worker level; each step invocation sends only the delta. This means the stable prefix is always the same across calls, and prompt caching activates reliably without application-level management.
Per-step token attribution. Every LLM call made through AGNT5’s SDK records tokens_in, tokens_out, cache_read_tokens, cache_write_tokens, and cost_usd in the run journal. You can query cache efficiency across runs, across time, across models - directly from the AGNT5 Studio or via SQL over the archived Parquet files.
SELECT
    step_name,
    SUM(cache_read_tokens) AS cached_tokens,
    SUM(input_tokens) AS uncached_tokens,
    ROUND(100.0 * SUM(cache_read_tokens) /
        NULLIF(SUM(cache_read_tokens) + SUM(input_tokens), 0), 1) AS cache_hit_pct,
    SUM(cost_usd) AS spend_usd
FROM journal_entries
WHERE run_id LIKE 'proj_%'
    AND completed_at_ms >= $since
GROUP BY step_name
ORDER BY spend_usd DESC;

Provider-agnostic. The same AGNT5 workflow definition works against Claude, OpenAI, or Gemini. Switching providers to compare cost does not require rewriting caching logic. The runtime adapts to each provider’s cache API transparently.
If you are building workflows that process repetitive context at scale - document review, code analysis, customer support, research pipelines - AGNT5 gives you prompt caching that is correct by default, observable always, and not tied to a single provider.
Summary
Prompt caching is one of the highest-leverage cost optimizations available to AI developers, and all three major providers support it. The implementation approaches are quite different.
- Claude requires explicit annotation, gives you full visibility, and is the right choice when you need predictable, observable cache behavior.
- OpenAI requires nothing - keep your prefix stable, and caching is automatic with ~75–90% savings on cached tokens.
- Gemini offers two modes: zero-cost implicit caching that activates automatically, and explicit caching, which pays off mainly for very large contexts (typically 50K+ tokens).
Across our benchmark - a 4,500-token legal contract analyzed five ways - prompt caching cut session costs by 23–51% depending on the provider. At 1,000 queries per day, that translates to $559–$4,442 saved annually on a single document context. The savings scale linearly with volume.
The threshold for action is low. For Claude, it is one annotation. For OpenAI, it is nothing. For Gemini implicit, it is also nothing. The question is not whether to enable prompt caching - it is whether your execution layer is giving you enough visibility to know if it is working.
Pricing data as of June 2026. All prices are taken from third-party pricing aggregators - verify directly on Anthropic pricing, OpenAI pricing, and Google AI pricing before making budget decisions. Production-scale calculations assume 1,000 queries/day with a 6,220-token shared context, one cache write per day, and 999 cache reads.