
How to Cache LLM API Calls

Key idea:

Caching can cut LLM cost by up to 10x. Three layers: (1) Exact match: hash the prompt, look it up in Redis, and on a hit return the stored response without an LLM call (free, instant). (2) Semantic cache: embed the prompt, search a vector DB for a similar query, and return the cached answer if similarity > 0.95. (3) Provider cache: Anthropic Prompt Caching makes cache hits ~90% cheaper, and OpenAI caches long prompt prefixes automatically. Combine all three for maximum savings.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Exact cache: hash = md5(model + prompt) → Redis
  2. TTL: 1 hour default, longer for stable prompts
  3. Semantic cache: embed prompt → ANN search in vector DB
  4. Threshold: similarity > 0.95 for reliability
  5. Anthropic: cache_control: {type: "ephemeral"} on system prompt
  6. OpenAI automatic cache: prefix > 1024 tokens → cached automatically
  7. Monitor hit rate: > 30% cache hit ratio — success

Working Examples

Redis exact cache (Node):

  const key = `llm:${model}:${crypto.createHash('md5').update(prompt).digest('hex')}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const response = await openai.chat.completions.create({...});
  await redis.setex(key, 3600, JSON.stringify(response));

Anthropic prompt cache:

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-7',
    system: [
      { type: 'text', text: longSystemPrompt, cache_control: { type: 'ephemeral' } }
    ],
    messages: [...]
  });
  // Cache hit: 90% cheaper, 85% faster

Semantic cache with Qdrant:

  // Check semantic cache
  const emb = await embed(prompt);
  const similar = await qdrant.search('llm_cache', { vector: emb, limit: 1 });
  if (similar[0]?.score > 0.95) return similar[0].payload.response;
  // Else call LLM + save

Vercel AI SDK cache:

  import { unstable_cache } from 'next/cache';
  const cachedGenerate = unstable_cache(
    async (prompt) => generateText({ model, prompt }),
    ['llm-generate'],
    { revalidate: 3600 }
  );

Monitor cache metrics:

  // Log for analytics
  cacheHits.inc({ hit: result ? 'true' : 'false' });
  // Prometheus: llm_cache_hit_ratio
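The examples above can be tied together in one layered lookup: exact match first, then semantic, then the real call. This is a minimal sketch; `cachedCompletion` and the injected helpers are hypothetical names, and the dependencies are passed in so the flow runs without Redis, a vector DB, or an API key.

```javascript
// Layered cache lookup: cheapest layer first.
async function cachedCompletion(prompt, { exactCache, semanticSearch, callLLM, threshold = 0.95 }) {
  // Layer 1: exact match (free, instant).
  const exact = exactCache.get(prompt);
  if (exact !== undefined) return { source: 'exact', text: exact };

  // Layer 2: semantic match above the similarity threshold.
  const near = await semanticSearch(prompt);
  if (near && near.score > threshold) return { source: 'semantic', text: near.text };

  // Layer 3: real call (provider-side prompt caching still applies here).
  const text = await callLLM(prompt);
  exactCache.set(prompt, text);
  return { source: 'llm', text };
}
```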

Common Pitfalls

  • Non-deterministic prompt parts (current date, user ID embedded in the prompt) mean the cache never hits. Strip or normalize the dynamic parts before hashing
  • Temperature > 0 produces varied outputs, while a cache always returns the same one. Use temperature=0 for calls you intend to cache
  • A long TTL (24h+) serves stale data. News and user-specific data need short TTLs
  • Semantic cache false positives: embedding similarity of 0.9 can still mean a different intent. A threshold of 0.95+ is safer
  • Anthropic's cache requires a minimum of 1024 tokens in the cached prefix. Smaller system prompts are not cacheable
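One way to handle the first pitfall is to normalize the prompt before hashing, so otherwise-identical prompts share a cache entry. A sketch, assuming ISO dates and `user_xxx`-style identifiers; the regexes are illustrative and must be adapted to your own prompt templates.

```javascript
// Strip volatile parts out of the prompt before using it as a cache key.
function normalizeForCache(prompt) {
  return prompt
    .replace(/\d{4}-\d{2}-\d{2}/g, '<DATE>')   // ISO dates like 2024-05-01
    .replace(/user_[a-z0-9]+/gi, '<USER_ID>')  // IDs like user_ab12
    .replace(/\s+/g, ' ')                      // collapse whitespace
    .trim();
}
```

Hash the normalized prompt (not the raw one) when building the Redis key, and keep the raw prompt for the actual LLM call.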


Frequently Asked Questions

Cache hit rate target?

30-50% is good for a general chatbot. 70%+ is achievable for repetitive query sets (FAQ, docs Q&A).
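A minimal tracker for that ratio (illustrative, not a library API; in practice you would export this as a Prometheus gauge, as in the metrics example above):

```javascript
// Count hits and misses and compute the running hit ratio.
class CacheStats {
  constructor() { this.hits = 0; this.misses = 0; }
  record(hit) { hit ? this.hits++ : this.misses++; }
  ratio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```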

How does OpenAI automatic cache work?

Prompt prefixes longer than 1024 tokens are cached automatically for roughly 5-60 minutes, and cached input tokens are billed at a 50% discount. No code changes are required.
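OpenAI reports cache usage on each Chat Completions response in `usage.prompt_tokens_details.cached_tokens`. A small helper can estimate the savings; the 0.5 factor reflects the 50% discount on cached input tokens, and since pricing varies per model, treat the result as an estimate.

```javascript
// Estimate money saved by the provider cache from a response's usage object.
function cacheSavings(usage, inputPricePerToken) {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached * inputPricePerToken * 0.5;
}
```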

Cache for streaming?

Yes. Cache the full response once the stream completes. On a later hit you can replay it from Redis chunk by chunk, optionally with an artificial delay to preserve the streaming UX.
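A sketch of that pattern, assuming chunks arrive as an async iterable (as they do from most streaming SDKs); `streamAndCache` and `replayFromCache` are illustrative names, and the Map stands in for Redis.

```javascript
// Consume the stream, forward chunks, and cache only the completed response.
async function streamAndCache(stream, cache, key) {
  let full = '';
  for await (const chunk of stream) {
    full += chunk;        // forward each chunk to the client here as well
  }
  cache.set(key, full);   // cache the full text after completion
  return full;
}

// Replay a cached response as a stream; delayMs > 0 mimics token-by-token UX.
async function* replayFromCache(fullText, chunkSize = 20, delayMs = 0) {
  for (let i = 0; i < fullText.length; i += chunkSize) {
    if (delayMs) await new Promise((r) => setTimeout(r, delayMs));
    yield fullText.slice(i, i + chunkSize);
  }
}
```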

Invalidation strategy?

Time-based (TTL) is simple. Tag-based (invalidate everything tagged "docs:v2") is more complex. For most LLM use cases TTL is enough.
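If you do need tag-based invalidation, a sketch with in-memory structures; with Redis, the tag index would be a Set per tag (`SADD` the key on write, then read the set and `DEL` its members to invalidate).

```javascript
const store = new Map();     // cache key -> value
const tagIndex = new Map();  // tag -> Set of cache keys

// Write a value and register it under each tag.
function setWithTags(key, value, tags) {
  store.set(key, value);
  for (const tag of tags) {
    if (!tagIndex.has(tag)) tagIndex.set(tag, new Set());
    tagIndex.get(tag).add(key);
  }
}

// Delete every cached entry carrying the tag. Entries left in other tags'
// index sets become stale pointers, which is harmless in this sketch.
function invalidateTag(tag) {
  for (const key of tagIndex.get(tag) ?? []) store.delete(key);
  tagIndex.delete(tag);
}
```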