
How to Cache LLM API Calls

Key idea:

Caching can cut LLM cost by up to 10x. Three layers: (1) Exact match: hash the prompt, look it up in Redis, and on a hit return the stored response without an LLM call (free, instant). (2) Semantic cache: embed the prompt, search a vector DB for a similar query, and return the cached answer if similarity > 0.95. (3) Provider cache: Anthropic Prompt Caching makes cache hits ~90% cheaper, and OpenAI caches long prompt prefixes automatically. Combine all three for maximum savings.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Exact cache: hash = md5(model + prompt) → Redis
  2. TTL: 1 hour default, longer for stable prompts
  3. Semantic cache: embed prompt → ANN search in vector DB
  4. Threshold: similarity > 0.95 for reliability
  5. Anthropic: cache_control: {type: "ephemeral"} on system prompt
  6. OpenAI automatic cache: prefix > 1024 tokens → cached automatically
  7. Monitor hit rate: > 30% cache hit ratio — success

Working Examples

Redis exact cache (Node):

  const key = `llm:${model}:${crypto.createHash('md5').update(prompt).digest('hex')}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const response = await openai.chat.completions.create({...});
  await redis.setex(key, 3600, JSON.stringify(response));

Anthropic prompt cache:

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-7',
    system: [
      { type: 'text', text: longSystemPrompt, cache_control: { type: 'ephemeral' } }
    ],
    messages: [...]
  });
  // Cache hit: 90% cheaper, 85% faster

Semantic cache with Qdrant:

  // Check semantic cache
  const emb = await embed(prompt);
  const similar = await qdrant.search('llm_cache', { vector: emb, limit: 1 });
  if (similar[0]?.score > 0.95) return similar[0].payload.response;
  // Else call LLM + save

Vercel AI SDK cache:

  import { unstable_cache } from 'next/cache';
  const cachedGenerate = unstable_cache(
    async (prompt) => generateText({ model, prompt }),
    ['llm-generate'],
    { revalidate: 3600 }
  );

Monitor cache metrics:

  // Log for analytics
  cacheHits.inc({ hit: result ? 'true' : 'false' });
  // Prometheus: llm_cache_hit_ratio
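The examples above can be tied together in one layered lookup: exact match first, then semantic, then the real call. This is a minimal sketch; `cachedCompletion` and the injected helpers are hypothetical names, and the dependencies are passed in so the flow runs without Redis, a vector DB, or an API key.

```javascript
// Layered cache lookup: cheapest layer first.
async function cachedCompletion(prompt, { exactCache, semanticSearch, callLLM, threshold = 0.95 }) {
  // Layer 1: exact match (free, instant).
  const exact = exactCache.get(prompt);
  if (exact !== undefined) return { source: 'exact', text: exact };

  // Layer 2: semantic match above the similarity threshold.
  const near = await semanticSearch(prompt);
  if (near && near.score > threshold) return { source: 'semantic', text: near.text };

  // Layer 3: real call (provider-side prompt caching still applies here).
  const text = await callLLM(prompt);
  exactCache.set(prompt, text);
  return { source: 'llm', text };
}
```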

Common Pitfalls

  • Non-deterministic prompt parts (current date, user ID embedded in the prompt) mean the cache never hits. Strip or normalize the dynamic parts before hashing
  • Temperature > 0 produces varied outputs, while a cache always returns the same one. Use temperature=0 for calls you intend to cache
  • A long TTL (24h+) serves stale data. News and user-specific data need short TTLs
  • Semantic cache false positives: embedding similarity of 0.9 can still mean a different intent. A threshold of 0.95+ is safer
  • Anthropic's cache requires a minimum of 1024 tokens in the cached prefix. Smaller system prompts are not cacheable
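One way to handle the first pitfall is to normalize the prompt before hashing, so otherwise-identical prompts share a cache entry. A sketch, assuming ISO dates and `user_xxx`-style identifiers; the regexes are illustrative and must be adapted to your own prompt templates.

```javascript
// Strip volatile parts out of the prompt before using it as a cache key.
function normalizeForCache(prompt) {
  return prompt
    .replace(/\d{4}-\d{2}-\d{2}/g, '<DATE>')   // ISO dates like 2024-05-01
    .replace(/user_[a-z0-9]+/gi, '<USER_ID>')  // IDs like user_ab12
    .replace(/\s+/g, ' ')                      // collapse whitespace
    .trim();
}
```

Hash the normalized prompt (not the raw one) when building the Redis key, and keep the raw prompt for the actual LLM call.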


Frequently Asked Questions

Cache hit rate target?

30-50% is good for a general chatbot. 70%+ is achievable for repetitive query sets (FAQ, docs Q&A).
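A minimal tracker for that ratio (illustrative, not a library API; in practice you would export this as a Prometheus gauge, as in the metrics example above):

```javascript
// Count hits and misses and compute the running hit ratio.
class CacheStats {
  constructor() { this.hits = 0; this.misses = 0; }
  record(hit) { hit ? this.hits++ : this.misses++; }
  ratio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```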

How does OpenAI automatic cache work?

Prompt prefixes longer than 1024 tokens are cached automatically for roughly 5-60 minutes, and cached input tokens are billed at a 50% discount. No code changes are required.
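OpenAI reports cache usage on each Chat Completions response in `usage.prompt_tokens_details.cached_tokens`. A small helper can estimate the savings; the 0.5 factor reflects the 50% discount on cached input tokens, and since pricing varies per model, treat the result as an estimate.

```javascript
// Estimate money saved by the provider cache from a response's usage object.
function cacheSavings(usage, inputPricePerToken) {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached * inputPricePerToken * 0.5;
}
```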

Cache for streaming?

Yes. Cache the full response once the stream completes. On a later hit you can replay it from Redis chunk by chunk, optionally with an artificial delay to preserve the streaming UX.
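A sketch of that pattern, assuming chunks arrive as an async iterable (as they do from most streaming SDKs); `streamAndCache` and `replayFromCache` are illustrative names, and the Map stands in for Redis.

```javascript
// Consume the stream, forward chunks, and cache only the completed response.
async function streamAndCache(stream, cache, key) {
  let full = '';
  for await (const chunk of stream) {
    full += chunk;        // forward each chunk to the client here as well
  }
  cache.set(key, full);   // cache the full text after completion
  return full;
}

// Replay a cached response as a stream; delayMs > 0 mimics token-by-token UX.
async function* replayFromCache(fullText, chunkSize = 20, delayMs = 0) {
  for (let i = 0; i < fullText.length; i += chunkSize) {
    if (delayMs) await new Promise((r) => setTimeout(r, delayMs));
    yield fullText.slice(i, i + chunkSize);
  }
}
```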

Invalidation strategy?

Time-based (TTL) is simple. Tag-based (invalidate everything tagged "docs:v2") is more complex. For most LLM use cases TTL is enough.
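If you do need tag-based invalidation, a sketch with in-memory structures; with Redis, the tag index would be a Set per tag (`SADD` the key on write, then read the set and `DEL` its members to invalidate).

```javascript
const store = new Map();     // cache key -> value
const tagIndex = new Map();  // tag -> Set of cache keys

// Write a value and register it under each tag.
function setWithTags(key, value, tags) {
  store.set(key, value);
  for (const tag of tags) {
    if (!tagIndex.has(tag)) tagIndex.set(tag, new Set());
    tagIndex.get(tag).add(key);
  }
}

// Delete every cached entry carrying the tag. Entries left in other tags'
// index sets become stale pointers, which is harmless in this sketch.
function invalidateTag(tag) {
  for (const key of tagIndex.get(tag) ?? []) store.delete(key);
  tagIndex.delete(tag);
}
```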