Reduce LLM cost up to 10x with three caching layers: (1) Exact match — hash the prompt → Redis; on a hit, return the stored response with no LLM call (free, instant). (2) Semantic cache — embed the prompt, look up similar queries in a vector DB, and return the cached answer if similarity > 0.95. (3) Provider cache — Anthropic Prompt Caching is 90% cheaper on a cache hit; OpenAI caches long prompt prefixes automatically. Combine all three for maximum savings.
Below: step-by-step, working examples, common pitfalls, FAQ.
Quick reference:
- Exact cache key: `hash = md5(model + prompt)` → Redis
- Anthropic prompt cache: `cache_control: { type: "ephemeral" }` on the system prompt

**Redis exact cache (Node)** — `redis` and `openai` clients assumed initialized:

```js
const key = `llm:${model}:${crypto.createHash('md5').update(prompt).digest('hex')}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached); // hit: skip the LLM call entirely
const response = await openai.chat.completions.create({
  model,
  messages: [{ role: 'user', content: prompt }],
});
await redis.setex(key, 3600, JSON.stringify(response)); // cache for 1 hour
```
**Anthropic prompt cache**

```js
const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  system: [
    { type: 'text', text: longSystemPrompt, cache_control: { type: 'ephemeral' } }
  ],
  messages: [...]
});
// Cache hit: 90% cheaper, 85% faster
```
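The "90% cheaper" claim can be put into numbers. A sketch using Anthropic's published multipliers for the 5-minute ephemeral cache (cache write ≈ 1.25× the base input price, cache read ≈ 0.1×); the function name and parameters are illustrative:

```javascript
// Anthropic prompt-cache pricing multipliers (5-minute ephemeral cache):
// cache write ≈ 1.25x base input price, cache read ≈ 0.1x (the "90% cheaper").
function promptCacheCost({ cachedTokens, requests, basePricePerToken }) {
  const noCache = cachedTokens * requests * basePricePerToken;
  const withCache =
    cachedTokens * 1.25 * basePricePerToken +                  // first request writes the cache
    cachedTokens * 0.1 * basePricePerToken * (requests - 1);   // later requests read it
  return { noCache, withCache, savings: 1 - withCache / noCache };
}
```

With a 10k-token system prompt reused across 100 requests, the overall saving approaches the per-hit 90% despite the more expensive initial write.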
**Semantic cache with Qdrant**

```js
// Check the semantic cache before calling the LLM
const emb = await embed(prompt);
const similar = await qdrant.search('llm_cache', { vector: emb, limit: 1, with_payload: true });
if (similar[0]?.score > 0.95) return similar[0].payload.response;
// Otherwise call the LLM and upsert { vector: emb, payload: { response } }
```
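The 0.95 cutoff decision can be exercised locally with plain cosine similarity (the same metric Qdrant computes server-side when the collection uses cosine distance); function names here are illustrative:

```javascript
function cosineSim(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the cached response only when the best match clears the threshold
function semanticHit(queryEmb, entries, threshold = 0.95) {
  let best = null, bestScore = -1;
  for (const e of entries) {
    const s = cosineSim(queryEmb, e.embedding);
    if (s > bestScore) { best = e; bestScore = s; }
  }
  return bestScore > threshold ? best.response : null;
}
```

The threshold is the quality/savings dial: lower it and hit rate rises, but so does the risk of returning a cached answer to a subtly different question.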
**Vercel AI SDK cache (Next.js)**

```js
import { unstable_cache } from 'next/cache';

const cachedGenerate = unstable_cache(
  async (prompt) => generateText({ model, prompt }),
  ['llm-generate'],
  { revalidate: 3600 } // seconds
);
```
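What `unstable_cache` does here — memoize an async function with a revalidate window — can be sketched framework-free. The `now` parameter is injected so expiry is deterministic; names are illustrative:

```javascript
// Memoize an async function per key with a TTL, like { revalidate: 3600 } above
function ttlCache(fn, ttlSeconds, now = () => Date.now()) {
  const store = new Map(); // key -> { value, expiresAt }
  return async (key, ...args) => {
    const hit = store.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;  // fresh: serve from cache
    const value = await fn(key, ...args);                // miss or stale: recompute
    store.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
    return value;
  };
}
```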
**Monitor cache metrics**

```js
// Log hit/miss for analytics (e.g. a prom-client Counter)
cacheHits.inc({ hit: result ? 'true' : 'false' });
// Prometheus query: llm_cache_hit_ratio
```
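The `llm_cache_hit_ratio` metric is just hits over total lookups. The bare math, without prom-client (names are illustrative):

```javascript
// Track hits and misses; ratio() = hits / (hits + misses)
function makeCacheStats() {
  let hits = 0, misses = 0;
  return {
    record(hit) { hit ? hits++ : misses++; },
    ratio() { return hits + misses === 0 ? 0 : hits / (hits + misses); },
  };
}
```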
FAQ:

**What cache hit rate should I expect?** 30-50% is good for a general chatbot; 70%+ is achievable for repetitive traffic (FAQ, docs Q&A).

**Does OpenAI cache automatically?** Yes: prompt prefixes over 1024 tokens are cached automatically for roughly 5-60 minutes, cutting the cost of cached input tokens by 50%. No explicit API changes are needed.

**Can caching work with streaming?** Yes. Cache the full response once the stream completes; on a later hit you can stream it back from Redis, optionally with an artificial delay for UX.

**TTL or tag-based invalidation?** Time-based (TTL) is simple. Tag-based (invalidate everything tagged "docs:v2") is more complex. For most LLM use cases TTL is enough.
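If you do need tag-based invalidation, a minimal sketch of the bookkeeping (key → value plus tag → keys index) shows why it is more work than a TTL; names are illustrative:

```javascript
// Tag-based cache: deleting a tag drops every entry carrying it
function taggedCache() {
  const values = new Map();     // key -> value
  const keysByTag = new Map();  // tag -> Set of keys
  return {
    set(key, value, tags = []) {
      values.set(key, value);
      for (const tag of tags) {
        if (!keysByTag.has(tag)) keysByTag.set(tag, new Set());
        keysByTag.get(tag).add(key);
      }
    },
    get: (key) => values.get(key),
    invalidateTag(tag) {
      for (const key of keysByTag.get(tag) ?? []) values.delete(key);
      keysByTag.delete(tag);
    },
  };
}
```

With Redis you would typically keep the tag index in a Set per tag and `DEL` the member keys on invalidation.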