Cut LLM costs up to 10x with caching: (1) exact match — hash the prompt → Redis; on a hit, return the stored answer with no LLM call (free, instant); (2) semantic cache — embed the prompt → nearest query in a vector DB → return the cached answer if similarity > 0.95; (3) provider cache — Anthropic Prompt Caching is ~90% cheaper on a cache hit, and OpenAI caches automatically. Combine all three for maximum savings.
Below: step-by-step instructions, working examples, common pitfalls, and an FAQ.
Quick reference: exact-cache key = `md5(model + prompt)` → Redis; Anthropic: `cache_control: { type: "ephemeral" }` on the system prompt.

Scenarios and configs:
**Redis exact cache (Node)**

```js
// Exact-match cache: same model + prompt → same key → instant hit.
const crypto = require('crypto');

const key = `llm:${model}:${crypto.createHash('md5').update(prompt).digest('hex')}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);          // hit: no LLM call at all
const response = await openai.chat.completions.create({...});
await redis.setex(key, 3600, JSON.stringify(response)); // TTL: 1 hour
```
**Anthropic prompt cache**

```js
const response = await anthropic.messages.create({
  model: 'claude-opus-4-1',
  system: [
    // Marked blocks are cached server-side and reused across requests
    { type: 'text', text: longSystemPrompt, cache_control: { type: 'ephemeral' } }
  ],
  messages: [...]
});
// Cache hit: cached input tokens ~90% cheaper, responses up to ~85% faster
```
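Anthropic reports cache activity in the response's `usage` block — `cache_creation_input_tokens` on the first write, `cache_read_input_tokens` on subsequent hits — which is handy for verifying the cache actually engages. A small helper over that shape:

```javascript
// Returns true when the response was served against a warm prompt cache.
// Field names follow the Anthropic Messages API `usage` object.
function promptCacheHit(usage) {
  return (usage.cache_read_input_tokens ?? 0) > 0;
}
```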
**Semantic cache with Qdrant**

```js
// Semantic cache: catches paraphrases the exact-match layer misses
const emb = await embed(prompt);
const similar = await qdrant.search('llm_cache', { vector: emb, limit: 1 });
if (similar[0]?.score > 0.95) return similar[0].payload.response;
// Otherwise call the LLM and save the new (embedding, response) pair
```
**Vercel AI SDK cache**

```js
import { generateText } from 'ai';
import { unstable_cache } from 'next/cache';

// Caches generateText results per prompt for an hour (Next.js data cache)
const cachedGenerate = unstable_cache(
  async (prompt) => generateText({ model, prompt }),
  ['llm-generate'],
  { revalidate: 3600 }
);
```
**Monitor cache metrics**

```js
// Log every lookup for analytics
cacheHits.inc({ hit: result ? 'true' : 'false' });
// Prometheus: derive llm_cache_hit_ratio from hit="true" vs total
```
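Behind `cacheHits.inc(...)` there is typically a prom-client `Counter` with a `hit` label, and the ratio is computed at query time in PromQL. A dependency-free stand-in that makes the bookkeeping explicit:

```javascript
// Minimal tally; in production use a prom-client Counter with a `hit` label
// and compute the ratio in PromQL rather than in-process.
class HitCounter {
  constructor() { this.counts = { true: 0, false: 0 }; }
  inc({ hit }) { this.counts[hit] += 1; }
  ratio() {
    const total = this.counts.true + this.counts.false;
    return total === 0 ? 0 : this.counts.true / total;
  }
}
```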
FAQ

**What cache-hit ratio should I expect?** 30-50% is good for a general chatbot; 70%+ is achievable for frequent, repetitive queries (FAQ, docs Q&A).

**Does OpenAI cache automatically?** Yes: prompts over 1024 tokens are cached automatically for roughly 5-60 minutes, with a 50% discount on cached input tokens. There is no explicit API to control it.

**Can I cache streaming responses?** Yes. Cache the full response after the stream completes; on a repeat hit you can re-stream it from Redis with an artificial delay to preserve the UX.

**TTL or tag-based invalidation?** Time-based (TTL) is simple; tag-based (invalidate everything tagged "docs:v2") is more complex. For most LLM use cases, TTL is enough.
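The re-stream trick can be as simple as chunking the cached string through an async generator; the chunk size and delay below are arbitrary choices for this sketch:

```javascript
// Re-emit a cached answer in small chunks so the UI still "streams".
async function* streamFromCache(cached, chunkSize = 20, delayMs = 30) {
  for (let i = 0; i < cached.length; i += chunkSize) {
    yield cached.slice(i, i + chunkSize);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```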
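For the tag-based variant, each write also records its key under every tag, so invalidating a tag is one set scan. Shown with an in-memory store for illustration; with Redis the tag sets would map to SADD/SMEMBERS/DEL:

```javascript
// Cache whose entries can be invalidated in bulk by tag (e.g. "docs:v2").
class TaggedCache {
  constructor() {
    this.store = new Map();
    this.tags = new Map(); // tag -> Set of keys carrying that tag
  }
  set(key, value, keyTags = []) {
    this.store.set(key, value);
    for (const tag of keyTags) {
      if (!this.tags.has(tag)) this.tags.set(tag, new Set());
      this.tags.get(tag).add(key);
    }
  }
  get(key) { return this.store.get(key); }
  invalidateTag(tag) {
    for (const key of this.tags.get(tag) ?? []) this.store.delete(key);
    this.tags.delete(tag);
  }
}
```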