How to Cache LLM API Calls

Author: Anatoly Oshmanovsky · Updated April 18, 2026

In short:

Caching can cut LLM costs by up to 10x. (1) Exact match: hash the prompt and look it up in Redis; on a hit, return the stored response without calling the LLM (no API cost, millisecond latency). (2) Semantic cache: embed the prompt, search a vector DB for a similar query, and return the cached answer if similarity > 0.95. (3) Provider cache: Anthropic Prompt Caching makes cached input tokens up to 90% cheaper on a cache hit, and OpenAI caches long prompt prefixes automatically. Combine all three for maximum savings.

Below: step-by-step instructions, working examples, common mistakes, and an FAQ.

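Step-by-step setup

Exact cache: key = md5(model + prompt) → Redis. Hash every parameter that changes the output (system prompt, temperature, max tokens), not just the user prompt.
TTL: 1 hour by default; longer for stable prompts.
Semantic cache: embed the prompt → ANN search in a vector DB.
Threshold: return the cached answer only when similarity > 0.95.
Anthropic: set cache_control: {type: "ephemeral"} on the system prompt.
OpenAI automatic cache: prompts of 1024 tokens or more have their prefix cached automatically.
Monitor the hit rate: a cache hit ratio above 30% counts as a success.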
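
The exact-match steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `CacheableRequest` shape, `buildCacheKey`, and `TtlCache` are hypothetical names introduced here, the in-memory map only stands in for Redis GET / SETEX so the sketch runs without a server, and SHA-256 over a JSON-serialized request is swapped in for the md5(model + prompt) mentioned in the list so that request parameters are part of the key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape: only the fields that influence the model output.
interface CacheableRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}

// Build a deterministic key. Serializing a fixed-order object avoids the
// ambiguity of plain string concatenation (model + prompt).
function buildCacheKey(req: CacheableRequest): string {
  const canonical = JSON.stringify({
    model: req.model,
    messages: req.messages.map((m) => ({ role: m.role, content: m.content })),
    temperature: req.temperature ?? null,
    maxTokens: req.maxTokens ?? null,
  });
  const digest = createHash("sha256").update(canonical).digest("hex");
  return `llm:${req.model}:${digest}`;
}

// In-memory stand-in for Redis GET / SETEX (assumption: swap in a real client).
class TtlCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  get(key: string, now: number = Date.now()): string | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      // Expired entries are dropped lazily on read.
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string, ttlSeconds: number, now: number = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + ttlSeconds * 1000 });
  }
}
```

A caller would compute the key, try `get`, and only call the LLM and `set` the serialized response (for example with a 3600-second TTL) on a miss.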

Working examples

Scenario	Config
Redis exact cache (Node)	`/* redis: an ioredis client */ const key = 'llm:' + model + ':' + crypto.createHash('md5').update(prompt).digest('hex'); const cached = await redis.get(key); if (cached) return JSON.parse(cached); const response = await openai.chat.completions.create({...}); await redis.setex(key, 3600, JSON.stringify(response)); return response;`
Anthropic prompt cache	`const response = await anthropic.messages.create({ model: 'claude-opus-4-7', max_tokens: 1024, system: [ { type: 'text', text: longSystemPrompt, cache_control: { type: 'ephemeral' } } ], messages: [...] }); // On a cache hit: cached input up to 90% cheaper, latency up to 85% lower`
Semantic cache with Qdrant	`/* Check the semantic cache */ const emb = await embed(prompt); const similar = await qdrant.search('llm_cache', { vector: emb, limit: 1 }); if (similar.length > 0 && similar[0].score > 0.95) return similar[0].payload?.response; /* Otherwise call the LLM and save the result */`
Next.js unstable_cache (Vercel AI SDK)	`import { unstable_cache } from 'next/cache'; import { generateText } from 'ai'; const cachedGenerate = unstable_cache( async (prompt) => (await generateText({ model, prompt })).text, ['llm-generate'], { revalidate: 3600 } );`
Monitor cache metrics	`/* Log for analytics */ cacheHits.inc({ hit: result ? 'true' : 'false' }); /* Prometheus: llm_cache_hit_ratio */`
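
To make the semantic-cache row concrete without a running vector DB, here is a small in-memory sketch of the same decision rule: find the nearest stored embedding and reuse its answer only above the 0.95 threshold. The names `SemanticEntry`, `cosineSimilarity`, and `semanticLookup` are hypothetical, the embeddings are assumed to come from whatever embedding model you already use, and the linear scan stands in for the ANN search a real vector DB would perform.

```typescript
// One cached answer together with the embedding of the prompt that produced it.
interface SemanticEntry {
  vector: number[];
  response: string;
}

// Cosine similarity of two equal-length vectors; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan for the best match (assumption: a vector DB does this with ANN).
// Returns the cached response only if the best score exceeds the threshold.
function semanticLookup(
  query: number[],
  entries: SemanticEntry[],
  threshold: number = 0.95,
): string | undefined {
  let best: SemanticEntry | undefined;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.vector);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best !== undefined && bestScore > threshold ? best.response : undefined;
}
```

Returning `undefined` on a near miss is deliberate: a miss costs one LLM call, while a false positive returns a wrong answer to the user.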

Common mistakes

Non-deterministic prompts (current date, user ID in the prompt) → the cache never hits. Strip the dynamic parts before hashing.
Temperature > 0 gives random outputs. A cache replays one stored answer, so use temperature=0 for cacheable calls.
Long TTL (24h+) on data that goes stale: news and user data need a short TTL.
Semantic cache false positives: two prompts with embedding similarity 0.9 can have different intent. 0.95+ is safer.
Anthropic caching has a minimum prompt length (1024 tokens on many models, more on some); shorter prompts are not cached.
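
For the first mistake in the list, one possible approach is to normalize the prompt before hashing it. The sketch below is illustrative only: `normalizePrompt` and the two patterns (ISO-style dates and a `user_id` field) are assumptions made for this example, and stripping a value is only safe when that value does not change the correct answer.

```typescript
// Hypothetical patterns; adjust them to the dynamic fields your prompts embed.
const DYNAMIC_PATTERNS: [RegExp, string][] = [
  // ISO dates, optionally followed by a time such as T10:30:00Z
  [/\b\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?(\.\d+)?Z?)?\b/g, "<DATE>"],
  // user_id: 42, user id=abc-123, userid = 7
  [/\buser[_ ]?id\s*[:=]\s*[\w-]+/gi, "user_id=<ID>"],
];

// Replace dynamic fragments with placeholders and collapse whitespace,
// so that prompts differing only in those fragments share one cache key.
function normalizePrompt(prompt: string): string {
  let out = prompt;
  for (const [pattern, placeholder] of DYNAMIC_PATTERNS) {
    out = out.replace(pattern, placeholder);
  }
  return out.replace(/\s+/g, " ").trim();
}
```

The normalized string is used only to build the cache key; the original prompt is still what gets sent to the model on a miss.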

Frequently asked questions

What cache hit rate should you target?

30-50% is good for a general chatbot. 70%+ is realistic for frequently repeated queries (FAQ, docs Q&A).
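
To compare your own numbers against these targets, a hit ratio can be tracked with two counters. The `CacheStats` class below is a hypothetical helper written for this article, not part of any metrics library; in production you would more likely export the counters to your monitoring system.

```typescript
// Counts cache hits and misses and derives the hit ratio.
class CacheStats {
  private hits = 0;
  private misses = 0;

  // Call once per lookup: true for a hit, false for a miss.
  record(hit: boolean): void {
    if (hit) this.hits++;
    else this.misses++;
  }

  // Fraction of lookups served from cache; 0 when nothing was recorded yet.
  hitRatio(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }

  // True when the observed ratio reaches the given target (e.g. 0.3 for 30%).
  meetsTarget(target: number): boolean {
    return this.hitRatio() >= target;
  }
}
```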

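How does OpenAI's automatic cache work?

Prompts of 1024 tokens or more are cached automatically for roughly 5-60 minutes. Cached input tokens cost about 50% less (the exact discount depends on the model). There is no explicit API to call; caching matches on the prompt prefix, so keep the static part of the prompt at the start.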
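
Because there is nothing to switch on, the practical way to confirm the automatic cache is working is to inspect the usage data returned with each response. The sketch below assumes the Chat Completions usage shape, where `usage.prompt_tokens_details.cached_tokens` reports how many prompt tokens were served from cache; `UsageLike` and `cachedTokenShare` are hypothetical names for this example.

```typescript
// Minimal subset of the usage object this sketch relies on.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Share of prompt tokens served from the provider cache (0 to 1).
// Missing details are treated as "nothing was cached".
function cachedTokenShare(usage: UsageLike): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}
```

Logging this share per request shows quickly whether a changing prefix (for example a timestamp at the top of the system prompt) is defeating the cache.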

Does caching work with streaming?

Yes. Cache the full response once the stream completes. On a later hit you can stream it from Redis with an artificial delay for UX.
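
Replaying a cached response as a stream might look like the sketch below. `chunkText` and `replayCached` are hypothetical helpers, and the chunk size and delay are arbitrary illustrative values rather than recommendations.

```typescript
// Split text into chunks of at most chunkSize characters.
// Array.from iterates by code point, so emoji and other surrogate pairs stay intact.
function chunkText(text: string, chunkSize: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chars = Array.from(text);
  const chunks: string[] = [];
  for (let i = 0; i < chars.length; i += chunkSize) {
    chunks.push(chars.slice(i, i + chunkSize).join(""));
  }
  return chunks;
}

// Yield a cached response piece by piece, pausing between chunks so the
// client sees the same typing effect as with a live model stream.
async function* replayCached(
  text: string,
  chunkSize: number = 16,
  delayMs: number = 20,
): AsyncGenerator<string> {
  for (const chunk of chunkText(text, chunkSize)) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    yield chunk;
  }
}
```

A handler would consume it with `for await (const chunk of replayCached(cachedText))` and write each chunk to the response in the same format as a live stream.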

What invalidation strategy should you use?

Time-based (TTL) is simple. Tag-based (invalidate everything tagged "docs:v2") is more complex. For most LLM use cases, TTL is enough.
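
To show where the extra complexity of tag-based invalidation comes from, here is a minimal in-memory sketch. `TaggedCache` is a hypothetical class for illustration: it keeps a reverse index from tag to keys, which is the bookkeeping a TTL-only cache does not need. It does not clean up old tags when a key is overwritten with different tags, which a production version would have to handle.

```typescript
// Cache with a reverse index from tag to the keys stored under that tag.
class TaggedCache {
  private values = new Map<string, string>();
  private keysByTag = new Map<string, Set<string>>();

  set(key: string, value: string, tags: string[] = []): void {
    this.values.set(key, value);
    for (const tag of tags) {
      let keys = this.keysByTag.get(tag);
      if (!keys) {
        keys = new Set<string>();
        this.keysByTag.set(tag, keys);
      }
      keys.add(key);
    }
  }

  get(key: string): string | undefined {
    return this.values.get(key);
  }

  // Remove every entry stored under the tag; returns how many were removed.
  invalidateTag(tag: string): number {
    const keys = this.keysByTag.get(tag);
    if (!keys) return 0;
    let removed = 0;
    keys.forEach((key) => {
      if (this.values.delete(key)) removed++;
    });
    this.keysByTag.delete(tag);
    return removed;
  }
}
```

For example, answers generated from version 2 of your documentation could be stored with the tag "docs:v2" and dropped in one call when version 3 ships.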
