LLM inference costs in 2026 are falling roughly 8x year over year. GPT-5 ($5 input / $15 output per 1M tokens) is half the price of GPT-4 (2023) at better quality. Llama 3 70B via Together.ai runs $0.88/1M, about 8x cheaper than GPT-5. Self-hosting Llama 3 70B on an H100 at $3/hour comes to roughly $0.05 per 1M tokens amortized, around 100x below GPT-5's input price. The drivers: falling API prices, faster hardware, and INT4 quantization. 2027 forecast: GPT-5-class quality at $0.50/1M.
Below: key results, platform breakdown, implications, methodology, FAQ.
| Metric | Value | Median | p75 |
|---|---|---|---|
| GPT-5 / GPT-4 price ratio | 50% ($5 vs $10) | — | — |
| Llama 3 70B (Together.ai) | $0.88/1M | 0.88 | — |
| Self-host Llama 3 70B (H100) | $0.05/1M | 0.05 | — |
| Median cost per query (RAG app) | $0.001 | 0.001 | 0.005 |
| Cache hit ratio | 35% | — | — |
| YoY cost decline | ~8x | — | — |
| TTFT (time to first token) | 320ms median | 320 | 620 |
| Tokens/sec (Groq LPU) | 500+ | 500 | 750 |
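The $0.001 median per-query cost is consistent with GPT-5 pricing at modest token counts. A minimal sketch; the 140/20 token split is an illustrative assumption, not a figure from the measured dataset:

```python
# Cost of one RAG query at GPT-5 pricing ($5 input / $15 output per 1M tokens).
PRICE_IN = 5.00 / 1_000_000    # $ per input token
PRICE_OUT = 15.00 / 1_000_000  # $ per output token

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# e.g. ~140 prompt tokens + ~20 completion tokens lands on the $0.001 median
print(round(query_cost(140, 20), 6))  # 0.001
```

The p75 figure of $0.005 corresponds to about five times those token counts, e.g. longer retrieved context.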

| Platform | Tier | Pricing detail |
|---|---|---|
| OpenAI GPT-5 | Frontier | $5/$15 per 1M |
| Claude Opus 4.7 | Frontier | $15/$75 per 1M |
| Gemini 2.5 Pro | Frontier | $2/$10 per 1M |
| Llama 3 70B (Together) | Mid-tier | $0.88/$0.88 per 1M |
| Groq Llama 3 70B (LPU) | Mid-tier | $0.59/$0.79 per 1M |
| Self-host Llama 3 70B H100 | DIY | $0.05 per 1M (amortized) |
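Input and output prices differ per provider, so comparisons depend on your workload's input/output split. A small sketch that blends the table's prices into a single $/1M figure; the 80/20 split is an assumption, substitute your own ratio:

```python
# Blended $/1M tokens per provider, assuming an 80/20 input/output split.
PRICES = {  # (input $/1M, output $/1M), from the table above
    "GPT-5": (5.00, 15.00),
    "Claude Opus 4.7": (15.00, 75.00),
    "Gemini 2.5 Pro": (2.00, 10.00),
    "Llama 3 70B (Together)": (0.88, 0.88),
    "Groq Llama 3 70B": (0.59, 0.79),
}

def blended(price_in: float, price_out: float, out_frac: float = 0.2) -> float:
    """Weighted $/1M across input and output tokens."""
    return price_in * (1 - out_frac) + price_out * out_frac

for name, (pin, pout) in PRICES.items():
    print(f"{name}: ${blended(pin, pout):.2f}/1M blended")
```

At this split GPT-5 blends to $7.00/1M, so Together's flat $0.88 rate is about 8x cheaper, matching the headline ratio.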
Methodology: public pricing pages (March 2026), usage data from 500 apps, and Groq/Together benchmarks, with prices tracked over the trailing 12 months.
When does self-hosting pay off? At more than ~10M tokens/day of sustained load. One H100 at $3/hour × 24 h × 30 days = $2,160/month, for roughly 2.4B tokens of throughput.
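The break-even volume can be sketched as a one-liner. The $10/1M API price below is an assumption (roughly GPT-5 at an even input/output split); against cheaper mid-tier APIs the break-even is correspondingly higher:

```python
# Break-even daily token volume for one self-hosted H100 versus an API.
# The $10/1M blended API price is an assumption, not a figure from the report.
H100_PER_HOUR = 3.00
MONTHLY_COST = H100_PER_HOUR * 24 * 30  # $2,160/month

def break_even_tokens_per_day(api_price_per_1m: float) -> float:
    """Millions of tokens/day above which the GPU beats the API on price."""
    return MONTHLY_COST / api_price_per_1m / 30

print(break_even_tokens_per_day(10.0))  # 7.2 (M tokens/day)
```

Against frontier pricing the crossover sits around 7M tokens/day, consistent with the >10M/day guidance once utilization overhead and ops cost are factored in.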
Mini models: $0.15/$0.60 per 1M, 25x cheaper than GPT-5, at 70-85% quality on most tasks. Use the mini tier for chatbots, classification, and simple extraction.
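That guidance amounts to a task-type router. A minimal sketch; the model names are placeholders and the task taxonomy is an assumption based on the list above:

```python
# Route cheap task types to the mini tier, everything else to the frontier model.
# Model names are placeholders, not real API identifiers.
MINI_TASKS = {"chatbot", "classification", "simple_extraction"}

def pick_model(task: str) -> str:
    """Choose the cheapest tier adequate for the task type."""
    return "mini-model" if task in MINI_TASKS else "frontier-model"

assert pick_model("classification") == "mini-model"
assert pick_model("multi_step_reasoning") == "frontier-model"
```

At $0.60 vs $15 per 1M output tokens, every query the router keeps on the mini tier costs 25x less.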
Caching: Anthropic's prompt caching is 90% cheaper on a cache hit; OpenAI's automatic caching is 50% cheaper. At a 35% cache-hit rate that works out to a 30%+ cost reduction.
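The savings arithmetic is just hit rate times discount, since cached tokens are billed at the discounted rate and the rest at full price:

```python
# Expected fraction of spend saved by prompt caching:
# a `hit_ratio` share of tokens gets a `discount` off the normal price.
def cache_savings(hit_ratio: float, discount: float) -> float:
    """Fraction of total token spend saved."""
    return hit_ratio * discount

print(round(cache_savings(0.35, 0.90), 3))  # 0.315 (Anthropic-style 90% discount)
print(round(cache_savings(0.35, 0.50), 3))  # 0.175 (OpenAI-style 50% discount)
```

So the 30%+ figure holds for the 90% discount; with a 50% discount the same hit rate saves about 17.5%.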
Cost monitoring: per-provider dashboards plus app-level tagging via an X-Project header. Anomalies trigger alerts (daily spend above threshold).
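The tagging-plus-alert setup can be sketched as a small accumulator. The class, the $100 threshold, and the project names are illustrative assumptions, not part of any provider's API:

```python
# Minimal per-project daily spend tracker with a threshold alert.
# Project tags would come from an X-Project request header upstream.
from collections import defaultdict

class SpendMonitor:
    def __init__(self, daily_threshold_usd: float):
        self.daily_threshold = daily_threshold_usd
        self.daily_spend = defaultdict(float)  # (date, project) -> $ spent

    def record(self, date: str, project: str, cost_usd: float) -> bool:
        """Record one request's cost; True means the project tripped the alert."""
        self.daily_spend[(date, project)] += cost_usd
        return self.daily_spend[(date, project)] > self.daily_threshold

monitor = SpendMonitor(daily_threshold_usd=100.0)
monitor.record("2026-03-01", "rag-app", 60.0)            # under threshold
alerted = monitor.record("2026-03-01", "rag-app", 50.0)  # $110 total > $100
print(alerted)  # True
```

In production the alert hook would page or post to chat rather than return a bool, but the accumulation logic is the same.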