LLM streaming is key to good UX. Without streaming, the user stares at an empty screen for ~10 s until the full response arrives; with streaming, the first token appears in 300-500 ms. The paradigm: Server-Sent Events (SSE) or chunked HTTP. OpenAI, Anthropic, and Gemini all support `stream: true`. On the frontend: `fetch()` + `ReadableStream` + `TextDecoder`, appending to the UI incrementally.
Below: a step-by-step walkthrough, working examples, common pitfalls, and an FAQ.
The pipeline in three steps: `stream: true` in the LLM call; `res.write("data: " + chunk + "\n\n")` on the server; `fetch(...).then(r => r.body.getReader())` on the client. Scenarios and configs:
**Backend OpenAI streaming**

```js
const stream = await openai.chat.completions.create({
  model: 'gpt-5',
  stream: true,
  messages: [...],
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  res.write(`data: ${JSON.stringify({ content })}\n\n`);
}
res.end();
```
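The server side of the SSE contract can be isolated into a tiny helper. This is a minimal sketch; the `SSE_HEADERS` constant and `sseEvent` name are illustrative, not part of any SDK. Each event is a `data: <payload>` line terminated by a blank line, and the response needs streaming-friendly headers so browsers and proxies treat it as an event stream:

```js
// Headers for an SSE response (pass to res.writeHead(200, SSE_HEADERS)).
const SSE_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
};

// Format one chunk as an SSE event frame: "data: <json>\n\n".
function sseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

console.log(JSON.stringify(sseEvent({ content: 'Hi' })));
```

Keeping the framing in one place makes it harder to forget the trailing blank line, which is what delimits events on the wire.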
**Frontend fetch streaming**

```js
const response = await fetch('/api/ai/chat', {
  method: 'POST',
  body: JSON.stringify({ message }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = decoder.decode(value, { stream: true });
  // Parse SSE lines, extract content, append to the UI.
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const { content } = JSON.parse(line.slice(6));
    outputEl.textContent += content;
  }
}
```
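One caveat with the naive loop above: a network chunk can split an SSE event mid-JSON. A sketch of a more robust parser, buffering until a full `\n\n`-terminated frame arrives (the `createSseParser` name is illustrative):

```js
// Stateful SSE parser: feed decoded chunks, get back completed payloads.
function createSseParser() {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    const events = [];
    let idx;
    // A frame ends at "\n\n"; anything after the last one stays buffered.
    while ((idx = buffer.indexOf('\n\n')) !== -1) {
      const frame = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2);
      for (const line of frame.split('\n')) {
        if (line.startsWith('data: ')) {
          events.push(JSON.parse(line.slice(6)));
        }
      }
    }
    return events;
  };
}

const feed = createSseParser();
feed('data: {"content":"Hel');                    // incomplete frame, returns []
console.log(feed('lo"}\n\ndata: {"content":"!"}\n\n'));
```

Drop this in place of the inline `chunk.split('\n')` parsing when responses carry long JSON payloads per event.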
**Vercel AI SDK (modern)**

```js
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = streamText({
  model: openai('gpt-5'),
  messages: [...],
});
for await (const chunk of result.textStream) {
  console.log(chunk);
}
```
**Anthropic Claude streaming**

```js
const stream = anthropic.messages.stream({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  messages: [...],
});
stream.on('text', (text) => {
  console.log('chunk:', text);
});
```
**SSE client (EventSource)**

```js
const es = new EventSource('/api/ai/stream?query=hello');
es.onmessage = (event) => {
  const { content } = JSON.parse(event.data);
  outputEl.textContent += content;
};
```
Behind Nginx, set `proxy_buffering off;` in the streaming location, otherwise chunks get buffered and arrive in one burst. SSE vs WebSocket: SSE is one-way (server→client), plain HTTP, and proxy-friendly; WebSocket is full-duplex and overkill for LLM chat, since client→server streaming isn't needed.
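A sketch of what that Nginx location might look like, assuming a Node backend on port 3000 (paths, port, and timeout values are illustrative):

```nginx
location /api/ai/ {
    proxy_pass http://127.0.0.1:3000;
    proxy_buffering off;       # flush chunks to the client as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;   # LLM responses can stream for minutes
    proxy_http_version 1.1;
    proxy_set_header Connection '';
}
```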
TTFT is typically 300-800 ms. Factors: server region, model size, prompt caching. OpenAI with a prompt-cache hit can get down to ~150 ms.
To reduce TTFT (time to first token): cache the prompt prefix, keep prompts short, and route latency-sensitive paths to a smaller model. Prompt caching alone can cut TTFT up to 10x.
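Measuring TTFT and throughput needs nothing model-specific: it works over any async chunk stream. A sketch with a mock generator standing in for a real LLM stream (the `measureStream` helper is illustrative):

```js
// Consume any async-iterable stream and report latency metrics.
async function measureStream(stream) {
  const t0 = Date.now();
  let firstTokenAt = null;
  let chunks = 0;
  for await (const chunk of stream) {
    if (firstTokenAt === null) firstTokenAt = Date.now();
    chunks += 1;
  }
  const totalMs = Date.now() - t0;
  return {
    ttftMs: firstTokenAt - t0,                        // time to first token
    chunksPerSec: chunks / Math.max(totalMs / 1000, 1e-9),
  };
}

// Mock stream: 5 chunks, ~20 ms apart.
async function* mockStream() {
  for (let i = 0; i < 5; i++) {
    await new Promise((r) => setTimeout(r, 20));
    yield 'tok';
  }
}

measureStream(mockStream()).then((m) => console.log(m));
```

In production, pass the real SDK stream (e.g. `result.textStream`) instead of the mock; chunks roughly track tokens for the OpenAI-style delta format.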
Track TTFT and tokens/sec in analytics. <a href="/check">Enterno HTTP checker</a> covers general endpoint monitoring; LangSmith / LangFuse handle LLM-specific traces.