LLM streaming is the key to good UX: without it, the user stares at a blank screen for ~10 s until the full response arrives; with it, the first token shows up within 300–500 ms. The standard transport is Server-Sent Events (SSE) or chunked HTTP, and OpenAI, Anthropic, and Gemini all support `stream: true`. On the frontend: `fetch()` + `ReadableStream` + `TextDecoder`, appending to the UI incrementally.
Below: a step-by-step setup, working examples, common pitfalls, and an FAQ.
The flow in three steps:

1. Set `stream: true` on the LLM call.
2. Backend: forward each chunk with `res.write("data: " + chunk + "\n\n")`.
3. Frontend: read with `fetch(...).then(r => r.body.getReader())` and decode incrementally.
**Backend OpenAI streaming**

```javascript
// Assumes an Express-style handler; set SSE headers before writing chunks.
res.writeHead(200, {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
});

const stream = await openai.chat.completions.create({
  model: 'gpt-5',
  stream: true,
  messages: [...]
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  res.write(`data: ${JSON.stringify({ content })}\n\n`);
}
res.end();
```
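The `res.write` calls above rely on the SSE wire format: each event is a `data:` line terminated by a blank line. A tiny helper (hypothetical name `sseFrame`, not part of any SDK) keeps the framing consistent:

```javascript
// Serialize a payload as one Server-Sent Events frame.
// SSE framing: "data: <payload>\n\n" — the blank line ends the event.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Usage on the server:
// res.write(sseFrame({ content: 'Hel' }));
```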
**Frontend fetch streaming**

```javascript
const response = await fetch('/api/ai/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = decoder.decode(value, { stream: true });
  // Naive parse: assumes each network chunk contains whole "data: ..." lines.
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const { content } = JSON.parse(line.slice(6));
    outputEl.textContent += content;
  }
}
```
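Parsing SSE lines hides a real pitfall: a network chunk can end mid-line or mid-event, so robust clients buffer across `read()` calls. A sketch of an incremental parser (hypothetical `SSEParser`, assuming each `data:` line carries the JSON payload emitted by the backend above):

```javascript
// Incremental SSE parser: feed it raw text chunks; it returns the
// complete "data:" payloads found so far and buffers any partial tail.
class SSEParser {
  constructor() {
    this.buffer = '';
  }
  feed(chunk) {
    this.buffer += chunk;
    const events = [];
    let idx;
    // A blank line ("\n\n") terminates one SSE event.
    while ((idx = this.buffer.indexOf('\n\n')) !== -1) {
      const raw = this.buffer.slice(0, idx);
      this.buffer = this.buffer.slice(idx + 2);
      for (const line of raw.split('\n')) {
        if (line.startsWith('data: ')) events.push(line.slice(6));
      }
    }
    return events;
  }
}

// Usage in the read loop:
// const events = parser.feed(decoder.decode(value, { stream: true }));
// for (const e of events) outputEl.textContent += JSON.parse(e).content;
```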
**Vercel AI SDK (modern)**

```javascript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = streamText({
  model: openai('gpt-5'),
  messages: [...]
});

for await (const chunk of result.textStream) {
  console.log(chunk);
}
```
**Anthropic Claude streaming**

```javascript
// messages.stream() returns a MessageStream synchronously; no await needed
// before attaching event handlers.
const stream = anthropic.messages.stream({
  model: 'claude-opus-4-7',
  messages: [...]
});

stream.on('text', (text) => {
  console.log('chunk:', text);
});
```
**SSE client (EventSource)**

```javascript
const es = new EventSource('/api/ai/stream?query=hello');

es.onmessage = (event) => {
  const { content } = JSON.parse(event.data);
  outputEl.textContent += content;
};
```

Note: `EventSource` only supports GET requests and auto-reconnects on network errors; for POST bodies, use the `fetch` reader approach above, and call `es.close()` when the stream is done.
Behind nginx, set `proxy_buffering off;` on the streaming location, or chunks will be held back and delivered in one burst.

SSE vs WebSocket: SSE is one-way (server→client), plain HTTP, and proxy-friendly. WebSocket is duplex and overkill for LLM responses, since no client→server streaming is needed.
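nginx buffers upstream responses by default, which is exactly what breaks streaming. A minimal location block, assuming a Node backend on port 3000 (paths and port are illustrative):

```nginx
location /api/ai/ {
    proxy_pass http://127.0.0.1:3000;
    proxy_http_version 1.1;
    proxy_buffering off;         # deliver chunks as they arrive
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_read_timeout 300s;     # long generations should not time out
}
```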
Time to first token is typically 300–800 ms. Factors: server region, model size, and prompt caching; OpenAI with a warm prompt cache can get down to ~150 ms.

To cut TTFT (time to first token): cache the prompt prefix, keep prompts short, and route to a smaller model on the low-latency path. A prompt-cache hit can cut TTFT roughly 10x.
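TTFT is easy to measure yourself: timestamp before the request, timestamp on the first chunk. A sketch (hypothetical helper `measureTTFT`; it works with any async iterable of text chunks, e.g. `result.textStream` from the Vercel AI SDK):

```javascript
// Consume an async iterable of text chunks and report time-to-first-token
// plus a rough throughput figure (chunks/sec, since token counts vary by model).
async function measureTTFT(stream) {
  const start = Date.now();
  let ttftMs = null;
  let chunks = 0;
  let text = '';
  for await (const chunk of stream) {
    if (ttftMs === null) ttftMs = Date.now() - start;
    chunks += 1;
    text += chunk;
  }
  // Guard against a 0 ms total on very fast (e.g. mocked) streams.
  const secs = Math.max(Date.now() - start, 1) / 1000;
  return { ttftMs, chunks, text, chunksPerSec: chunks / secs };
}
```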
What to monitor: TTFT and tokens/sec in your analytics. Use the <a href="/en/check">Enterno HTTP checker</a> for general endpoint health, and LangSmith or LangFuse for LLM-specific traces.