
How to Stream LLM Responses

Key idea:

LLM streaming is the key to good UX. Without streaming, the user stares at a blank screen for ~10 s while the full response is generated; with streaming, the first token arrives within 300-500 ms. The usual transport is Server-Sent Events (SSE) or plain chunked HTTP. OpenAI, Anthropic, and Gemini all support stream: true. On the frontend: fetch() + ReadableStream + TextDecoder, appending to the UI incrementally.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Backend: set stream: true on LLM call
  2. Return Response with Content-Type: text/event-stream (or plain chunked)
  3. Forward each chunk to the client: res.write("data: " + chunk + "\n\n")
  4. Client: fetch(...).then(r => r.body.getReader())
  5. Loop: read() → decoder.decode() → parse SSE → append to DOM
  6. UI: show a "typing" indicator, smooth auto-scroll
  7. Error handling: abort, network fail, token limit
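Steps 1-3 glued together look roughly like this minimal Node-style handler (a sketch: the already-initialized openai client, the route wiring, and the model name are assumptions to adapt to your stack):

```javascript
// Minimal SSE endpoint (steps 1-3): forward each LLM delta to the client
// as soon as it arrives. `openai` is an already-initialized client object.
function sseHeaders() {
  return {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache', // keep intermediaries from buffering/caching
    Connection: 'keep-alive',
  };
}

async function handleChat(req, res, openai) {
  res.writeHead(200, sseHeaders());
  const stream = await openai.chat.completions.create({
    model: 'gpt-5', // substitute whichever model your account exposes
    stream: true,
    messages: req.body.messages,
  });
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    if (content) res.write(`data: ${JSON.stringify({ content })}\n\n`);
  }
  res.end();
}
```

Wire it up with e.g. app.post('/api/ai/chat', (req, res) => handleChat(req, res, openai)) in Express.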

Working Examples

Backend OpenAI streaming:

    const stream = await openai.chat.completions.create({
      model: 'gpt-5',
      stream: true,
      messages: [...]
    });
    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
    res.end();

Frontend fetch streaming:

    const response = await fetch('/api/ai/chat', {
      method: 'POST',
      body: JSON.stringify({ message })
    });
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const chunk = decoder.decode(value);
      for (const line of chunk.split('\n')) {   // parse SSE lines
        if (!line.startsWith('data: ')) continue;
        const { content } = JSON.parse(line.slice(6));
        outputEl.textContent += content;        // append to UI
      }
    }

Vercel AI SDK (modern):

    import { streamText } from 'ai';
    import { openai } from '@ai-sdk/openai';

    const result = streamText({ model: openai('gpt-5'), messages: [...] });
    for await (const chunk of result.textStream) { console.log(chunk); }

Anthropic Claude streaming:

    const stream = anthropic.messages.stream({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      messages: [...]
    });
    stream.on('text', (text) => { console.log('chunk:', text); });

SSE client (EventSource):

    const es = new EventSource('/api/ai/stream?query=hello');
    es.onmessage = (event) => {
      const { content } = JSON.parse(event.data);
      outputEl.textContent += content;
    };
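One detail the frontend fetch example glosses over: a single read() can return half an SSE line, or several events at once. A small stateful parser handles this (a sketch; the data: JSON event format matches the backend example above):

```javascript
// Incremental SSE parser: feed it raw decoded chunks, get back complete
// `data:` payloads even when a network chunk ends mid-line.
function createSseParser() {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    const events = [];
    let idx;
    // SSE events are separated by a blank line ("\n\n").
    while ((idx = buffer.indexOf('\n\n')) !== -1) {
      const rawEvent = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2);
      for (const line of rawEvent.split('\n')) {
        if (line.startsWith('data: ')) events.push(line.slice(6));
      }
    }
    return events;
  };
}
```

In the read loop: create const feed = createSseParser() once, then for (const data of feed(decoder.decode(value, { stream: true }))) append JSON.parse(data).content to the UI (skipping a '[DONE]' payload first, if your backend emits that OpenAI-style terminator).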

Common Pitfalls

  • Buffering: nginx buffers responses by default; add proxy_buffering off; to the streaming location (or send an X-Accel-Buffering: no response header)
  • Timeouts: don't put a short timeout (e.g. 10 s) on a streaming request; LLM responses can take 30 s or more
  • Abort: the user can cancel mid-stream; clean up the AbortController and cancel the upstream LLM request
  • Retries: never retry a response that has already started streaming, or the UI shows duplicate tokens
  • Caching: don't put a caching intermediary (Cloudflare etc.) in front of streaming endpoints
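The abort pitfall in practice: thread one AbortController through the fetch so a user cancel tears down both the request and the read loop (a sketch; the endpoint path and callback wiring are illustrative):

```javascript
// Cancellable streaming request: calling controller.abort() rejects the
// in-flight fetch and any pending reader.read() with an AbortError.
const controller = new AbortController();

async function streamChat(message, onChunk) {
  const response = await fetch('/api/ai/chat', {
    method: 'POST',
    body: JSON.stringify({ message }),
    signal: controller.signal, // cancels the request on abort()
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      onChunk(decoder.decode(value, { stream: true }));
    }
  } catch (err) {
    if (err.name !== 'AbortError') throw err; // swallow user cancels only
  } finally {
    reader.releaseLock();
  }
}

// e.g. stopButton.onclick = () => controller.abort();
```

Server-side, listen for the request closing (req.on('close', ...) in Node) and cancel the upstream LLM call so you stop paying for tokens nobody will see.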


Frequently Asked Questions

SSE vs WebSocket?

SSE is one-way server→client, plain HTTP, and proxy-friendly. WebSocket is full-duplex, which is overkill for LLM chat: there is no client→server streaming to carry.

First token latency?

Typically 300-800 ms. Factors: server region, model size, prompt caching. OpenAI with a warm prompt cache can reach around 150 ms.

How to minimise delay?

Reduce TTFT (time to first token) by caching the prompt prefix, keeping prompts short, and routing latency-sensitive paths to a smaller model. A prompt-cache hit can cut TTFT by roughly 10x.

How to monitor latency?

Track TTFT and tokens/sec in your analytics. <a href="/en/check">Enterno HTTP checker</a> covers the endpoint itself; LangSmith or Langfuse for LLM-specific traces.
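Both numbers can be captured client-side by wrapping the chunk callback (a sketch; true token counts live server-side, so characters / 4 is used here as a rough token proxy):

```javascript
// Wrap a chunk callback to record time-to-first-token (TTFT) and
// throughput. chars / 4 is a rough proxy for token count.
function instrument(onChunk) {
  const stats = { ttftMs: null, chars: 0, startedAt: Date.now() };
  return {
    stats,
    onChunk(chunk) {
      if (stats.ttftMs === null) stats.ttftMs = Date.now() - stats.startedAt;
      stats.chars += chunk.length;
      onChunk(chunk);
    },
    tokensPerSec() {
      const elapsedSec = (Date.now() - stats.startedAt) / 1000;
      return elapsedSec > 0 ? stats.chars / 4 / elapsedSec : 0;
    },
  };
}
```

Usage: const m = instrument((c) => { outputEl.textContent += c; }), stream with m.onChunk, then report m.stats.ttftMs and m.tokensPerSec() to your analytics when the stream ends.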