LLM streaming is the key to good UX: without it, the user stares at a blank screen for ~10 s until the full response arrives; with it, the first token shows up within 300–500 ms. The standard transport is Server-Sent Events (SSE) or chunked HTTP, and OpenAI, Anthropic, and Gemini all support `stream: true`. On the frontend: `fetch()` + `ReadableStream` + `TextDecoder`, appending to the UI incrementally.
Below: a step-by-step setup, working examples, common pitfalls, and an FAQ.
The flow in three steps:

1. Set `stream: true` on the LLM call.
2. Backend: forward each chunk with `res.write("data: " + chunk + "\n\n")`.
3. Frontend: read with `fetch(...).then(r => r.body.getReader())` and decode incrementally.
**Backend OpenAI streaming**

```javascript
// Assumes an Express-style handler; set SSE headers before writing chunks.
res.writeHead(200, {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
});

const stream = await openai.chat.completions.create({
  model: 'gpt-5',
  stream: true,
  messages: [...]
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  res.write(`data: ${JSON.stringify({ content })}\n\n`);
}
res.end();
```
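The `res.write` calls above rely on the SSE wire format: each event is a `data:` line terminated by a blank line. A tiny helper (hypothetical name `sseFrame`, not part of any SDK) keeps the framing consistent:

```javascript
// Serialize a payload as one Server-Sent Events frame.
// SSE framing: "data: <payload>\n\n" — the blank line ends the event.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Usage on the server:
// res.write(sseFrame({ content: 'Hel' }));
```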
**Frontend fetch streaming**

```javascript
const response = await fetch('/api/ai/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = decoder.decode(value, { stream: true });
  // Naive parse: assumes each network chunk contains whole "data: ..." lines.
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const { content } = JSON.parse(line.slice(6));
    outputEl.textContent += content;
  }
}
```
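Parsing SSE lines hides a real pitfall: a network chunk can end mid-line or mid-event, so robust clients buffer across `read()` calls. A sketch of an incremental parser (hypothetical `SSEParser`, assuming each `data:` line carries the JSON payload emitted by the backend above):

```javascript
// Incremental SSE parser: feed it raw text chunks; it returns the
// complete "data:" payloads found so far and buffers any partial tail.
class SSEParser {
  constructor() {
    this.buffer = '';
  }
  feed(chunk) {
    this.buffer += chunk;
    const events = [];
    let idx;
    // A blank line ("\n\n") terminates one SSE event.
    while ((idx = this.buffer.indexOf('\n\n')) !== -1) {
      const raw = this.buffer.slice(0, idx);
      this.buffer = this.buffer.slice(idx + 2);
      for (const line of raw.split('\n')) {
        if (line.startsWith('data: ')) events.push(line.slice(6));
      }
    }
    return events;
  }
}

// Usage in the read loop:
// const events = parser.feed(decoder.decode(value, { stream: true }));
// for (const e of events) outputEl.textContent += JSON.parse(e).content;
```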
**Vercel AI SDK (modern)**

```javascript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = streamText({
  model: openai('gpt-5'),
  messages: [...]
});

for await (const chunk of result.textStream) {
  console.log(chunk);
}
```
**Anthropic Claude streaming**

```javascript
// messages.stream() returns a MessageStream synchronously; no await needed
// before attaching event handlers.
const stream = anthropic.messages.stream({
  model: 'claude-opus-4-7',
  messages: [...]
});

stream.on('text', (text) => {
  console.log('chunk:', text);
});
```
**SSE client (EventSource)**

```javascript
const es = new EventSource('/api/ai/stream?query=hello');

es.onmessage = (event) => {
  const { content } = JSON.parse(event.data);
  outputEl.textContent += content;
};
```

Note: `EventSource` only supports GET requests and auto-reconnects on network errors; for POST bodies, use the `fetch` reader approach above, and call `es.close()` when the stream is done.
Behind nginx, set `proxy_buffering off;` on the streaming location, or chunks will be held back and delivered in one burst.

SSE vs WebSocket: SSE is one-way (server→client), plain HTTP, and proxy-friendly. WebSocket is duplex and overkill for LLM responses, since no client→server streaming is needed.
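nginx buffers upstream responses by default, which is exactly what breaks streaming. A minimal location block, assuming a Node backend on port 3000 (paths and port are illustrative):

```nginx
location /api/ai/ {
    proxy_pass http://127.0.0.1:3000;
    proxy_http_version 1.1;
    proxy_buffering off;         # deliver chunks as they arrive
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_read_timeout 300s;     # long generations should not time out
}
```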
Time to first token is typically 300–800 ms. Factors: server region, model size, and prompt caching; OpenAI with a warm prompt cache can get down to ~150 ms.

To cut TTFT (time to first token): cache the prompt prefix, keep prompts short, and route to a smaller model on the low-latency path. A prompt-cache hit can cut TTFT roughly 10x.
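TTFT is easy to measure yourself: timestamp before the request, timestamp on the first chunk. A sketch (hypothetical helper `measureTTFT`; it works with any async iterable of text chunks, e.g. `result.textStream` from the Vercel AI SDK):

```javascript
// Consume an async iterable of text chunks and report time-to-first-token
// plus a rough throughput figure (chunks/sec, since token counts vary by model).
async function measureTTFT(stream) {
  const start = Date.now();
  let ttftMs = null;
  let chunks = 0;
  let text = '';
  for await (const chunk of stream) {
    if (ttftMs === null) ttftMs = Date.now() - start;
    chunks += 1;
    text += chunk;
  }
  // Guard against a 0 ms total on very fast (e.g. mocked) streams.
  const secs = Math.max(Date.now() - start, 1) / 1000;
  return { ttftMs, chunks, text, chunksPerSec: chunks / secs };
}
```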
What to monitor: TTFT and tokens/sec in your analytics. Use the <a href="/en/check">Enterno HTTP checker</a> for general endpoint health, and LangSmith or LangFuse for LLM-specific traces.