TLDR: Time to first token beats total generation time. Get something on screen within 800ms and the wait feels fast. This post covers SSE, Angular signals, the blinking cursor pattern, streaming markdown buffering, and a shipping checklist. One rule I won't compromise on: never throw away partial content on error.
A spinner for 8 seconds is a dead UI. The same 8 seconds with tokens streaming in — words appearing as the model generates them — feels fast. This isn't a perception trick. It's a fundamental shift in how users experience latency.
I've shipped three AI chat interfaces and the streaming UX work is consistently what separates the ones that feel polished from the ones that feel clunky. Here's everything I've learned about doing it right.
Why streaming changes the experience
LLMs generate tokens sequentially. Without streaming, the user waits for the entire response before seeing anything. With streaming:
- Time to first token (TTFT) drops to ~300–800ms
- Users can start reading while the model is still writing
- Users can interrupt early if the response is going the wrong direction
- The interface feels alive, not like a broken loading state
Anecdotally, users perceive streamed responses as roughly 3× faster even when total generation time is identical.
The number that matters is TTFT, not total generation time. Get something on screen in under 800ms and the experience feels responsive regardless of how long the full response takes.
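If TTFT is the number that matters, it helps to record it separately from total time. Here is a minimal sketch of how that could look. `StreamTimer` is a hypothetical helper, not part of any library, and timestamps are passed in explicitly so it works with whatever clock you use.

```typescript
// Hypothetical helper: records time to first token (TTFT) and total time.
// Timestamps are injected so the class works with performance.now() or Date.now().
class StreamTimer {
  private readonly startedAt: number;
  private firstChunkAt: number | null = null;
  private endedAt: number | null = null;

  constructor(startedAt: number) {
    this.startedAt = startedAt;
  }

  // Call for every chunk; only the first call sets TTFT.
  onChunk(now: number): void {
    if (this.firstChunkAt === null) this.firstChunkAt = now;
  }

  // Call once when the stream completes, is stopped, or fails.
  finish(now: number): void {
    this.endedAt = now;
  }

  // Milliseconds from request start to first visible text, or null if none arrived.
  get ttftMs(): number | null {
    return this.firstChunkAt === null ? null : this.firstChunkAt - this.startedAt;
  }

  // Milliseconds from request start to the end of the stream, or null if unfinished.
  get totalMs(): number | null {
    return this.endedAt === null ? null : this.endedAt - this.startedAt;
  }
}

// Usage with made-up timestamps: request at 0ms, first token at 420ms, done at 8000ms.
const timer = new StreamTimer(0);
timer.onChunk(420);
timer.onChunk(900);
timer.finish(8000);
console.log(timer.ttftMs, timer.totalMs); // 420 8000
```

Logging both values side by side makes it obvious when a regression hurts TTFT even though total time looks unchanged.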
The Server-Sent Events pattern
The standard transport for LLM streaming is SSE — a unidirectional HTTP stream from server to client. It's simpler than WebSockets for this use case because you only need one direction.
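On the wire, each SSE event is one or more `data:` lines followed by a blank line. Network chunks do not respect those boundaries, so a client has to hold on to any unfinished line until the rest arrives. The sketch below shows that buffering in isolation. `SseDataBuffer` is a hypothetical name, and it deliberately simplifies the spec: it treats every `data:` line as its own payload, which is enough when each event carries a single line of JSON.

```typescript
// Sketch of a chunk-safe reader for `data:` lines (simplified, see lead-in).
class SseDataBuffer {
  private pending = "";

  // Feed decoded text; returns the payload of every complete `data:` line.
  push(chunk: string): string[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" or an unfinished line; hold it for next time.
    this.pending = lines.pop() ?? "";

    const payloads: string[] = [];
    for (const rawLine of lines) {
      // Tolerate CRLF line endings.
      const line = rawLine.endsWith("\r") ? rawLine.slice(0, -1) : rawLine;
      if (!line.startsWith("data:")) continue;
      // One optional space after the colon is not part of the payload.
      const value = line.slice(5);
      payloads.push(value.startsWith(" ") ? value.slice(1) : value);
    }
    return payloads;
  }
}

// An event split across two network chunks still comes out whole.
const sse = new SseDataBuffer();
console.log(sse.push('data: {"text":"Hel'));       // nothing complete yet
console.log(sse.push('lo"}\n\ndata: [DONE]\n\n')); // {"text":"Hello"} and [DONE]
```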
Backend (Node.js / Express)
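import Anthropic from "@anthropic-ai/sdk";
import express from "express";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  const { messages } = req.body;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders(); // send headers immediately so the client sees the stream open

  // messages.stream() returns the stream object directly, so no await is needed
  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages,
  });

  // Stop generating tokens nobody will read if the client disconnects
  res.on("close", () => stream.abort());

  try {
    for await (const chunk of stream) {
      if (
        chunk.type === "content_block_delta" &&
        chunk.delta.type === "text_delta"
      ) {
        res.write(`data: ${JSON.stringify({ text: chunk.delta.text })}\n\n`);
      }
    }
    res.write("data: [DONE]\n\n");
  } catch (error) {
    // Headers are already sent, so just log and close; the missing [DONE]
    // marker means the stream ended early
    console.error("Stream failed or was aborted:", error);
  } finally {
    res.end();
  }
});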
Frontend (React)
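async function streamChat(
  messages: Message[],
  onChunk: (text: string) => void,
  signal?: AbortSignal, // optional: lets a Stop button cancel the request
) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
    signal,
  });
  // fetch only rejects on network failure, so surface HTTP errors explicitly
  if (!response.ok || !response.body) {
    throw Object.assign(new Error(`Chat request failed: ${response.status}`), {
      status: response.status,
      headers: response.headers,
    });
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // the last piece may be an incomplete line
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return;
      const { text } = JSON.parse(data);
      onChunk(text);
    }
  }
}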
Angular implementation with signals
Angular signals pair naturally with streaming: each token updates a single signal, and with OnPush (or zoneless) change detection Angular re-checks only the component views that read it rather than the whole tree.
import { Component, signal } from "@angular/core";

@Component({
  selector: "app-chat",
  // On Angular 17 or 18, also set standalone: true
  template: `
    <div class="message-stream">
      {{ streamingContent() }}
      @if (isStreaming()) {
        <span class="cursor">▋</span>
      }
    </div>
    <button (click)="stopStream()" [disabled]="!isStreaming()">Stop</button>
  `,
})
export class ChatComponent {
  streamingContent = signal("");
  isStreaming = signal(false);
  private abortController?: AbortController;

  async sendMessage(userMessage: string) {
    this.streamingContent.set("");
    this.isStreaming.set(true);
    this.abortController = new AbortController();
    try {
      const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages: [{ role: "user", content: userMessage }] }),
        signal: this.abortController.signal,
      });
      const reader = response.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunk boundaries
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? ""; // the last piece may be an incomplete line
        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const data = line.slice(6);
          if (data === "[DONE]") return; // the finally block still runs
          const { text } = JSON.parse(data);
          // Signal update: Angular refreshes the view that reads this signal
          this.streamingContent.update((prev) => prev + text);
        }
      }
    } catch (error) {
      // Stop button: keep whatever has streamed so far. Anything else is a real failure.
      if ((error as Error).name !== "AbortError") throw error;
    } finally {
      this.isStreaming.set(false);
    }
  }

  stopStream() {
    this.abortController?.abort();
  }
}
The key: signal.update() on every token marks only the views that read the signal for refresh, so with OnPush components the rest of the tree is not re-checked. On a fast model this can be 30+ updates per second. In my experience, signals handle this without any visible jank.
The cursor pattern — one detail that changes everything
Users expect a blinking cursor during generation. It signals "still thinking" and makes the streaming feel intentional rather than broken. This one detail dramatically improves perceived quality.
@keyframes blink {
0%, 100% { opacity: 1; }
50% { opacity: 0; }
}
.cursor {
display: inline-block;
width: 2px;
height: 1.1em;
background: currentColor;
margin-left: 1px;
animation: blink 1s step-start infinite;
vertical-align: text-bottom;
}
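Show it while isStreaming is true and hide it on completion. If you want the cursor to fade out rather than vanish, toggle a class with a short opacity transition instead of removing the element outright, because an element that is removed immediately never gets the chance to animate.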
Streaming structured data
Plain text is easy. The challenge is when you need to stream structured output — partial JSON, markdown with headings, or a mix of text and tool calls.
Streaming Markdown
Render markdown incrementally using a library that handles partial input gracefully. The naive approach flickers as partial markdown is parsed:
import { marked } from "marked";

// Naive: flickers on partial markdown
const html = marked(partialMarkdown);

// Better: buffer until a natural break point
function shouldFlush(buffer: string): boolean {
  return (
    buffer.includes("\n\n") ||
    buffer.endsWith("```\n") ||
    /[.!?]\s*$/.test(buffer)
  );
}
Buffer tokens until you hit a paragraph break, closing code fence, or end of sentence before re-rendering. Users don't notice the slight delay, and you avoid the flickering partial-parse artifacts.
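One gap in a string-matching check is that it cannot tell an opening code fence from a closing one, and a period inside a code block looks like the end of a sentence. A variation that tracks fence state could look like the sketch below. `splitRenderable` is a hypothetical helper, it only cuts at paragraph breaks and closing fences (it drops the end-of-sentence rule), and it assumes fences are written with three backticks.

```typescript
// Hypothetical helper: splits accumulated markdown into a prefix that is safe
// to render now and a tail that should wait for more tokens.
function splitRenderable(buffer: string): { ready: string; rest: string } {
  let inFence = false;
  let cut = 0;    // index up to which the text is safe to render
  let offset = 0; // index just past the line currently being examined

  const lines = buffer.split("\n");
  // The final element has no trailing newline yet, so it is never complete.
  for (const line of lines.slice(0, -1)) {
    offset += line.length + 1; // +1 for the "\n" that split() removed
    if (line.trimStart().startsWith("```")) {
      inFence = !inFence;
      if (!inFence) cut = offset; // a fence just closed
    } else if (!inFence && line.trim() === "") {
      cut = offset; // a blank line outside a fence ends a paragraph
    }
  }
  return { ready: buffer.slice(0, cut), rest: buffer.slice(cut) };
}

// The blank line inside the open code block is not treated as a break point.
const { ready, rest } = splitRenderable("Intro.\n\n```ts\nconst a = 1;\n\nconst b");
console.log(JSON.stringify(ready)); // "Intro.\n\n"
console.log(JSON.stringify(rest));  // "```ts\nconst a = 1;\n\nconst b"
```

A reasonable way to use it is to pass `ready` through the markdown renderer and show `rest` as plain text (or hold it back) until the next flush.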
Streaming JSON
Buffer until valid JSON is parseable:
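let jsonBuffer = "";

// Pass this to streamChat as the onChunk callback
function handleJsonChunk(text: string) {
  jsonBuffer += text;
  try {
    const parsed = JSON.parse(jsonBuffer);
    updateUI(parsed);
  } catch {
    // Still accumulating: the buffer is not yet a complete JSON document
  }
}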
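Waiting for a complete document means the UI only updates once, at the very end. If you want fields to appear as they arrive, one option (a different technique from the buffering above, and only a best-effort sketch) is to close whatever strings, arrays, and objects are still open and try parsing that. `parsePartialJson` is a hypothetical helper; it returns `undefined` whenever the prefix cannot be completed that way, in which case you keep showing the last good value.

```typescript
// Hypothetical helper: best-effort parse of a JSON prefix.
// Returns undefined when closing open strings/arrays/objects is not enough
// (for example after a dangling comma or a half-written literal).
function parsePartialJson(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      closers.push("}");
    } else if (ch === "[") {
      closers.push("]");
    } else if (ch === "}" || ch === "]") {
      closers.pop();
    }
  }

  let candidate = prefix;
  if (inString) {
    // Drop a trailing lone backslash so the closing quote is not escaped.
    if (escaped) candidate = candidate.slice(0, -1);
    candidate += '"';
  }
  candidate += closers.reverse().join("");

  try {
    return JSON.parse(candidate);
  } catch {
    return undefined;
  }
}

// Keep the last good value whenever the current prefix is not recoverable.
let jsonSoFar = "";
let lastGood: unknown;
for (const chunk of ['{"title":"Str', 'eaming","tags":["sse",', '"ux"]}']) {
  jsonSoFar += chunk;
  lastGood = parsePartialJson(jsonSoFar) ?? lastGood;
  console.log(JSON.stringify(lastGood));
}
```

The second chunk ends on a dangling comma, so that iteration keeps the value from the first one. Note that partially streamed strings (like a half-written title) will be shown as-is, which may or may not be what you want.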
Error handling and interruption
Always preserve partial content on error. The worst streaming UX I've seen throws away the entire half-generated response when something goes wrong, and the user just sees the response go blank. Never do this.
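async function streamWithErrorHandling(
  messages: Message[],
  onChunk: (text: string) => void,
) {
  try {
    await streamChat(messages, onChunk);
  } catch (error: any) {
    if (error?.name === "AbortError") {
      // User cancelled: show partial response with indicator
      appendToMessage("\n\n*[Stopped]*");
    } else if (error?.status === 429) {
      // fetch does not reject on HTTP errors, so this branch only runs if
      // streamChat throws an error carrying the response's status and headers
      showRetryAfter(error.headers?.get("retry-after"));
    } else {
      showError("Something went wrong. Partial response above.");
    }
  }
}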
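One way to make the "never throw away partial content" rule hard to break is to model the message so that no event can clear its text. The reducer below is a minimal sketch under that assumption. The `StreamedMessage` and `StreamEvent` types and the `reduceMessage` function are names I made up for illustration.

```typescript
// Hypothetical state model: status changes, text only ever grows.
type MessageStatus = "streaming" | "done" | "stopped" | "error";

interface StreamedMessage {
  text: string;
  status: MessageStatus;
}

type StreamEvent =
  | { type: "chunk"; text: string }
  | { type: "done" }
  | { type: "stopped" }
  | { type: "error" };

function reduceMessage(message: StreamedMessage, event: StreamEvent): StreamedMessage {
  if (event.type === "chunk") {
    // Ignore stray chunks that arrive after the stream has ended.
    return message.status === "streaming"
      ? { ...message, text: message.text + event.text }
      : message;
  }
  // Once a message is done, stopped, or errored, it stays that way.
  if (message.status !== "streaming") return message;
  // No branch resets `text`, so partial content survives every outcome.
  return { ...message, status: event.type };
}

// A stream that fails halfway keeps what it had.
let message: StreamedMessage = { text: "", status: "streaming" };
message = reduceMessage(message, { type: "chunk", text: "Half an " });
message = reduceMessage(message, { type: "chunk", text: "answer" });
message = reduceMessage(message, { type: "error" });
console.log(message); // text is still "Half an answer", status is "error"
```

The UI can then key off `status` to decide whether to show the cursor, a "Stopped" marker, or a retry prompt, without ever touching `text`.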
Performance: don't re-render on every token
The most common streaming performance mistake: updating state in a way that re-renders the full component tree on every token.
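import { useRef, useState } from "react";

// Inside your chat component:

// ❌ Re-renders the component and its children on every chunk
const [content, setContent] = useState("");
const appendViaState = (text: string) => setContent((prev) => prev + text);

// ✅ Direct DOM mutation for the stream target: bypasses React rendering.
// Render the span with no JSX children so React never overwrites the text.
const streamRef = useRef<HTMLSpanElement>(null);
const appendViaRef = (text: string) => {
  if (streamRef.current) {
    streamRef.current.textContent += text;
  }
};
// Pass either function to streamChat as its onChunk callback.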
Or use Angular signals as shown earlier: with OnPush components, an update refreshes only the view that reads the signal rather than re-checking the whole tree.
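A middle ground that works in any framework is to coalesce chunks and write to state at most once per frame. The sketch below shows the idea. `ChunkBatcher` is a hypothetical helper, and the scheduler is injected so the example runs anywhere; in a browser you would pass something like `(flush) => requestAnimationFrame(flush)`.

```typescript
// Hypothetical helper: collects chunks and delivers them in one batch per
// scheduled flush, so a burst of tokens causes a single state update.
type Schedule = (flush: () => void) => void;

class ChunkBatcher {
  private pending = "";
  private scheduled = false;
  private readonly schedule: Schedule;
  private readonly onFlush: (text: string) => void;

  constructor(schedule: Schedule, onFlush: (text: string) => void) {
    this.schedule = schedule;
    this.onFlush = onFlush;
  }

  push(text: string): void {
    this.pending += text;
    if (this.scheduled) return; // a flush is already queued
    this.scheduled = true;
    this.schedule(() => {
      const batch = this.pending;
      this.pending = "";
      this.scheduled = false;
      this.onFlush(batch);
    });
  }
}

// Demo with a manual scheduler standing in for requestAnimationFrame.
const queued: Array<() => void> = [];
const rendered: string[] = [];
const batcher = new ChunkBatcher(
  (flush) => { queued.push(flush); },
  (text) => { rendered.push(text); },
);

batcher.push("Hel");
batcher.push("lo");
batcher.push("!");
queued.shift()?.();    // simulate the next animation frame
console.log(rendered); // one batch containing "Hello!"
```

The trade-off is up to one frame of extra latency per update, which is far below what a reader can notice.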
UX checklist before shipping
I run through these before every streamed AI feature ships:
- Cursor visible during streaming, hidden on completion
- Stop button — users must be able to interrupt generation
- Partial response preserved if stopped or errored
- Auto-scroll follows the stream (but stops if user scrolled up)
- Loading state for TTFT — show something in the first 300ms before tokens arrive
- Mobile tested — SSE connections behave differently on flaky mobile networks
- Reconnect logic — implement exponential backoff for dropped SSE connections
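Two of these items come down to a few lines of arithmetic, so here is a rough sketch of each. Both function names, the 48px threshold, and the 500ms/15s backoff bounds are assumptions for illustration, not values from any library. The scroll check takes plain numbers (an element's `scrollTop`, `clientHeight`, and `scrollHeight`) so it can be tested without a DOM.

```typescript
// Auto-scroll only while the user is at (or near) the bottom of the container.
function shouldAutoScroll(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  thresholdPx = 48, // assumed tolerance; tune for your layout
): boolean {
  const distanceFromBottom = scrollHeight - (scrollTop + clientHeight);
  return distanceFromBottom <= thresholdPx;
}

// Exponential backoff with jitter for reconnect attempts (attempt starts at 0).
// The delay doubles each attempt up to maxMs, then is scaled into the
// 50-100% range so many clients do not all retry at the same instant.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 500,
  maxMs = 15000,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.round(ceiling * (0.5 + random() / 2));
}

// At the bottom: keep following the stream. Scrolled up 500px: leave it alone.
console.log(shouldAutoScroll(1400, 600, 2000)); // true
console.log(shouldAutoScroll(900, 600, 2000));  // false

// Upper bound of the delay for the first few attempts (random fixed at 1).
console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n, () => 1))); // 500, 1000, 2000, 4000
```

Check `shouldAutoScroll` before appending a chunk rather than after, since the new content changes `scrollHeight`.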
The stop button is non-negotiable. If a model starts going in the wrong direction, users need an escape. An AI interface without an interrupt mechanism is frustrating in a way that kills trust fast.