
LLM Streaming

@humanspeak/svelte-markdown handles real-time streaming from Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and other AI assistants out of the box. As tokens arrive via Server-Sent Events (SSE) or WebSocket connections, you can write them directly to the component instance with writeChunk(), and the rendered markdown updates instantly.

How It Works

LLM APIs stream responses token-by-token. Each token is a small chunk of text — sometimes a word, sometimes a partial word, sometimes punctuation or whitespace. The typical integration pattern is:

  1. The LLM API sends tokens via Server-Sent Events (SSE) or a streaming HTTP response.
  2. Your application accumulates tokens into a growing markdown string, either manually or through writeChunk().
  3. SvelteMarkdown re-parses and re-renders the full source on each update.
  4. Svelte’s fine-grained reactivity ensures only the changed DOM nodes are updated.

The component works reactively by default. For best streaming performance, pass streaming={true} to enable smart token diffing — the component re-parses the full source for correctness but only updates DOM nodes for tokens that actually changed, keeping per-update cost under 2ms regardless of document size.

Note: streaming is automatically disabled when async extensions (e.g., Mermaid) are used, since async walkTokens callbacks are incompatible with the synchronous diffing path. A console warning is logged in this case.

Basic Usage

Preferred: imperative chunk writes

<script lang="ts">
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import type { StreamingChunk } from '@humanspeak/svelte-markdown'

    let markdown:
        | {
              writeChunk: (chunk: StreamingChunk) => void
              resetStream: (nextSource?: string) => void
          }
        | undefined

    async function streamFromAPI() {
        const response = await fetch('/api/chat', { method: 'POST', body: '...' })
        if (!response.body) throw new Error('No response body')
        const reader = response.body.getReader()
        const decoder = new TextDecoder()

        markdown?.resetStream('')

        while (true) {
            const { done, value } = await reader.read()
            if (done) break
            markdown?.writeChunk(decoder.decode(value, { stream: true }))
        }
    }
</script>

<SvelteMarkdown bind:this={markdown} source="" streaming={true} />

writeChunk() accepts:

  • string for append mode
  • { value, offset } for websocket-style offset patches

The first successful write locks the stream into one mode until resetStream() or a source prop reset. If an offset patch skips ahead, missing positions are padded with spaces.

With websocket-style offset chunks

markdown?.writeChunk({ value: 'Hel', offset: 0 })
markdown?.writeChunk({ value: 'lo', offset: 3 })
markdown?.writeChunk({ value: ' world', offset: 5 })

Offset chunks use overwrite semantics, not insert semantics. Each patch writes value starting at offset, preserves any trailing content after the overwritten span, and pads skipped gaps with spaces.

This means out-of-order websocket delivery is fine:

markdown?.writeChunk({ value: ' world', offset: 5 })
markdown?.writeChunk({ value: 'Hello', offset: 0 })

There is no delete or truncate behavior in offset mode, and offsets must be non-negative safe integers.
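
The overwrite-and-pad behavior can be modeled with a small pure function. This is an illustrative sketch of the semantics described above, not the component's actual internals:

```typescript
// Illustrative model of offset-patch semantics (not the component's actual code).
function applyOffsetPatch(current: string, value: string, offset: number): string {
    if (!Number.isSafeInteger(offset) || offset < 0) {
        throw new RangeError('offset must be a non-negative safe integer')
    }
    // Pad skipped positions with spaces if the patch starts past the current end.
    const padded = current.padEnd(offset, ' ')
    // Overwrite the span; any trailing content after it is preserved.
    return padded.slice(0, offset) + value + padded.slice(offset + value.length)
}

// Mirrors the out-of-order example above:
let s = applyOffsetPatch('', ' world', 5) // gap at positions 0-4 padded with spaces
s = applyOffsetPatch(s, 'Hello', 0)       // s === 'Hello world'
```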

Like append mode, the first successful write locks the stream shape until resetStream() or a source prop reset.

SDK Examples

The SDK examples below use chunked mode (the imperative writeChunk() API). Concat mode (the reactive source += chunk pattern) works with the same streams: append each delta to the source prop instead of calling writeChunk().
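
A minimal concat-mode counterpart, for comparison (a sketch: the /api/chat endpoint is hypothetical, and $state assumes Svelte 5):

```svelte
<script lang="ts">
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let source = $state('')

    async function streamFromAPI() {
        const response = await fetch('/api/chat', { method: 'POST', body: '...' })
        if (!response.body) throw new Error('No response body')
        const reader = response.body.getReader()
        const decoder = new TextDecoder()

        source = ''
        while (true) {
            const { done, value } = await reader.read()
            if (done) break
            source += decoder.decode(value, { stream: true })
        }
    }
</script>

<SvelteMarkdown {source} streaming={true} />
```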

With the Anthropic SDK (Claude)

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import Anthropic from '@anthropic-ai/sdk'

    let markdown

    async function streamResponse(prompt) {
        const client = new Anthropic()

        const stream = client.messages.stream({
            model: 'claude-sonnet-4-20250514',
            max_tokens: 1024,
            messages: [{ role: 'user', content: prompt }]
        })

        markdown?.resetStream('')

        for await (const event of stream) {
            if (
                event.type === 'content_block_delta' &&
                event.delta.type === 'text_delta'
            ) {
                markdown?.writeChunk(event.delta.text)
            }
        }
    }
</script>

<SvelteMarkdown bind:this={markdown} source="" streaming={true} />

With the OpenAI SDK (ChatGPT)

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import OpenAI from 'openai'

    let markdown

    async function streamResponse(prompt) {
        const client = new OpenAI()

        const stream = await client.chat.completions.create({
            model: 'gpt-4o',
            messages: [{ role: 'user', content: prompt }],
            stream: true
        })

        markdown?.resetStream('')

        for await (const chunk of stream) {
            const delta = chunk.choices[0]?.delta?.content
            if (delta) {
                markdown?.writeChunk(delta)
            }
        }
    }
</script>

<SvelteMarkdown bind:this={markdown} source="" streaming={true} />

With fetch and Server-Sent Events

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let markdown

    async function streamFromAPI(prompt) {
        const response = await fetch('/api/chat', {
            method: 'POST',
            body: JSON.stringify({ prompt }),
            headers: { 'Content-Type': 'application/json' }
        })

        if (!response.ok) throw new Error(`HTTP ${response.status}`)

        const reader = response.body.getReader()
        const decoder = new TextDecoder()

        markdown?.resetStream('')

        try {
            while (true) {
                const { done, value } = await reader.read()
                if (done) break
                markdown?.writeChunk(decoder.decode(value, { stream: true }))
            }
        } finally {
            reader.releaseLock()
        }
    }
</script>

<SvelteMarkdown bind:this={markdown} source="" streaming={true} />
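
The reader above forwards raw decoded text. If your endpoint emits framed SSE events (data: lines separated by blank lines), extract the payloads before writing them. A minimal parser sketch (illustrative; assumes one plain-text payload per data: line):

```typescript
// Illustrative SSE frame parser: collects `data:` payloads from a text buffer
// and returns any trailing partial frame so the caller can keep buffering.
function parseSseChunk(buffer: string): { payloads: string[]; rest: string } {
    const payloads: string[] = []
    const frames = buffer.split('\n\n')
    const rest = frames.pop() ?? '' // the last piece may be an incomplete frame
    for (const frame of frames) {
        for (const line of frame.split('\n')) {
            if (line.startsWith('data: ')) payloads.push(line.slice(6))
            else if (line.startsWith('data:')) payloads.push(line.slice(5))
        }
    }
    return { payloads, rest }
}
```

Carry rest across reads, prepend it to the next decoded chunk, and call markdown?.writeChunk(payload) for each extracted payload.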

Performance Characteristics

We measured render performance across different streaming speeds and chunking strategies using the interactive streaming demo:

Streaming Speed    Chunk Mode    Avg Render    Peak Render    Dropped Frames
30 words/sec       Word          ~3ms          ~11ms          0
100 chars/sec      Character     ~4ms          ~21ms          0
50 words/sec       Word          ~3ms          ~12ms          0

All render times stay well under the 16.7ms frame budget (60fps), meaning the browser has time to paint every frame without jank. Even at 100 characters per second in character mode (a worst-case scenario far beyond real LLM speeds), average render time remains under 5ms.

How Render Time Scales

Render time grows linearly with document length because the full markdown source is re-parsed on each update. For a typical LLM response (~2,000 characters), the overhead is negligible:

  • 0-500 chars: <1ms per render
  • 500-1,000 chars: ~2-3ms per render
  • 1,000-2,000 chars: ~5-7ms per render
  • 2,000+ chars: ~7-10ms per render

For very long documents (10,000+ characters), consider the optimization strategies below.

Best Practices

1. Prefer Word-Level Chunking

If you control the chunking strategy (e.g., in a custom SSE endpoint), emit tokens at word boundaries rather than individual characters. This reduces the total number of re-renders while producing the same visual result:

// Server-side: buffer tokens and emit at word boundaries
let buffer = ''
for await (const token of llmStream) {
    buffer += token
    if (buffer.endsWith(' ') || buffer.endsWith('\n')) {
        controller.enqueue(encoder.encode(buffer))
        buffer = ''
    }
}
if (buffer) controller.enqueue(encoder.encode(buffer))
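
The same boundary rule written as a pure function (a hypothetical helper, convenient for unit-testing your chunking strategy):

```typescript
// Groups streamed tokens into word-boundary chunks, matching the server-side
// buffering above: flush whenever the buffer ends in a space or newline.
function chunkAtWordBoundaries(tokens: string[]): string[] {
    const chunks: string[] = []
    let buffer = ''
    for (const token of tokens) {
        buffer += token
        if (buffer.endsWith(' ') || buffer.endsWith('\n')) {
            chunks.push(buffer)
            buffer = ''
        }
    }
    if (buffer) chunks.push(buffer) // flush any trailing partial word
    return chunks
}
```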

2. Use Token Caching for Chat History

When displaying a conversation with multiple messages, previously completed messages would otherwise be re-parsed on every update. The token cache lets completed messages skip re-parsing:

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let messages = $state([])
</script>

{#each messages as message}
    <!-- Completed messages hit the token cache automatically -->
    <SvelteMarkdown source={message.content} />
{/each}

3. Debounce for Extremely Fast Streams

If your LLM stream is unusually fast (100+ tokens/second) and you notice frame drops, you can batch updates using requestAnimationFrame:

let pending = ''
let rafScheduled = false

function onToken(token) {
    pending += token
    if (!rafScheduled) {
        rafScheduled = true
        requestAnimationFrame(() => {
            source += pending
            pending = ''
            rafScheduled = false
        })
    }
}

This coalesces multiple tokens into a single render per frame, reducing total renders from ~100/sec to ~60/sec while maintaining smooth visual output.

Estimating Streaming Costs

When building production LLM streaming UIs, understanding token costs is as important as render performance. Each streamed token has a price that varies by model, provider, and whether it’s an input or output token. ModelPricing.ai provides a pricing estimation API that covers all major LLM providers — useful for displaying real-time cost tracking alongside your streamed responses, setting usage budgets, or building cost-aware model selection into your application.

Try It Live

Experiment with different streaming speeds, jitter, chunk modes, and websocket-style offset patch simulation in the interactive LLM streaming demo.

Related