
LLM inference is expensive. When a user’s tab gets suspended, their network flaps, or they refresh the page, you don’t want to re-run the generation—you want them to pick up exactly where they left off. Durable Streams makes AI conversation streaming production-ready with built-in resumability, exactly-once delivery, and multi-viewer support.

The problem with ephemeral streams

Traditional SSE or WebSocket streaming for AI responses faces critical issues:
  • Tab suspension - Mobile browsers and backgrounded tabs kill connections mid-generation
  • Page refresh - Users lose partial responses, forcing expensive re-generation
  • Network flaps - Brief disconnections lose all in-flight tokens
  • No sharing - Can’t share a live generation link with teammates
  • No resumption - Reconnections start from scratch or show nothing
With Durable Streams, the generation continues server-side even if the client disconnects. Users reconnect and resume from their last seen token—no re-generation needed.

Basic implementation

1. Server: Stream tokens to a durable stream

Stream LLM tokens to a durable stream. The generation continues even if the client disconnects.
import { DurableStream } from "@durable-streams/client"

// Create a stream for this generation
const stream = await DurableStream.create({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  contentType: "text/plain",
})

// Stream tokens as they arrive from the LLM
for await (const token of llm.stream(prompt)) {
  await stream.append(token)
}

// Close stream when generation completes
await stream.close()
2. Client: Resume from last position

Clients read from their last seen offset. If they disconnect and reconnect, they continue from where they left off.
import { stream } from "@durable-streams/client"

// Resume from saved offset (or "-1" to read from the beginning)
const res = await stream({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  offset: lastSeenOffset,
  live: true,
})

// Render tokens as they arrive
res.subscribe((chunk) => {
  renderTokens(chunk.data)
  saveOffset(chunk.offset) // Persist for next resume
})
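
The renderTokens and saveOffset helpers above are placeholders. As a minimal browser-side sketch, the offset can be persisted in localStorage so a page refresh resumes from the last rendered token (the storage key and the loadOffset helper are assumptions, not part of the client API):
const OFFSET_KEY = `stream-offset-${generationId}`

// Persist the last processed offset across refreshes
function saveOffset(offset) {
  localStorage.setItem(OFFSET_KEY, offset)
}

// Load the saved offset, or "-1" to read from the beginning
function loadOffset() {
  return localStorage.getItem(OFFSET_KEY) ?? "-1"
}

const res = await stream({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  offset: loadOffset(),
  live: true,
})

res.subscribe((chunk) => {
  renderTokens(chunk.data)
  saveOffset(chunk.offset)
})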

Production-grade implementation with IdempotentProducer

For high-throughput, reliable token delivery with exactly-once guarantees, use IdempotentProducer:
import {
  DurableStream,
  IdempotentProducer,
  StaleEpochError,
} from "@durable-streams/client"

// Create stream for this generation
const stream = await DurableStream.create({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  contentType: "text/plain",
})

// Create idempotent producer for reliable, exactly-once delivery
let fenced = false
const producer = new IdempotentProducer(stream, generationId, {
  autoClaim: true,  // Auto-recover if another worker took over
  lingerMs: 10,     // Send batches every 10ms for low latency
  onError: (error) => {
    if (error instanceof StaleEpochError) {
      fenced = true
      console.log("Another worker took over, stopping gracefully")
    }
  },
})

// Stream tokens - fire-and-forget, automatically batched & pipelined
for await (const token of llm.stream(prompt)) {
  if (fenced) break
  producer.append(token) // Don't await - errors go to onError callback
}

// Ensure all tokens are delivered before shutdown
await producer.flush()
await producer.close()

What this enables

Tab suspended

User’s tab gets suspended by the browser? They come back and catch up from their saved offset—no re-generation.

Page refresh

User refreshes mid-generation? They continue from the last token, not from the beginning.

Share live streams

Multiple viewers can watch the same stream in real-time. Share a generation link with teammates.

Multi-device

Start on mobile, continue on desktop—same stream, same position across all devices.

Worker failover

If a worker crashes mid-generation, another can take over with a new epoch—no duplicate tokens. See the sketch below.

CDN scaling

Shared streams leverage CDN caching. Thousands of viewers = one origin request.
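
As a sketch of the failover path, reusing only the APIs shown above: a replacement worker attaches to the same stream and creates an IdempotentProducer with the same producer ID and autoClaim enabled, which claims the epoch so the crashed worker's late appends are rejected with StaleEpochError. How the replacement discovers the generation and where it restarts the LLM from is application-specific; resumeGeneration is an assumed helper, not part of the library.
// Replacement worker: attach to the existing generation stream
const stream = await DurableStream.create({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  contentType: "text/plain",
})

// Same producer ID as the crashed worker; autoClaim claims the epoch,
// fencing the old worker's in-flight appends
const producer = new IdempotentProducer(stream, generationId, {
  autoClaim: true,
})

// Restarting generation from the correct point (e.g. by re-reading
// the stream tail) is application-specific; resumeGeneration is assumed
for await (const token of resumeGeneration(generationId)) {
  producer.append(token)
}

await producer.flush()
await producer.close()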

Streaming structured output

For structured LLM responses (JSON, code, etc.), use the application/json content type:
import { DurableStream, IdempotentProducer } from "@durable-streams/client"

const stream = await DurableStream.create({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  contentType: "application/json",
})

const producer = new IdempotentProducer(stream, generationId, {
  autoClaim: true,
  lingerMs: 10,
})

// Stream structured events
for await (const event of llm.streamStructured(prompt)) {
  // Each event is a complete JSON object
  producer.append(JSON.stringify(event))
}

await producer.flush()
await producer.close()
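
On the read side, clients can consume these events with subscribeJson (also used in the agent example below), which delivers each appended object already parsed; handleStructuredEvent is an assumed application-level handler:
const res = await stream({
  url: `https://your-server.com/v1/stream/generation/${generationId}`,
  offset: lastSeenOffset,
  live: true,
})

// Each item arrives as a parsed JSON object, in append order
res.subscribeJson(async (batch) => {
  for (const event of batch.items) {
    handleStructuredEvent(event) // Assumed application handler
  }
  saveOffset(batch.offset)
})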

Agentic applications

For agentic apps that stream tool outputs and progress events over long-running sessions:
import { DurableStream, IdempotentProducer } from "@durable-streams/client"

const stream = await DurableStream.create({
  url: `https://your-server.com/v1/stream/agent/${sessionId}`,
  contentType: "application/json",
})

const producer = new IdempotentProducer(stream, `agent-${sessionId}`, {
  autoClaim: true,
})

// Stream agent progress events
for await (const event of agent.run(task)) {
  producer.append(JSON.stringify({
    type: event.type, // "thinking", "tool_call", "result", etc.
    timestamp: Date.now(),
    data: event.data,
  }))
}

await producer.flush()
await producer.close()
Clients receive all events in order, even across reconnections:
const res = await stream({
  url: `https://your-server.com/v1/stream/agent/${sessionId}`,
  offset: lastSeenOffset,
  live: true,
})

res.subscribeJson(async (batch) => {
  for (const event of batch.items) {
    switch (event.type) {
      case "thinking":
        showThinkingIndicator(event.data)
        break
      case "tool_call":
        renderToolCall(event.data)
        break
      case "result":
        displayResult(event.data)
        break
    }
  }
  saveOffset(batch.offset)
})

Handling generation errors

Store error states in the stream itself so clients see them on reconnection:
try {
  for await (const token of llm.stream(prompt)) {
    producer.append(token)
  }
  await producer.flush()
  await producer.close()
} catch (error) {
  // Write error as final message before closing
  const errorMessage = JSON.stringify({
    type: "error",
    message: error.message,
    timestamp: Date.now(),
  })
  await producer.close(errorMessage)
}
Clients check the streamClosed flag to detect completion:
res.subscribeJson(async (batch) => {
  for (const event of batch.items) {
    if (event.type === "error") {
      showError(event.message)
    } else {
      renderTokens(event)
    }
  }
  
  if (batch.streamClosed) {
    console.log("Generation complete")
  }
  
  saveOffset(batch.offset)
})

Multi-viewer scenarios

A stream is just a URL. Multiple viewers can watch the same stream together in real-time without duplicating work.

Shared debugging

// Customer support watching a user's AI session
const res = await stream({
  url: `https://your-server.com/v1/stream/user/${userId}/session/${sessionId}`,
  offset: "-1", // Watch from beginning
  live: true,
})

res.subscribe((chunk) => {
  // Support agent sees same tokens as user in real-time
  console.log(chunk.data)
})

Live collaboration

// Team watching a shared AI brainstorming session
const res = await stream({
  url: `https://your-server.com/v1/stream/workspace/${workspaceId}/brainstorm`,
  offset: "now", // Skip history, join live
  live: true,
})

res.subscribeJson(async (batch) => {
  for (const idea of batch.items) {
    displayIdea(idea)
  }
})

Performance optimization

Batching for throughput

const producer = new IdempotentProducer(stream, generationId, {
  maxBatchBytes: 1024 * 1024, // 1MB batches
  lingerMs: 50,               // Wait up to 50ms to accumulate batch
  maxInFlight: 5,             // Pipeline up to 5 concurrent batches
})

// High-throughput streaming - batched automatically
for await (const token of llm.stream(prompt)) {
  producer.append(token) // Returns immediately
}

Low-latency configuration

const producer = new IdempotentProducer(stream, generationId, {
  lingerMs: 10,      // Send batches every 10ms for low latency
  maxBatchBytes: 4096, // Smaller batches for faster delivery
})

CDN configuration

Configure your CDN to cache historical reads and collapse concurrent requests to the same offset.
Example Cloudflare configuration:
// Cloudflare Worker
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url)
    
    // Cache historical reads (offset != "now")
    if (url.searchParams.get("offset") !== "now") {
      const cache = caches.default
      let response = await cache.match(request)
      
      if (!response) {
        response = await fetch(request)
        // Reconstruct the response so its headers are mutable,
        // then cache for 60 seconds with stale-while-revalidate
        response = new Response(response.body, response)
        response.headers.set(
          "Cache-Control",
          "public, max-age=60, stale-while-revalidate=300"
        )
        ctx.waitUntil(cache.put(request, response.clone()))
      }
      
      return response
    }
    
    // Live tail reads (offset = "now") bypass the cache
    return fetch(request)
  },
}

Next steps

Idempotent Producer

Learn about exactly-once delivery guarantees

Client Libraries

Explore streaming APIs for all platforms

SSE vs Long-Poll

Choose the right live mode for your use case

Error Handling

Handle errors and implement retry strategies