WSS /apis/live-tts/ws

Bidirectional WebSocket endpoint for real-time text-to-speech. Push text as you have it; receive audio as it’s synthesized, in strict segment order. Designed for live captions, narration over streaming LLM output, interactive voice apps — anywhere you want playback to start before the writer is finished writing.
wss://client.camb.ai/apis/live-tts/ws
Authenticate with your CambAI API key via the x-api-key header (or ?api_key=... query parameter for clients that can’t set headers).
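For clients that can't set headers, the key goes in the query string instead. A minimal sketch, assuming the `api_key` parameter name shown above; `build_ws_url` is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

LIVE_TTS_URL = "wss://client.camb.ai/apis/live-tts/ws"


def build_ws_url(api_key: str, base_url: str = LIVE_TTS_URL) -> str:
    """Return the endpoint URL with the key passed as ?api_key=... (percent-encoded)."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    # urlencode escapes characters that would otherwise break the query string.
    return f"{base_url}?{urlencode({'api_key': api_key})}"
```

Prefer the header where your client allows it: URLs, including their query strings, often end up in proxy and server logs.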

Quickstart

A complete, copy-pasteable client. Connect → configure → stream text → write the audio to a file.
import asyncio
import json
import websockets


async def synthesize(api_key: str, text: str, out_path: str = "out.mp3") -> None:
    url = "wss://client.camb.ai/apis/live-tts/ws"
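    # `additional_headers` is the keyword in websockets >= 14; the legacy
    # client in older releases calls it `extra_headers`.
    async with websockets.connect(
        url,
        additional_headers=[("x-api-key", api_key)],
    ) as ws: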
        # 1. First frame: session config.
        await ws.send(json.dumps({
            "type": "session.start",
            "voice_id": 6460,
            "language": "en-us",
            "output_format": "mp3",
            "word_timestamps": True,
        }))

        # 2. Server confirms with session.ready.
        ready = json.loads(await ws.recv())
        assert ready["type"] == "session.ready", ready
        print(f"session {ready['session_id']} run_id={ready['run_id']}")

        # 3. Stream text. You can call text.chunk many times; the server
        #    segments based on content (and on a 1s idle flush).
        await ws.send(json.dumps({"type": "text.chunk", "text": text}))
        await ws.send(json.dumps({"type": "text.done"}))

        # 4. Receive ordered audio + json frames until the session ends.
        with open(out_path, "wb") as f:
            async for msg in ws:
                if isinstance(msg, bytes):
                    f.write(msg)
                    continue

                frame = json.loads(msg)
                kind = frame["type"]
                if kind == "segment.start":
                    print(f"  segment {frame['segment_id']}: {frame['text']!r}")
                    for w in frame.get("word_timestamps", []):
                        print(f"    {w['start']:6.2f}s → {w['end']:6.2f}s  {w['word']}")
                elif kind == "segment.skipped":
                    print(f"  ! skipped segment {frame['segment_id']}: {frame['text']!r}")
                elif kind == "session.done":
                    print("done")
                    break
                elif kind == "session.error":
                    raise RuntimeError(frame["error"])


asyncio.run(synthesize(
    api_key="your-camb-api-key",
    text="Hello, world. This is a streaming text-to-speech demo.",
))
That's the whole integration surface. Everything below is reference for the messages you'll exchange.

Integration in 4 steps

1. Open the socket with your API key

async with websockets.connect(
    "wss://client.camb.ai/apis/live-tts/ws",
    additional_headers=[("x-api-key", "your-camb-api-key")],
) as ws:
    ...
Missing or invalid key → server closes with code 4401.

2. Send `session.start` as the first frame

{
  "type": "session.start",
  "voice_id": 6460,
  "language": "en-us",
  "output_format": "mp3",
  "word_timestamps": true,

  "enhance_named_entities_pronunciation": false,
  "apply_enhancement": null,
  "enhance_reference_audio_quality": false,
  "maintain_source_accent": false,
  "speaking_rate": null,
  "inference_steps": null
}
voice_id is the only required field; everything else has a sensible default. The tuning knobs mirror the regular POST /tts-stream API one-for-one (enhance_named_entities_pronunciation, apply_enhancement, enhance_reference_audio_quality, maintain_source_accent, speaking_rate), so you can port a working /tts-stream payload directly. See the full reference at the top of the page for types and defaults.

Wait for the session.ready reply (it carries session_id and run_id). A malformed first frame, forbidden voice, or unsupported language → session.error, then close 4400.
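Since only voice_id is required, a small builder can keep the first frame tidy. This is a sketch, not an official client: `build_session_start` is a hypothetical helper, and it assumes that leaving a field out has the same effect as sending it as null.

```python
import json
from typing import Any


def build_session_start(voice_id: int, **options: Any) -> str:
    """Serialize a session.start frame, leaving unset (None) options out."""
    if not isinstance(voice_id, int) or isinstance(voice_id, bool):
        raise TypeError("voice_id is required and must be an integer")
    # Drop None values so the server falls back to its own defaults
    # (assumption: an omitted field behaves like an explicit null).
    frame: dict[str, Any] = {k: v for k, v in options.items() if v is not None}
    # Set these last so a stray `type=` option cannot override them.
    frame.update(type="session.start", voice_id=voice_id)
    return json.dumps(frame)
```

For example, `build_session_start(6460, language="en-us", speaking_rate=None)` produces a frame with only type, voice_id, and language.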

3. Stream text in

{"type": "text.chunk", "text": "Hello, "}
{"type": "text.chunk", "text": "world."}
Push as fast or as slowly as you like. The server segments by content (sentence boundaries), and idle-flushes after 1 second of silence — so for live use cases (LLM token stream, transcribed mic input) you don’t need to send text.done until the session is truly over.

4. Read ordered audio + lifecycle frames

For each segment N, the server emits, in order:
segment.start N → <binary audio chunks> → segment.done N
Segment N's frames are completely emitted before any of segment N+1's, even though synthesis runs concurrently behind the scenes. Concatenate the binary frames per segment_id and you have playable audio.

When everything is done you'll receive session.done, followed by a clean close.
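Because of this ordering guarantee, a client never has to reorder anything: each binary frame belongs to the most recent segment.start. The sketch below groups a recorded sequence of frames by segment and raises if the sequence breaks the documented order, which can be handy in tests. `collect_segments` is a hypothetical helper; it ignores frame types it does not know.

```python
import json
from typing import Iterable, Optional, Union


def collect_segments(frames: Iterable[Union[str, bytes]]) -> dict[int, bytes]:
    """Group binary frames under the segment opened by the preceding segment.start."""
    audio: dict[int, bytearray] = {}
    current: Optional[int] = None  # segment_id of the segment currently open
    for msg in frames:
        if isinstance(msg, bytes):
            if current is None:
                raise ValueError("binary frame outside segment.start/segment.done")
            audio[current].extend(msg)
            continue
        frame = json.loads(msg)
        if frame["type"] == "segment.start":
            current = frame["segment_id"]
            audio[current] = bytearray()
        elif frame["type"] == "segment.done":
            if frame["segment_id"] != current:
                raise ValueError("segment.done does not match the open segment")
            current = None
        elif frame["type"] == "session.done":
            break
    return {sid: bytes(buf) for sid, buf in audio.items()}
```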

Common patterns

Stream from an LLM

Push tokens straight from the model. Don't send text.done between tokens or after each sentence; the idle flush handles in-flight buffering. Send it once, when the LLM is finished, to flush the tail and end the session cleanly.
async for token in llm_stream():
    await ws.send(json.dumps({"type": "text.chunk", "text": token}))

# LLM finished; flush any tail and end cleanly.
await ws.send(json.dumps({"type": "text.done"}))
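The snippet above leaves `llm_stream()` undefined. Here is a self-contained sketch of the same pattern that you can run without a network connection: `pipe_tokens`, `RecordingSocket`, and `fake_llm` are hypothetical names, and the recording socket only stands in for a real connection.

```python
import asyncio
import json
from typing import AsyncIterator, Protocol


class Sender(Protocol):
    async def send(self, message: str) -> None: ...


async def pipe_tokens(ws: Sender, tokens: AsyncIterator[str]) -> int:
    """Forward each non-empty token as text.chunk, then send one text.done."""
    sent = 0
    async for token in tokens:
        if token:  # empty strings carry no text, so skip them
            await ws.send(json.dumps({"type": "text.chunk", "text": token}))
            sent += 1
    await ws.send(json.dumps({"type": "text.done"}))
    return sent


class RecordingSocket:
    """Stand-in for a WebSocket connection; records what would be sent."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    async def send(self, message: str) -> None:
        self.sent.append(json.loads(message))


async def fake_llm() -> AsyncIterator[str]:
    """Placeholder token source; swap in your model's streaming iterator."""
    for token in ["Hello, ", "", "world."]:
        yield token


async def demo() -> list[dict]:
    ws = RecordingSocket()
    await pipe_tokens(ws, fake_llm())
    return ws.sent
```

In a real client, pass the open `ws` from `websockets.connect(...)` in place of `RecordingSocket` and run the receive loop concurrently, so audio starts arriving while tokens are still being sent.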

Play audio while it’s still synthesizing

Hand each segment to your player as soon as segment.done arrives:
buffers: dict[int, bytearray] = {}
current_segment: int | None = None

async for msg in ws:
    if isinstance(msg, bytes):
        if current_segment is not None:
            buffers.setdefault(current_segment, bytearray()).extend(msg)
        continue

    frame = json.loads(msg)
    if frame["type"] == "segment.start":
        current_segment = frame["segment_id"]
    elif frame["type"] == "segment.done":
        sid = frame["segment_id"]
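        # `player` is a placeholder for your audio sink. The default guards
        # against a segment that arrived without any binary frames.
        player.enqueue(bytes(buffers.pop(sid, b"")))   # play this segment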
        current_segment = None
    elif frame["type"] == "session.done":
        break
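The `player` object in the loop above is left to you. One possible shape, sketched here under the assumption that playback runs as a separate asyncio task: `QueuePlayer` is a hypothetical class, and the `play` callable stands in for whatever actually writes to your audio device.

```python
import asyncio
from typing import Awaitable, Callable


class QueuePlayer:
    """Hypothetical player: enqueue() never blocks; run() plays segments in order."""

    def __init__(self, play: Callable[[bytes], Awaitable[None]]) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._play = play  # async callable that plays one segment's bytes

    def enqueue(self, audio: bytes) -> None:
        self._queue.put_nowait(audio)

    def close(self) -> None:
        self._queue.put_nowait(None)  # sentinel: no more segments

    async def run(self) -> None:
        # Play segments strictly in arrival order until the sentinel shows up.
        while (audio := await self._queue.get()) is not None:
            await self._play(audio)


async def demo() -> list[bytes]:
    played: list[bytes] = []

    async def play(audio: bytes) -> None:
        played.append(audio)  # a real player would write to an audio device here

    player = QueuePlayer(play)
    consumer = asyncio.create_task(player.run())
    player.enqueue(b"segment-0")
    player.enqueue(b"segment-1")
    player.close()
    await consumer
    return played
```

Decoupling the receive loop from playback this way means a slow audio device never stalls reads from the socket.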

Recover from a skipped segment

segment.skipped means TTS retries (3 by default, exponential backoff) were exhausted for that segment. The session keeps running — re-send the text in a new text.chunk if you need the audio:
if frame["type"] == "segment.skipped":
    await ws.send(json.dumps({"type": "text.chunk", "text": frame["text"]}))
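If the underlying engine stays unavailable, an unconditional re-send can loop: the re-sent text is skipped again, re-sent again, and so on. A minimal sketch of a cap on re-sends follows. `ResendBudget` is a hypothetical helper, and it keys on the text rather than segment_id on the assumption that re-sent text comes back as a new segment.

```python
from collections import Counter


class ResendBudget:
    """Allow each skipped text to be re-sent at most `max_resends` times."""

    def __init__(self, max_resends: int = 1) -> None:
        self.max_resends = max_resends
        self._attempts: Counter = Counter()  # text -> re-sends already used

    def should_resend(self, text: str) -> bool:
        if self._attempts[text] >= self.max_resends:
            return False
        self._attempts[text] += 1
        return True
```

In the receive loop, check `budget.should_resend(frame["text"])` before sending the new text.chunk, and log or surface the segment once the budget is spent.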

Word-level timestamps

Set "word_timestamps": true in session.start. When the timestamps are resolved successfully, segment.start carries a word_timestamps array:
{
  "type": "segment.start",
  "segment_id": 0,
  "text": "Hello, world.",
  "word_timestamps": [
    {"word": "Hello", "start": 0.04, "end": 0.32},
    {"word": "world", "start": 0.38, "end": 0.71}
  ]
}
Word-timestamp failures (timeout, 5xx, network) are silently swallowed; the segment is still delivered without the word_timestamps field. Treat it as best-effort — don’t block playback on it.
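A typical use is highlighting the word being spoken during playback. The sketch below looks up the active word for a playback position; `word_at` is a hypothetical helper, and it assumes the entries are sorted by start and that the times are seconds measured from the beginning of that segment's audio, which the example payload suggests but does not state.

```python
from bisect import bisect_right
from typing import Optional


def word_at(word_timestamps: list[dict], t: float) -> Optional[str]:
    """Return the word being spoken at time t (seconds), or None in a gap."""
    starts = [w["start"] for w in word_timestamps]
    i = bisect_right(starts, t) - 1  # last word whose start is <= t
    if i >= 0 and t < word_timestamps[i]["end"]:
        return word_timestamps[i]["word"]
    return None
```

Call it with `frame.get("word_timestamps", [])` so a segment delivered without timestamps simply yields None.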

Reference

The AsyncAPI spec above documents every message type and field. Quick lookup:

Close codes

Code  Reason
4400  Bad first frame, forbidden voice, or unsupported language.
4401  Missing or invalid API key.
4402  Insufficient credits.
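A small lookup makes these codes readable in logs. This is a sketch: `describe_close` is a hypothetical helper, and how you obtain the code depends on your client (recent versions of the websockets library expose the received close frame on the ConnectionClosed exception; check your client's documentation).

```python
from typing import Optional

# Application close codes documented for this endpoint.
CLOSE_REASONS = {
    4400: "Bad first frame, forbidden voice, or unsupported language.",
    4401: "Missing or invalid API key.",
    4402: "Insufficient credits.",
}


def describe_close(code: Optional[int]) -> str:
    """Map a WebSocket close code to a log-friendly message."""
    if code is None:
        return "Connection dropped without a close frame."
    if code == 1000:  # standard WebSocket code for a normal closure
        return "Normal close."
    return CLOSE_REASONS.get(code, f"Unrecognized close code {code}.")
```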

Auth & billing

  • API key auth is identical to the rest of /apis/*.
  • A TTS_API Run is created on session.start; its run_id is in session.ready and can be queried later via the standard run endpoints.
  • Credits are deducted per segment, immediately before that segment is synthesized. If you run out mid-session, the server emits a single session.error and closes with 4402.

Voice & language

Voice access uses the same rules as /tts-stream. The session is pinned to the mars-8.1-flash-beta speech model — see the streaming TTS docs for the supported BCP-47 locales. For best results, supply a reference voice in the same language/accent as language.

Server-side TTS retries

ConnectionError / TimeoutError / OSError / aiohttp.ClientError against the underlying TTS engine trigger up to 3 retries per segment with exponential backoff. On exhaustion the segment becomes segment.skipped (see Recover from a skipped segment above) and the rest of the session continues normally.
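
To make the retry behavior concrete, here is an illustration of "up to 3 retries with exponential backoff". It is not the server's code: the 0.5 s base delay and the doubling factor are assumptions (the real delays are not documented), `synthesize_with_retries` is a hypothetical name, and aiohttp.ClientError is left out to keep the sketch dependency-free.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# ConnectionError and TimeoutError are OSError subclasses; listed for clarity.
RETRYABLE = (ConnectionError, TimeoutError, OSError)


async def synthesize_with_retries(
    call: Callable[[], Awaitable[bytes]],
    max_retries: int = 3,
    base_delay: float = 0.5,  # assumed value, not documented
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[bytes]:
    """Return audio, or None once retries are exhausted (the 'skipped' case)."""
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return await call()
        except RETRYABLE:
            if attempt == max_retries:
                return None
            # Delay doubles each time: 0.5 s, 1 s, 2 s with the defaults.
            await sleep(base_delay * 2 ** attempt)
    return None
```

Other exception types propagate immediately, matching the idea that only transport-level failures are worth retrying.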

Messages

Session Accepted (session.ready)
Type: object

Sent immediately after session.start is accepted.

Segment Start (segment.start)
Type: object

Marks the beginning of a synthesized segment. Followed by one or more binary audio frames and then segment.done.

Binary Audio Frame
Type: string

Raw audio bytes for the current segment. Up to LIVE_TTS_AUDIO_FRAME_MAX_BYTES (default 65536) per frame.

Segment Done (segment.done)
Type: object

All audio for the current segment has been emitted.

Segment Skipped (segment.skipped)
Type: object

TTS retries were exhausted for this segment. The session continues; resend the text via text.chunk if needed.

Session Done (session.done)
Type: object

Pipeline drained, all segments emitted. Followed by a normal close.

Session Error (session.error)
Type: object

Fatal session-level error. Followed by a close with code 4400 / 4401 / 4402.

Start Session (session.start, first frame)
Type: object

Must be the very first message sent on the WebSocket. Configures the synthesis run.

Append Text (text.chunk)
Type: object

Push more text into the synthesis buffer. The server segments based on content, not chunk boundaries.

End of Input (text.done)
Type: object

Flush whatever is buffered and finish. Optional: the server also flushes after LIVE_TTS_IDLE_FLUSH_SECONDS (default 1s) of silence.