🚀 Introducing MARS8 Series — Four Powerful Variants | Available on All Major Clouds | Learn about the model here
Stream text in, receive synthesized speech audio + optional word-level timestamps in real time over a single WebSocket connection.
Bidirectional WebSocket endpoint for real-time text-to-speech. Push text as you have it; receive audio as it’s synthesized, in strict segment order. Designed for live captions, narration over streaming LLM output, and interactive voice apps: anywhere you want playback to start before the writer is finished writing.
Documentation Index
Fetch the complete documentation index at: https://docs.camb.ai/llms.txt
Use this file to discover all available pages before exploring further.
Endpoint: wss://client.camb.ai/apis/live-tts/ws
Authentication: x-api-key header (or ?api_key=... query parameter for clients that can’t set headers).
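For clients that cannot set headers, the query-parameter form could be assembled roughly as follows. This is a minimal sketch: the helper name `build_ws_url` is hypothetical, and only the endpoint URL and the `api_key` parameter name come from this page.

```python
from urllib.parse import urlencode

LIVE_TTS_URL = "wss://client.camb.ai/apis/live-tts/ws"


def build_ws_url(api_key: str, base_url: str = LIVE_TTS_URL) -> str:
    """Return the endpoint URL with the key passed as ?api_key=... (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    # urlencode escapes characters that are unsafe in a query string.
    return f"{base_url}?{urlencode({'api_key': api_key})}"
```

Where the client allows it, the header form is usually the safer choice, since query strings tend to show up in logs.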
import asyncio
import json

import websockets


async def synthesize(api_key: str, text: str, out_path: str = "out.mp3") -> None:
    url = "wss://client.camb.ai/apis/live-tts/ws"
    async with websockets.connect(
        url,
        additional_headers=[("x-api-key", api_key)],
    ) as ws:
        # 1. First frame: session config.
        await ws.send(json.dumps({
            "type": "session.start",
            "voice_id": 6460,
            "language": "en-us",
            "output_format": "mp3",
            "word_timestamps": True,
        }))

        # 2. Server confirms with session.ready.
        ready = json.loads(await ws.recv())
        assert ready["type"] == "session.ready", ready
        print(f"session {ready['session_id']} run_id={ready['run_id']}")

        # 3. Stream text. You can call text.chunk many times; the server
        #    segments based on content (and on a 1s idle flush).
        await ws.send(json.dumps({"type": "text.chunk", "text": text}))
        await ws.send(json.dumps({"type": "text.done"}))

        # 4. Receive ordered audio + json frames until the session ends.
        with open(out_path, "wb") as f:
            async for msg in ws:
                if isinstance(msg, bytes):
                    f.write(msg)
                    continue
                frame = json.loads(msg)
                kind = frame["type"]
                if kind == "segment.start":
                    print(f" segment {frame['segment_id']}: {frame['text']!r}")
                    for w in frame.get("word_timestamps", []):
                        print(f" {w['start']:6.2f}s → {w['end']:6.2f}s {w['word']}")
                elif kind == "segment.skipped":
                    print(f" ! skipped segment {frame['segment_id']}: {frame['text']!r}")
                elif kind == "session.done":
                    print("done")
                    break
                elif kind == "session.error":
                    raise RuntimeError(frame["error"])


asyncio.run(synthesize(
    api_key="your-camb-api-key",
    text="Hello, world. This is a streaming text-to-speech demo.",
))
Open the socket with your API key
async with websockets.connect(
    "wss://client.camb.ai/apis/live-tts/ws",
    additional_headers=[("x-api-key", "your-camb-api-key")],
) as ws:
    ...
A missing or invalid API key closes the connection with code 4401.
Send `session.start` as the first frame
{
  "type": "session.start",
  "voice_id": 6460,
  "language": "en-us",
  "output_format": "mp3",
  "word_timestamps": true,
  "enhance_named_entities_pronunciation": false,
  "apply_enhancement": null,
  "enhance_reference_audio_quality": false,
  "maintain_source_accent": false,
  "speaking_rate": null,
  "inference_steps": null
}
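A payload like the one above could be assembled programmatically. The sketch below is illustrative only: `build_session_start` and `OPTIONAL_FIELDS` are hypothetical names, the field list is copied from the example payload, and it assumes that leaving a field out has the same effect as sending `null`.

```python
import json

# Optional session.start fields, copied from the example payload on this page.
OPTIONAL_FIELDS = {
    "language", "output_format", "word_timestamps",
    "enhance_named_entities_pronunciation", "apply_enhancement",
    "enhance_reference_audio_quality", "maintain_source_accent",
    "speaking_rate", "inference_steps",
}


def build_session_start(voice_id: int, **options) -> str:
    """Serialize a session.start frame, rejecting unknown option names (hypothetical helper)."""
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown session.start fields: {sorted(unknown)}")
    frame = {"type": "session.start", "voice_id": voice_id}
    # Assumption: an omitted field and an explicit null both fall back to the default.
    frame.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(frame)
```

The returned string would be sent as the first frame, for example `await ws.send(build_session_start(6460, language="en-us"))`.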
voice_id is the only required field; everything else has a sensible default. The tuning knobs mirror the regular POST /tts-stream API one-for-one (enhance_named_entities_pronunciation, apply_enhancement, enhance_reference_audio_quality, maintain_source_accent, speaking_rate), so you can port a working /tts-stream payload directly. See the full reference at the top of the page for types and defaults.
Wait for the session.ready reply (carries session_id and run_id). A malformed first frame, forbidden voice, or unsupported language results in session.error followed by close code 4400.
Stream text in
{"type": "text.chunk", "text": "Hello, "}
{"type": "text.chunk", "text": "world."}
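Because the server segments on content rather than on chunk boundaries, text can be cut at arbitrary points before sending. The sketch below turns a string into text.chunk frames followed by a single text.done; `text_frames` and its `chunk_size` parameter are hypothetical names used for illustration.

```python
import json
from typing import Iterator


def text_frames(text: str, chunk_size: int = 32) -> Iterator[str]:
    """Yield JSON text.chunk frames covering `text`, then one text.done frame."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(text), chunk_size):
        yield json.dumps({"type": "text.chunk", "text": text[start:start + chunk_size]})
    # Only appropriate when no further text will follow in this session.
    yield json.dumps({"type": "text.done"})
```

Each yielded string would be passed to `ws.send(...)`; skip the final frame if more text is still to come.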
Don’t send text.done until the session is truly over.
Read ordered audio + lifecycle frames
segment.start N → <binary audio chunks> → segment.done N
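Concatenate the binary frames for a given segment_id and you have playable audio. When everything is done you’ll receive session.done, followed by a clean close.
When streaming LLM output, hold off on text.done while tokens are still arriving: let the idle flush handle in-flight buffering, then close when the LLM is done.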
async for token in llm_stream():
    await ws.send(json.dumps({"type": "text.chunk", "text": token}))

# LLM finished; flush any tail and end cleanly.
await ws.send(json.dumps({"type": "text.done"}))
Buffer each segment's audio and hand it to your player once segment.done arrives:
# `player` stands for your own audio sink; it is not defined in this snippet.
buffers: dict[int, bytearray] = {}
current_segment: int | None = None

async for msg in ws:
    if isinstance(msg, bytes):
        if current_segment is not None:
            buffers.setdefault(current_segment, bytearray()).extend(msg)
        continue
    frame = json.loads(msg)
    if frame["type"] == "segment.start":
        current_segment = frame["segment_id"]
    elif frame["type"] == "segment.done":
        sid = frame["segment_id"]
        # Fall back to empty bytes in case no audio frames arrived for this segment.
        player.enqueue(bytes(buffers.pop(sid, b"")))  # play this segment
        current_segment = None
    elif frame["type"] == "session.done":
        break
segment.skipped means TTS retries (3 by default, exponential backoff) were exhausted for that segment. The session keeps running — re-send the text in a new text.chunk if you need the audio:
if frame["type"] == "segment.skipped":
    await ws.send(json.dumps({"type": "text.chunk", "text": frame["text"]}))
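Re-sending unconditionally could loop if the same text keeps failing. One way to bound that is a per-text resend cap, sketched below; the `ResendBudget` class and the default limit of 2 are assumptions for illustration and are not part of the API.

```python
from collections import Counter


class ResendBudget:
    """Track how often each skipped text was re-sent (hypothetical helper)."""

    def __init__(self, max_resends: int = 2) -> None:
        self.max_resends = max_resends
        self._counts = Counter()

    def should_resend(self, text: str) -> bool:
        """Return True and record the attempt while the cap is not reached."""
        if self._counts[text] >= self.max_resends:
            return False
        self._counts[text] += 1
        return True
```

In the receive loop, the re-send would then be guarded with `budget.should_resend(frame["text"])` before pushing the text back as a new text.chunk.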
Set "word_timestamps": true in session.start. When resolution succeeds, segment.start carries a word_timestamps array:
{
  "type": "segment.start",
  "segment_id": 0,
  "text": "Hello, world.",
  "word_timestamps": [
    {"word": "Hello", "start": 0.04, "end": 0.32},
    {"word": "world", "start": 0.38, "end": 0.71}
  ]
}
When resolution fails, segment.start omits the word_timestamps field. Treat it as best-effort; don’t block playback on it.
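Since the timings are best-effort, caption code should tolerate a segment.start frame that has no word_timestamps. A minimal sketch, with `word_cues` as a hypothetical helper name and an output format chosen only for illustration:

```python
def word_cues(frame: dict) -> list[str]:
    """Format a segment.start frame's word timings as 'start-end word' lines."""
    # Fall back to an empty list when the field is absent or null.
    words = frame.get("word_timestamps") or []
    return [f"{w['start']:.2f}-{w['end']:.2f} {w['word']}" for w in words]
```

For the sample frame above this yields `0.04-0.32 Hello` and `0.38-0.71 world`; a frame without timings yields an empty list, so playback can proceed regardless.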
| Code | Reason |
|---|---|
| 4400 | Bad first frame, forbidden voice, or unsupported language. |
| 4401 | Missing or invalid API key. |
| 4402 | Insufficient credits. |
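The close codes in the table can be turned into readable messages on the client. A small sketch; `CLOSE_REASONS` and `describe_close` are hypothetical names, and the reasons are copied from the table.

```python
CLOSE_REASONS = {
    4400: "Bad first frame, forbidden voice, or unsupported language.",
    4401: "Missing or invalid API key.",
    4402: "Insufficient credits.",
}


def describe_close(code: int) -> str:
    """Map a WebSocket close code to the reason listed in the table."""
    return CLOSE_REASONS.get(code, f"Close code {code} is not listed in the table")
```

The code to pass in is whatever close code your WebSocket client reports when the connection ends.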
/apis/*.
A TTS_API Run is created on session.start; its run_id is in session.ready and can be queried later via the standard run endpoints.
On insufficient credits, the server sends session.error and closes with 4402.
/tts-stream. The session is pinned to the mars-8.1-flash-beta speech model; see the streaming TTS docs for the supported BCP-47 locales. For best results, supply a reference voice in the same language/accent as language.
ConnectionError / TimeoutError / OSError / aiohttp.ClientError against the underlying TTS engine trigger up to 3 retries per segment with exponential backoff. On exhaustion the segment becomes segment.skipped (see Recover from a skipped segment above) and the rest of the session continues normally.
{}
{
  "type": "<string>",
  "segment_id": 123
}
{
  "type": "<string>",
  "segment_id": 123,
  "text": "<string>"
}
{
  "type": "<string>"
}
{
  "type": "<string>",
  "error": "<string>"
}
{
  "type": "<string>",
  "voice_id": 123,
  "language": "<string>",
  "output_format": "<string>",
  "word_timestamps": true,
  "enhance_named_entities_pronunciation": true,
  "apply_enhancement": true,
  "enhance_reference_audio_quality": true,
  "maintain_source_accent": true,
  "speaking_rate": 123,
  "sample_rate": 123,
  "inference_steps": 123
}
{
  "type": "<string>",
  "text": "<string>",
  "index": 123
}
{
  "type": "<string>"
}
session.ready: Sent immediately after session.start is accepted.
segment.start: Marks the beginning of a synthesized segment. Followed by one or more binary audio frames and then segment.done.
Binary audio frame: Raw audio bytes for the current segment. Up to LIVE_TTS_AUDIO_FRAME_MAX_BYTES (default 65536) per frame.
segment.done: All audio for the current segment has been emitted.
segment.skipped: TTS retries were exhausted for this segment. The session continues; resend the text via text.chunk if needed.
session.done: Pipeline drained, all segments emitted. Followed by a normal close.
session.error: Fatal session-level error. Followed by a close with code 4400 / 4401 / 4402.
session.start: Must be the very first message sent on the WebSocket. Configures the synthesis run.
text.chunk: Push more text into the synthesis buffer. The server segments based on content, not chunk boundaries.
text.done: Flush whatever is buffered and finish. Optional; the server also flushes after LIVE_TTS_IDLE_FLUSH_SECONDS (default 1s) of silence.
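The server-to-client messages described on this page can be routed with a small classifier. This is a sketch under assumptions: `classify` and `SERVER_TYPES` are hypothetical names, and rejecting unknown types is a design choice made here, not documented server behavior.

```python
import json

# Server-to-client JSON frame types named on this page.
SERVER_TYPES = {
    "session.ready", "segment.start", "segment.done",
    "segment.skipped", "session.done", "session.error",
}


def classify(msg):
    """Return ("audio", bytes) for binary frames or (type, dict) for JSON frames."""
    if isinstance(msg, (bytes, bytearray)):
        return "audio", bytes(msg)
    frame = json.loads(msg)
    kind = frame.get("type")
    if kind not in SERVER_TYPES:
        # Strict by choice; a lenient client might log and ignore instead.
        raise ValueError(f"unexpected server frame type: {kind!r}")
    return kind, frame
```

A receive loop could call `classify(msg)` on every message and branch on the first element of the result.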