Live TTS (WebSocket)
Stream text in, receive synthesized speech audio + optional word-level timestamps in real time over a single WebSocket connection.
Protocol: WSS
Authentication: x-api-key header (or the ?api_key=... query parameter for clients that can’t set headers).
Quickstart
A complete, copy-pasteable client. Connect → configure → stream text → write the audio to a file.
Integration in 4 steps
Send `session.start` as the first frame
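A minimal sketch of this first frame. It assumes every JSON message names itself in a `type` field; the message envelope is not shown on this page, so treat that key as an assumption and check it against the AsyncAPI spec. `build_session_start` is a hypothetical helper, not part of any SDK.

```python
import json


def build_session_start(voice_id: str, **options) -> str:
    """Serialize a session.start frame.

    Options passed as None are left out so the server-side default applies.
    """
    if not voice_id:
        raise ValueError("voice_id is required")
    # ASSUMPTION: the message name travels in a "type" field.
    frame = {"type": "session.start", "voice_id": voice_id}
    frame.update({key: value for key, value in options.items() if value is not None})
    return json.dumps(frame)


# Only override what you need; speaking_rate=None is dropped from the payload.
first_frame = build_session_start(
    "my-voice-id",
    idle_timeout=2.5,
    word_timestamps=True,
    speaking_rate=None,
)
```

Because the option names match /tts-stream, an existing request body can be passed through as keyword arguments unchanged.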
voice_id is the only required field; everything else has a sensible default. The tuning knobs mirror the regular POST /tts-stream API one-for-one (enhance_named_entities_pronunciation, apply_enhancement, enhance_reference_audio_quality, maintain_source_accent, speaking_rate), so you can port a working /tts-stream payload directly. See the full reference at the top of the page for types and defaults.
Wait for the session.ready reply, which carries session_id and run_id. A malformed first frame, a forbidden voice, or an unsupported language produces session.error, followed by a close with code 4400.
Stream text in
Send text as text.chunk frames. Buffered text is flushed after idle_timeout seconds of silence (default 1.0), so for live use cases (LLM token stream, transcribed mic input) you don’t need to send text.done until the session is truly over.
idle_timeout is only a fallback flush for trailing fragments without a boundary. A complete sentence (terminal punctuation, paragraph break, etc.) is flushed immediately; it never waits on idle_timeout. Raise the value in session.start (e.g. 2.5) if your producer routinely stalls mid-sentence (slower LLMs, token-level jitter) to avoid splitting one sentence across two segments.
Read ordered audio + lifecycle frames
For each segment N, the server emits that segment’s frames in order. Segment N’s frames are completely emitted before any of segment N+1’s, even though synthesis runs concurrently behind the scenes. Concatenate the binary frames per segment_id and you have playable audio.
When everything is done you’ll receive session.done, followed by a clean close.
Common patterns
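As a baseline for the patterns in this section, the integration steps above can be combined into one small client. This is a sketch under several assumptions: it uses the third-party `websockets` package, the endpoint URL is passed in by the caller (this page does not state it), messages carry a `type` field, and text.chunk carries its payload in a `text` field. Whether audio from several segments can be appended into one playable file depends on the audio format, which is also not stated here.

```python
import asyncio
import json
from urllib.parse import urlencode


def with_api_key(url: str, api_key: str) -> str:
    """Append the api_key query parameter (the header-free auth option)."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{urlencode({'api_key': api_key})}"


async def synthesize(url: str, api_key: str, voice_id: str, text: str, out_path: str) -> None:
    import websockets  # third-party: pip install websockets

    async with websockets.connect(with_api_key(url, api_key)) as ws:
        # session.start must be the first frame on the connection.
        await ws.send(json.dumps({"type": "session.start", "voice_id": voice_id}))
        ready = json.loads(await ws.recv())
        if ready.get("type") != "session.ready":
            raise RuntimeError(f"session not started: {ready}")

        # Stream the text, then signal that no more text is coming.
        # ASSUMPTION: text.chunk carries its payload in a "text" field.
        await ws.send(json.dumps({"type": "text.chunk", "text": text}))
        await ws.send(json.dumps({"type": "text.done"}))

        # Binary frames are audio; text frames are JSON lifecycle events.
        with open(out_path, "wb") as out:
            async for message in ws:
                if isinstance(message, bytes):
                    out.write(message)
                    continue
                event = json.loads(message)
                if event.get("type") == "session.error":
                    raise RuntimeError(f"session failed: {event}")
                if event.get("type") == "session.done":
                    break


# Usage (URL and key are placeholders you must supply):
# asyncio.run(synthesize(URL, API_KEY, "my-voice-id", "Hello there.", "out.audio"))
```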
Stream from an LLM
Push tokens straight from the model. Don’t call text.done; let the idle flush handle in-flight buffering, then close when the LLM is done.
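A sketch of that token pump, written against an injected `send` coroutine so it works with any WebSocket client (for example, pass `ws.send` from the `websockets` package). The `type` and `text` field names are assumptions, and `pump_tokens` is a hypothetical helper.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


async def pump_tokens(
    tokens: AsyncIterator[str],
    send: Callable[[str], Awaitable[None]],
) -> int:
    """Forward each non-empty token as a text.chunk frame; return how many were sent.

    No text.done is sent here: sentence boundaries and the idle flush decide
    when buffered text is synthesized.
    """
    sent = 0
    async for token in tokens:
        if token:  # skip empty deltas that some model APIs emit
            # ASSUMPTION: "type"/"text" are the field names for text.chunk.
            await send(json.dumps({"type": "text.chunk", "text": token}))
            sent += 1
    return sent


async def _demo():
    async def fake_llm():
        for token in ["Hello", ", ", "", "world", "."]:
            yield token

    frames = []

    async def fake_send(frame: str) -> None:
        frames.append(frame)

    count = await pump_tokens(fake_llm(), fake_send)
    return count, frames


count, frames = asyncio.run(_demo())
```

Run the pump and the receive loop as separate tasks so that audio is consumed while tokens are still arriving.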
Play audio while it’s still synthesizing
Hand each segment to your player as soon as segment.done arrives.
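One way to do this is a generator that buffers binary frames and yields a finished segment whenever segment.done shows up. The sketch relies on the ordering guarantee described earlier (a segment's frames are never interleaved with another's) and assumes JSON events carry `type` and `segment_id` fields; `completed_segments` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator, Tuple, Union


def completed_segments(frames: Iterable[Union[str, bytes]]) -> Iterator[Tuple[object, bytes]]:
    """Yield (segment_id, audio_bytes) each time a segment finishes."""
    current_id = None
    buffer = bytearray()
    for frame in frames:
        if isinstance(frame, bytes):
            buffer.extend(frame)  # audio for the segment currently in flight
            continue
        event = json.loads(frame)
        kind = event.get("type")  # ASSUMPTION: message name lives in "type"
        if kind == "segment.start":
            current_id = event.get("segment_id")
            buffer.clear()
        elif kind == "segment.done":
            yield event.get("segment_id", current_id), bytes(buffer)
            buffer.clear()
        elif kind == "segment.skipped":
            buffer.clear()  # nothing playable for this segment
        elif kind in ("session.done", "session.error"):
            return


demo_frames = [
    json.dumps({"type": "segment.start", "segment_id": 0}),
    b"ab",
    b"cd",
    json.dumps({"type": "segment.done", "segment_id": 0}),
    json.dumps({"type": "segment.start", "segment_id": 1}),
    b"ef",
    json.dumps({"type": "segment.done", "segment_id": 1}),
    json.dumps({"type": "session.done"}),
]
for segment_id, audio in completed_segments(demo_frames):
    pass  # hand `audio` to your player here
```

With an async WebSocket client the same state machine applies inside an `async for` loop over incoming messages.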
Recover from a skipped segment
segment.skipped means TTS retries (3 by default, exponential backoff) were exhausted for that segment. The session keeps running; re-send the text in a new text.chunk if you need the audio.
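A sketch of the re-send step. It assumes, without confirmation from this page, that the segment.skipped event echoes the affected text in a `text` field; if the spec shows otherwise, you would need to recover the text from your own record of what was sent. `resend_frame_for` is a hypothetical helper.

```python
import json
from typing import Optional


def resend_frame_for(event_json: str) -> Optional[str]:
    """Return a text.chunk frame that retries a skipped segment, or None."""
    event = json.loads(event_json)
    if event.get("type") != "segment.skipped":
        return None
    # ASSUMPTION: the skipped event includes the segment's text.
    text = event.get("text")
    if not text:
        return None
    return json.dumps({"type": "text.chunk", "text": text})


retry = resend_frame_for(
    json.dumps({"type": "segment.skipped", "segment_id": 3, "text": "Try this again."})
)
```

Cap the number of re-sends per piece of text in real code, since a segment that keeps failing would otherwise loop indefinitely.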
Word-level timestamps
Set "word_timestamps": true in session.start. When resolution succeeds, segment.start carries a word_timestamps array. When it does not, don’t rely on the word_timestamps field being present. Treat it as best-effort; don’t block playback on it.
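A sketch of reading the field defensively, so playback code never depends on it. The shape of the individual entries is not shown on this page, so the helper returns them untouched; the `type` field and the helper name are assumptions.

```python
import json


def word_timestamps_or_empty(segment_start_json: str) -> list:
    """Return the word_timestamps array from a segment.start event, or []."""
    event = json.loads(segment_start_json)
    stamps = event.get("word_timestamps")
    # Missing, null, or otherwise unusable values all collapse to an empty list.
    return stamps if isinstance(stamps, list) else []


without_field = word_timestamps_or_empty(
    json.dumps({"type": "segment.start", "segment_id": 0})
)
```

An empty result then simply means no captions or highlighting for that segment, while the audio plays as usual.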
Reference
The AsyncAPI spec above documents every message type and field. Quick lookup:
Close codes
| Code | Reason |
|---|---|
| 4400 | Bad first frame, forbidden voice, or unsupported language. |
| 4401 | Missing or invalid API key. |
| 4402 | Insufficient credits. |
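On the client, the table can be mirrored as a small lookup for logging and reconnect decisions. This is a sketch: the helper names are hypothetical, and the view that none of these three codes is fixed by reconnecting unchanged is an inference from the reasons listed, not a documented rule.

```python
CLOSE_REASONS = {
    4400: "Bad first frame, forbidden voice, or unsupported language.",
    4401: "Missing or invalid API key.",
    4402: "Insufficient credits.",
}


def describe_close(code: int) -> str:
    """Map a WebSocket close code to the reason documented above."""
    return CLOSE_REASONS.get(code, f"Unrecognized close code {code}.")


def needs_caller_fix(code: int) -> bool:
    """True when the caller must change something (payload, key, credits) first."""
    return code in CLOSE_REASONS


message = describe_close(4402)
```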
Auth & billing
- API key auth is identical to the rest of /apis/*.
- A TTS_API Run is created on session.start; its run_id is in session.ready and can be queried later via the standard run endpoints.
- Credits are deducted per segment, immediately before that segment is synthesized. If you run out mid-session, the server emits a single session.error and closes with 4402.
Voice & language
Voice access uses the same rules as /tts-stream. The session is pinned to the mars-8.1-flash-beta speech model; see the streaming TTS docs for the supported BCP-47 locales. For best results, supply a reference voice in the same language/accent as language.
Server-side TTS retries
ConnectionError / TimeoutError / OSError / aiohttp.ClientError against the underlying TTS engine trigger up to 3 retries per segment with exponential backoff. On exhaustion the segment becomes segment.skipped (see Recover from a skipped segment above) and the rest of the session continues normally.
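To make the policy concrete, here is an illustrative model of "up to 3 retries with exponential backoff". It is not the server's code: the base delay and the doubling factor are assumptions, and aiohttp.ClientError is left out to keep the sketch stdlib-only.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# ConnectionError and TimeoutError are subclasses of OSError in Python 3.
RETRYABLE = (OSError,)


def call_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,  # ASSUMPTION: the real base delay is not documented
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn once, then retry up to `retries` times; None means exhausted."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == retries:
                return None  # the segment.skipped case
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s with the defaults
    return None


delays = []
result = call_with_retries(lambda: "audio", sleep=delays.append)
```

Injecting `sleep` keeps the sketch testable without real waiting.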