Stream Text-to-Speech Audio
Convert text to speech in real-time with customizable voice characteristics, delivering audio content as it’s generated for immediate playback in your applications.
How the Streaming Process Works
Our streaming service is designed for simplicity and speed. Here’s how it works from request to playback:
- Submit Your Text & Configuration
- Receive the Audio Stream
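The two steps above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the official client: BASE_URL is a hypothetical placeholder (the host is not given in this document), and the voice profile ID is passed through the extra argument because its field name is not shown here. The /tts-stream path, the x-api-key header, and the text, language, speech_model, and output_configuration fields come from this page.

```python
import json
import urllib.request
from typing import BinaryIO, Dict, Iterable, Iterator, Optional

# Hypothetical placeholder: the real API host is not given in this document.
BASE_URL = "https://api.example.com"


def build_request(api_key: str, text: str, language: str, speech_model: str,
                  extra: Optional[Dict] = None) -> urllib.request.Request:
    """Step 1: package the text and configuration as a JSON POST to /tts-stream."""
    payload = {
        "text": text,
        "language": language,
        "speech_model": speech_model,
        "output_configuration": {"format": "wav"},
    }
    # Extra fields (for example the voice profile ID) are merged in as given.
    payload.update(extra or {})
    return urllib.request.Request(
        BASE_URL + "/tts-stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def iter_chunks(stream: BinaryIO, chunk_size: int = 8192) -> Iterator[bytes]:
    """Step 2: yield audio bytes as they arrive instead of waiting for the full body."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write each chunk to disk as it arrives; return the number of bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    return written


def stream_to_file(request: urllib.request.Request, path: str) -> int:
    """Send the request and save the streamed audio to the given path."""
    with urllib.request.urlopen(request) as response:
        return save_stream(iter_chunks(response), path)
```

Calling stream_to_file(build_request(...), "speech.wav") performs both steps. A player could consume iter_chunks directly instead of writing to disk, which is where streaming pays off for immediate playback.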
Language Support
The language field takes a BCP-47 locale code (e.g. en-us, hi-in, zh-cn). It controls the accent and pronunciation of the generated speech. The model does not translate the input text, so the text you supply should already be written in the target language.
Coverage by model
| Speech model | Locales supported |
|---|---|
| mars-flash, mars-pro | 33 |
| mars-8.1-flash-beta, mars-8.1-pro-beta | 312 |
| mars-instruct | 141 |
Choosing a locale
- Use the most specific regional variant available for the accent you want. For example, prefer es-mx over es-es for a Mexican Spanish accent, or zh-cn-sichuan over zh-cn for a Sichuan-flavored Mandarin.
- For best results on the MARS 8.1 beta models, supply a reference voice in the same language and accent as the target locale.
- Codes are case-sensitive lowercase (e.g. pt-br, not pt-BR).
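The locale guidance above can be applied on the client before sending a request. The following is a minimal sketch; normalize_locale and pick_locale are hypothetical helper names, and the supported set in the usage note is an illustrative subset rather than the full list for any model.

```python
from typing import Iterable, Optional, Set


def normalize_locale(code: str) -> str:
    """Trim whitespace, use hyphens as separators, and lowercase the code."""
    return code.strip().replace("_", "-").lower()


def pick_locale(preferred: Iterable[str], supported: Set[str]) -> Optional[str]:
    """Return the first preferred locale the model supports, or None.

    List the most specific variant first so it wins when available.
    """
    for code in preferred:
        candidate = normalize_locale(code)
        if candidate in supported:
            return candidate
    return None
```

For example, pick_locale(["zh-CN-sichuan", "zh-CN"], {"zh-cn", "en-us"}) falls back to zh-cn because the Sichuan variant is not in that set.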
Validation behavior
If the requested language is not supported by the selected speech_model, the API responds with HTTP 422 and a ValidationError body that lists the allowed locales for that model.
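One way a client might surface this failure is sketched below using the standard library's urllib. The exact JSON shape of the ValidationError body is not shown in this document, so this sketch returns the body as text instead of parsing fields out of it; validation_details is a hypothetical helper name.

```python
import urllib.error
from typing import Optional


def validation_details(error: urllib.error.HTTPError) -> Optional[str]:
    """Return the response body for an HTTP 422, or None for any other status."""
    if error.code != 422:
        return None
    # Assumption: the body is UTF-8 text. Its field layout is not documented
    # here, so it is returned unparsed for logging or display.
    return error.read().decode("utf-8", errors="replace")
```

A caller would wrap urllib.request.urlopen in a try block, catch urllib.error.HTTPError, and pass the exception to validation_details to see which locales the model accepts.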
Advanced Customization
Fine-tune the audio with additional parameters to control the performance, style, and quality of the generated speech. These are sent in the same JSON payload.
- speech_model: Specify the model for synthesis. Available values include mars-8.1-flash-beta, mars-8.1-pro-beta, mars-flash, mars-pro, and mars-instruct.
- Expressive text tags: With mars-instruct, you can also embed delivery tags directly in the text (for example, emotion tags or SSML-style pauses) to shape pacing and tone.
- output_configuration: Set the audio format (wav, mp3), sample rate, and toggle output enhancement.
  - apply_enhancement (boolean, optional): Applies output audio enhancement (loudness, denoising, polish). Defaults to true for most models, false for the speed-oriented mars-flash and mars-8.1-flash-beta models. Set explicitly to override.
- voice_settings: Enhance reference audio quality, maintain the source accent, or adjust the speaking rate.
- inference_options: Adjust stability, temperature, and speaker similarity for unique results.
The mars-8.1-flash-beta and mars-8.1-pro-beta models do not support the following parameters:
- acoustic_quality_boost
- temperature
- speaker_similarity
- maintain_source_accent
- stability
- output_enhancement
- enhance_named_entities_pronunciation
- localize_speaker_weight
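If the same payload-building code serves several models, the unsupported parameters can be filtered out before sending. This is a minimal sketch: strip_unsupported is a hypothetical helper, and it removes the listed keys at any nesting depth because this document does not spell out where each parameter sits in the payload. Whether the API rejects or ignores these parameters is also not stated here.

```python
import copy
from typing import Any, Dict

BETA_MODELS = {"mars-8.1-flash-beta", "mars-8.1-pro-beta"}

# Parameter names taken from the list above.
UNSUPPORTED_ON_BETA = {
    "acoustic_quality_boost",
    "temperature",
    "speaker_similarity",
    "maintain_source_accent",
    "stability",
    "output_enhancement",
    "enhance_named_entities_pronunciation",
    "localize_speaker_weight",
}


def _drop_keys(value: Any) -> Any:
    """Recursively copy dictionaries, leaving out unsupported keys."""
    if isinstance(value, dict):
        return {k: _drop_keys(v) for k, v in value.items()
                if k not in UNSUPPORTED_ON_BETA}
    return value


def strip_unsupported(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload that is safe for the selected model.

    Payloads for other models are copied unchanged.
    """
    if payload.get("speech_model") not in BETA_MODELS:
        return copy.deepcopy(payload)
    return _drop_keys(payload)
```

The input payload is never modified, so the original settings can be reused when switching back to a model that supports them.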
MARS 8.1 Beta Text Controls
The mars-8.1-flash-beta and mars-8.1-pro-beta models support inline controls for English pronunciation and expressive non-verbal sounds. Add these controls directly in the text field.
Pronunciation Control (English)
Use CMU pronunciation dictionary phonemes in uppercase, wrapped in brackets, to override default English pronunciations.
Non-verbal Symbols
Insert supported tags directly in the text to add expressive non-verbal sounds. Supported tags: [laughter], [sigh], [confirmation], [question], [surprise], [dissatisfaction].
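Because both kinds of control share the bracket syntax, a quick client-side check can catch typos before a request is sent. The sketch below classifies each bracketed span as a non-verbal tag, a phoneme sequence, or unknown. It is an assumption-laden helper, not part of the API: the phoneme pattern only checks the general shape of CMU dictionary symbols (uppercase letters with an optional stress digit 0 to 2) and does not verify that each symbol exists in the dictionary.

```python
import re
from typing import List, Tuple

# Tag names taken from the list above.
NON_VERBAL_TAGS = {
    "laughter", "sigh", "confirmation", "question", "surprise", "dissatisfaction",
}

# Assumption: a phoneme is uppercase letters plus an optional stress digit.
PHONEME = re.compile(r"[A-Z]+[0-2]?")
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def classify_controls(text: str) -> List[Tuple[str, str]]:
    """Return (content, kind) for every bracketed span in the text.

    kind is "non-verbal", "phonemes", or "unknown".
    """
    results = []
    for match in BRACKETED.finditer(text):
        body = match.group(1).strip()
        tokens = body.split()
        if body in NON_VERBAL_TAGS:
            kind = "non-verbal"
        elif tokens and all(PHONEME.fullmatch(token) for token in tokens):
            kind = "phonemes"
        else:
            kind = "unknown"
        results.append((body, kind))
    return results
```

Any span reported as unknown is worth a second look, since a lowercase phoneme or a misspelled tag would otherwise be sent as-is.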
Expressive Text Tags (mars-instruct)
You can directly convey expression in the input text by adding short tags for delivery. For a deeper guide to emotion tag intensity, see the Emotion Tag Gradation Guide.
- [speaking slowly] You need to understand this. It is very important. We should do this the right way.
- [angry] You need to understand this! It is very important, we should do this the right way!
- [gentle, reassuring] Take a deep breath. You're doing well. Let's go step by step.
- Please pause here <break time="500ms"/> then continue in a calm, clear tone.
Tips For Best Results:
- For texts with numbers, expand the numbers into words. For example, write “123” as “one hundred twenty three” or “one two three”, whichever reading you need.
- For code-switched sentences, transliterate the text into your chosen TTS language.
- Both of the steps above can be handled by a small LLM.
- To adjust pacing or approximate length, use voice_settings.speaking_rate. The streaming TTS endpoint does not support a duration parameter.
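For the digit-by-digit reading (“one two three”), a small preprocessing step is enough and needs no LLM. The following is a minimal sketch with a hypothetical helper name, spell_digits. It handles plain runs of ASCII digits only; cardinal readings such as “one hundred twenty three”, decimals, dates, and currency are out of scope.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(text: str) -> str:
    """Replace every run of ASCII digits with its digits spelled out one by one."""
    def replace(match: "re.Match[str]") -> str:
        return " ".join(DIGIT_WORDS[int(digit)] for digit in match.group(0))
    return re.sub(r"[0-9]+", replace, text)
```

Running it on "Room 123 is open" yields "Room one two three is open".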
Output format support by model
Supported output_configuration.format values depend on the selected speech_model:
| Speech Model | Supported output formats |
|---|---|
| mars-8.1-flash-beta | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-8.1-pro-beta | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-flash | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-pro | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-instruct | wav, flac, adts, pcm_s16le, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
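The table can be mirrored in a small lookup so that an unsupported combination is caught before the request is sent. This is a sketch that copies the table as it stands on this page; is_format_supported is a hypothetical helper name, and the data would need updating if the service's support matrix changes.

```python
from typing import Dict, FrozenSet

_FULL_SET = frozenset({
    "wav", "mp3", "flac", "adts",
    "pcm_s16le", "pcm_s16be", "pcm_s32be", "pcm_s32le",
    "pcm_f32le", "pcm_f32be",
})

SUPPORTED_FORMATS: Dict[str, FrozenSet[str]] = {
    "mars-8.1-flash-beta": _FULL_SET,
    "mars-8.1-pro-beta": _FULL_SET,
    "mars-flash": _FULL_SET,
    "mars-pro": _FULL_SET,
    # Per the table, mars-instruct lacks mp3 and pcm_s16be.
    "mars-instruct": frozenset({
        "wav", "flac", "adts",
        "pcm_s16le", "pcm_s32be", "pcm_s32le",
        "pcm_f32le", "pcm_f32be",
    }),
}


def is_format_supported(speech_model: str, audio_format: str) -> bool:
    """Return True if the table lists the format for the model.

    Unknown model names return False.
    """
    return audio_format in SUPPORTED_FORMATS.get(speech_model, frozenset())
```

For instance, is_format_supported("mars-instruct", "mp3") is False, while the same format is accepted for mars-pro.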
Example: Real-time Audio Streaming
This example shows how to call the endpoint and save the incoming audio stream to a file.
SDK Example: Async Streaming
Streaming vs. Asynchronous: Which to Choose?
Select the right tool for your job by understanding the key differences between our TTS endpoints.
- Use Streaming
- Use Asynchronous
Authorizations
The x-api-key is a custom header required for authenticating requests to our API. Include this header in your request with the appropriate API key value to securely access our endpoints. You can find your API key(s) in the 'API' section of our studio website.
Body
Streaming Text-to-Speech request parameters.
Request body for /tts-stream.
The text to synthesize into speech (3–3000 characters). For mars-8.1-flash-beta and mars-8.1-pro-beta, you can include inline controls such as CMU phonemes ([B EY1 S]) and non-verbal tags ([laughter]).
Length: 3 - 3000 characters
Example: "[laughter] He plays the [B EY1 S] guitar while catching a [B AE1 S] fish."
The language of the input text. Pass a locale tag (en-us, fr-fr, es-es). Numeric language IDs (1 or "1") still work but are deprecated. See all source languages.
ro-ro, nl-nl, es-es, zh-tw, en-uk, el-gr, cs-cz, vi-vn, bn-bd, ar-tn, de-de, fr-ca, ar-xa, th-th, ar-eg, ar-sa, ar-sy, pa-in, zh-cn, ar-jo, ru-ru, bn-in, uk-ua, es-us, ja-jp, ar-ae, mr-in, en-au, de-ch, pt-pt, ar-kw, ar-qa, as-in, hi-in, fr-be, fi-fi, fr-fr, ar-dz, fr-ch, it-it, de-at, en-in, ko-kr, en-us, zh-hk, ar-om, ar-ma, pl-pl, ar-ly, es-mx, tr-tr, ar-iq, ar-lb, ml-in, pt-br, id-id, ar-bh, kn-in, nl-be, te-in, ar-ye, ta-in, af-za, am-et, az-az, bg-bg, bs-ba, ca-es, cy-gb, da-dk, en-ca, en-gb, en-hk, en-ie, en-ke, en-ng, en-nz, en-ph, en-sg, en-tz, en-za, es-ar, es-bo, es-cl, es-co, es-cr, es-cu, es-do, es-ec, es-gq, es-gt, es-hn, es-ni, es-pa, es-pe, es-pr, es-py, es-sv, es-uy, es-ve, et-ee, eu-es, fa-ir, fil-ph, ga-ie, gl-es, gu-in, he-il, hr-hr, hu-hu, hy-am, is-is, jv-id, ka-ge, kk-kz, km-kh, lo-la, lt-lt, lv-lv, mk-mk, mn-mn, ms-my, mt-mt, my-mm, nb-no, ps-af, si-lk, sk-sk, sl-si, so-so, sq-al, sr-rs, sv-se, sw-ke, sw-tz, ta-lk, ta-my, ta-sg, ur-in, ur-pk, uz-uz, zh-cn-henan, zh-cn-liaoning, zh-cn-shaanxi, zh-cn-shandong, zh-cn-sichuan, zu-za, sa-in, tl-ph, es-xl, or-in, mai-in, sd-in, kok-in, mni-in, ks-in, doi-in, brx-in, sat-in, yue-hk, no-no, rw-rw, be-by, eo-xx, kab-dz, lg-ug, ug-cn, mhr-ru, ba-ru, ars-sa, npi-np, ckb-iq, dgo-in, knn-in, kbd-ru, ary-ma, afb-kw, bo-cn, fy-nl, kmr-xx, ab-ge, adx-cn, ky-kg, kln-xx, dv-mv, luo-xx, ady-ru, mrj-xx, tt-ru, ltg-xx, br-fr, phr-xx, cv-ru, arz-eg, gui-bo, acw-sa, acx-xx, orc-xx, mvy-xx, aeb-xx, gjk-xx, phl-xx, odk-xx, lus-xx, ayl-xx, fue-ne, hno-xx, kxp-xx, brh-pk, plt-xx, gbm-in, bmm-xx, rof-xx, ydd-xx, mi-nz, ln-cd, xmv-mg, tkg-xx, ha-ng, nan-xx, bzc-xx, oc-fr, oru-xx, ksf-cm, an-es, bft-pk, nnh-xx, sah-xx, pms-xx, lij-xx, yo-ng, vro-xx, apc-xx, khw-xx, uzn-xx, fui-cm, bnm-cm, trw-xx, fuc-xx, kam-xx, msh-xx, ff-sn, fuf-xx, pwn-xx, ig-xx, ext-es, tok-xx, ia-xx, xh-za, scn-xx, koo-xx, fub-cm, plk-xx, ewo-cm, nso-xx, gby-ng, gdf-ng, ceb-ph, gwt-af, kw-gb, bhr-mg, dua-cm, gbr-ng, txy-xx, mxu-xx, kna-ng, kfp-xx, 
its-xx, haw-us, tcy-xx, bjn-id, wbl-xx, pbt-xx, xmw-xx, szy-xx, xmf-xx, twu-xx, nlv-xx, qxw-xx, pst-af, wji-xx, fat-gh, bhh-il, tlp-mx, ldb-ng, ndi-xx, elm-ng, jns-xx, hwo-xx, bbl-ge, afo-ng, abb-cm, jal-xx, idu-xx, bew-id, qup-xx, noe-xx, byc-xx, bug-id, btm-id, hia-xx, deg-ng, kvx-xx, pcm-xx, ijn-xx, ala-ng, pbu-xx, bjj-xx, cjk-ao, mrr-xx, qvi-xx, bgp-pk, bag-xx
Example: "en-us"
Voice profile ID to use for synthesis. Get available IDs from /list-voices.
x >= 1147320
Speech model variant to use for synthesis. Use mars-8.1-flash-beta or mars-8.1-pro-beta to leverage inline pronunciation and non-verbal controls in text.
mars-8.1-flash-beta, mars-8.1-pro-beta, mars-flash, mars-pro, mars-instruct
Example: "mars-8.1-flash-beta"
If true, improves pronunciation of names, brands, and other named entities.
true
Controls output format and enhancement options for the stream.
{ "format": "wav" }
Voice behavior preferences such as accent preservation and reference enhancement.
{
  "enhance_reference_audio_quality": false,
  "maintain_source_accent": false,
  "speaking_rate": 1.5
}
Response
Streaming audio response
Binary audio stream in WAV format.