Convert text to speech in real time with customizable voice characteristics, delivering audio content as it's generated for immediate playback in your applications.
Submit Your Text & Configuration
Receive the Audio Stream
Manage Playback & Usage
X-Credits-Required header included in the response.
- speech_model: Specify the model for synthesis.
- user_instructions: Guide the voice's delivery (e.g., "Warm, clear, and conversational"). Only supported with mars-instruct.
- output_configuration: Set the audio format (for example, wav or mp3) and apply enhancements.
- voice_settings: Enhance reference audio quality or maintain the source accent.
- inference_options: Adjust stability, temperature, and speaker similarity for unique results.

user_instructions is only supported when speech_model is set to mars-instruct.
output_configuration.format values depend on the selected speech_model:
| Speech Model | Supported output formats |
|---|---|
| mars-pro | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-flash | wav, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-instruct | wav, flac, adts, pcm_s16le, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
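The model and format constraints above can be checked on the client before sending a request. The following is a minimal sketch, not part of the API: the function name `check_request_options` and the dictionary `SUPPORTED_FORMATS` are hypothetical helpers, and the format sets are copied from the table as listed.

```python
# Supported output formats per speech model, copied from the table above.
_COMMON = {"wav", "flac", "adts", "pcm_s16le", "pcm_s32be",
           "pcm_s32le", "pcm_f32le", "pcm_f32be"}

SUPPORTED_FORMATS = {
    "mars-pro": _COMMON | {"mp3", "pcm_s16be"},
    "mars-flash": _COMMON | {"pcm_s16be"},
    "mars-instruct": set(_COMMON),
}


def check_request_options(speech_model, output_format, user_instructions=None):
    """Return a list of problems; an empty list means the combination looks valid."""
    formats = SUPPORTED_FORMATS.get(speech_model)
    if formats is None:
        return [f"unknown speech_model: {speech_model!r}"]

    problems = []
    if output_format not in formats:
        problems.append(
            f"format {output_format!r} is not listed for {speech_model}"
        )
    # user_instructions is only supported with mars-instruct.
    if user_instructions is not None and speech_model != "mars-instruct":
        problems.append("user_instructions requires speech_model 'mars-instruct'")
    return problems


if __name__ == "__main__":
    print(check_request_options("mars-flash", "mp3"))
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid option at once.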
The x-api-key is a custom header required for authenticating requests to our API. Include this header in your request with the appropriate API key value to securely access our endpoints. You can find your API key(s) in the 'API' section of our studio website.
Streaming Text-to-Speech request parameters.
Request body for /tts-stream.
The text to synthesize into speech (3 to 3000 characters).
3 - 3000
"Jupiter, the largest planet in our solar system, is a gas giant with swirling storms."
BCP-47 locale for the input text (for example, en-us).
ro-ro, nl-nl, es-es, zh-tw, en-uk, el-gr, cs-cz, vi-vn, bn-bd, ar-tn, de-de, fr-ca, ar-xa, th-th, ar-eg, ar-sa, ar-sy, pa-in, zh-cn, ar-jo, ru-ru, bn-in, uk-ua, es-us, ja-jp, ar-ae, mr-in, en-au, de-ch, pt-pt, ar-kw, ar-qa, as-in, hi-in, fr-be, fi-fi, fr-fr, ar-dz, fr-ch, it-it, de-at, en-in, ko-kr, en-us, zh-hk, ar-om, ar-ma, pl-pl, ar-ly, es-mx, tr-tr, ar-iq, ar-lb, ml-in, pt-br, id-id, ar-bh, kn-in, nl-be, te-in, ar-ye, ta-in "en-us"
Voice profile ID to use for synthesis. Get available IDs from /list-voices.
x >= 1147320
Speech model variant to use for synthesis.
mars-pro, mars-flash, mars-instruct "mars-pro"
Optional guidance for style, tone, pronunciation, or delivery.
3 - 1000
If true, improves pronunciation of names, brands, and other named entities.
true
Controls output format and enhancement options for the stream.
{
  "format": "wav",
  "duration": null,
  "apply_enhancement": true
}
Voice behavior preferences such as accent preservation and reference enhancement.
{
  "enhance_reference_audio_quality": false,
  "maintain_source_accent": false
}
Model sampling controls that trade off stability, variation, and latency.
{
  "stability": 0.6,
  "temperature": 0.8,
  "inference_steps": 60,
  "speaker_similarity": 0.7
}
Streaming audio response
Binary audio stream in WAV format.
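Putting the request body and the streamed response together, a client might look like the sketch below. It uses only the Python standard library and rests on several assumptions this page does not confirm: `BASE_URL` is a placeholder host, the HTTP method is assumed to be POST, and the field names `text`, `language`, and `voice_id` are guesses for the text, locale, and voice profile parameters. The nested objects reuse the example values shown above; `build_payload` and `stream_to_file` are hypothetical helper names.

```python
import json
import urllib.request

# Placeholder host: this page names only the /tts-stream path.
BASE_URL = "https://api.example.com"


def build_payload(text, voice_id, language="en-us", speech_model="mars-pro"):
    """Assemble a request body using the example values from this page."""
    if not 3 <= len(text) <= 3000:
        raise ValueError("text must be 3 to 3000 characters long")
    return {
        # Assumed field names; the page describes these fields without naming them.
        "text": text,
        "language": language,
        "voice_id": voice_id,
        "speech_model": speech_model,
        "output_configuration": {
            "format": "wav",
            "duration": None,
            "apply_enhancement": True,
        },
        "voice_settings": {
            "enhance_reference_audio_quality": False,
            "maintain_source_accent": False,
        },
        "inference_options": {
            "stability": 0.6,
            "temperature": 0.8,
            "inference_steps": 60,
            "speaker_similarity": 0.7,
        },
    }


def stream_to_file(payload, api_key, out_path, chunk_size=4096):
    """Send the request (assumed POST) and write audio chunks as they arrive."""
    request = urllib.request.Request(
        BASE_URL + "/tts-stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        # The usage header is read from the response, if present.
        credits = response.headers.get("X-Credits-Required")
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return credits
```

A call such as `stream_to_file(build_payload("Hello, Jupiter.", 1147320), api_key, "out.wav")` would write the stream to disk; in a playback scenario, each chunk could instead be handed to an audio player as soon as it is read. The voice ID here is illustrative only and should come from /list-voices.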