POST /tts-stream
Stream Text-to-Speech Audio
curl --request POST \
  --url https://client.camb.ai/apis/tts-stream \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "text": "Jupiter, the largest planet in our solar system, is a gas giant with swirling storms.",
  "language": "en-us",
  "voice_id": 147320
}
'
This endpoint streams text-to-speech audio as it is generated. Your application can begin playback as soon as the first chunk arrives instead of waiting for a complete audio file, which keeps latency low for interactive experiences.

How the Streaming Process Works

Our streaming service is designed for simplicity and speed. Here’s how it works from request to playback:
1. Submit Your Text & Configuration

Send a POST request containing your text and desired audio configuration, including the voice, language, and output format.

2. Receive the Audio Stream

The server immediately begins processing and sends audio data back in chunks over the same connection. Your application can start playing the audio as soon as the first chunk arrives.

3. Manage Playback & Usage

Continue reading the byte stream until the connection closes, which signals the end of the audio. You can also monitor real-time usage via the X-Credits-Required header included in the response.
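The steps above can be sketched as a small helper that copies the stream into any writable binary sink and reports the usage header. This is a minimal sketch, not an official client: `save_stream` is a hypothetical name, the `response` argument is assumed to behave like a `requests.Response` opened with `stream=True`, and the header value is returned as a raw string because its format is not described here.

```python
from typing import BinaryIO, Optional, Tuple


def save_stream(response, sink: BinaryIO, chunk_size: int = 1024) -> Tuple[int, Optional[str]]:
    """Copy a streamed response body into `sink`, chunk by chunk.

    Returns (bytes_written, value of the X-Credits-Required header or None).
    """
    # Response headers arrive before the body, so usage info can be read up front.
    credits = response.headers.get("X-Credits-Required")
    written = 0
    # The loop ends when the server closes the connection (end of audio).
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # skip empty chunks
            sink.write(chunk)
            written += len(chunk)
    return written, credits
```

For live playback, the same loop could hand each chunk to an audio player instead of a file; that part depends on the playback library and is left out.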

Configuring Your Audio Stream

Tailor the voice and audio output to fit your exact needs with these configuration options.

Advanced Customization

Fine-tune the audio with additional parameters to control the performance, style, and quality of the generated speech. These are sent in the same JSON payload.
  • speech_model: Specify the model for synthesis.
  • user_instructions: Guide the voice’s delivery (e.g., “Warm, clear, and conversational”). Only supported with mars-instruct.
  • output_configuration: Set the audio format (for example, wav or mp3; the supported formats depend on the speech model) and apply enhancements.
  • voice_settings: Enhance reference audio quality or maintain the source accent.
  • inference_options: Adjust stability, temperature, and speaker similarity for unique results.
user_instructions is only supported when speech_model is set to mars-instruct.
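The user_instructions constraint can be checked on the client before a request is sent. The following is a minimal sketch under assumptions: `build_payload` is a hypothetical helper, not part of any official SDK, and how the server responds to a mismatched combination is not stated here.

```python
from typing import Optional


def build_payload(
    text: str,
    language: str,
    voice_id: int,
    speech_model: str = "mars-pro",  # documented default for speech_model
    user_instructions: Optional[str] = None,
) -> dict:
    """Build a /tts-stream request body, rejecting unsupported combinations early."""
    # user_instructions is only supported with the mars-instruct model.
    if user_instructions is not None and speech_model != "mars-instruct":
        raise ValueError("user_instructions requires speech_model='mars-instruct'")
    payload = {
        "text": text,
        "language": language,
        "voice_id": voice_id,
        "speech_model": speech_model,
    }
    if user_instructions is not None:
        payload["user_instructions"] = user_instructions
    return payload
```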

Output format support by model

Supported output_configuration.format values depend on the selected speech_model:
| Speech Model | Supported output formats |
| --- | --- |
| mars-pro | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-flash | wav, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-instruct | wav, flac, adts, pcm_s16le, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
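The format support listed above can be mirrored in a lookup table so that an unsupported combination is caught before the request goes out. This is a sketch only: `SUPPORTED_FORMATS` and `is_format_supported` are hypothetical names, and the values are copied from the list above, so they need updating if the API's support changes.

```python
# Formats shared by every model listed above.
_COMMON = {"wav", "flac", "adts", "pcm_s16le", "pcm_s32be", "pcm_s32le", "pcm_f32le", "pcm_f32be"}

SUPPORTED_FORMATS = {
    "mars-pro": _COMMON | {"mp3", "pcm_s16be"},
    "mars-flash": _COMMON | {"pcm_s16be"},
    "mars-instruct": set(_COMMON),
}


def is_format_supported(speech_model: str, audio_format: str) -> bool:
    """Return True if `audio_format` is listed for `speech_model`."""
    # Unknown models map to an empty set, so the result is False.
    return audio_format in SUPPORTED_FORMATS.get(speech_model, set())
```

For example, `is_format_supported("mars-flash", "mp3")` is False, since mp3 is listed only for mars-pro.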

Example: Real-time Audio Streaming

This example shows how to call the endpoint and save the incoming audio stream to a file.
import requests

payload = {
    "text": "Jupiter, the largest planet in our solar system, is a gas giant with swirling storms like the iconic Great Red Spot.",
    "language": "en-us",
    "voice_id": 147320,
    "speech_model": "mars-instruct",
    "user_instructions": "Warm, clear, and conversational.",
    "enhance_named_entities_pronunciation": True,
    "output_configuration": {
        "format": "wav",
        "duration": None,
        "apply_enhancement": True
    },
    "voice_settings": {
        "enhance_reference_audio_quality": False,
        "maintain_source_accent": False
    },
    "inference_options": {
        "stability": 0.6,
        "temperature": 0.8,
        "inference_steps": 60,
        "speaker_similarity": 0.7
    }
}

headers = {
    "x-api-key": "your-api-key"
}

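# stream=True makes requests return once the headers arrive,
# so the body can be read incrementally instead of being buffered in memory.
response = requests.post(
    "https://client.camb.ai/apis/tts-stream",
    json=payload,
    headers=headers,
    stream=True
)

# Raise an exception for 4xx/5xx responses before reading any audio.
response.raise_for_status()

# Write each chunk to disk as it arrives; the loop ends when the server closes the stream.
with open("output.wav", "wb") as audio_file:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # skip empty chunks
            audio_file.write(chunk)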

print("✨ Stream complete. Audio saved to output.wav")

Streaming vs. Asynchronous: Which to Choose?

Select the right tool for your job by understanding the key differences between our TTS endpoints.

Use Streaming

Ideal for real-time, interactive experiences where immediate audio feedback is crucial.

Use Asynchronous

Perfect for non-real-time tasks, long-form content, or when you need to retrieve a complete audio file later.

Authorizations

x-api-key
string
header
required

The x-api-key is a custom header required for authenticating requests to our API. Include this header in your request with the appropriate API key value to securely access our endpoints. You can find your API key(s) in the 'API' section of our studio website.
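To avoid hardcoding the key in source files, the header can be built from an environment variable. This is a minimal sketch: the variable name `CAMB_API_KEY` and the helper `build_headers` are assumptions made for illustration, not names defined by the API.

```python
import os
from typing import Optional


def build_headers(api_key: Optional[str] = None) -> dict:
    """Return request headers carrying the API key.

    Falls back to the CAMB_API_KEY environment variable (a hypothetical name)
    when no key is passed explicitly.
    """
    key = api_key or os.environ.get("CAMB_API_KEY")
    if not key:
        raise RuntimeError("No API key provided and CAMB_API_KEY is not set")
    return {"x-api-key": key}
```

The returned dictionary can be passed as the `headers` argument of `requests.post`.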

Body

application/json

Streaming Text-to-Speech request parameters.

Request body for /tts-stream.

text
string
required

The text to synthesize into speech (3–3000 characters).

Required string length: 3 - 3000
Example:

"Jupiter, the largest planet in our solar system, is a gas giant with swirling storms."

language
enum<string>
required

BCP-47 locale for the input text (for example, en-us).

Available options:
ro-ro,
nl-nl,
es-es,
zh-tw,
en-uk,
el-gr,
cs-cz,
vi-vn,
bn-bd,
ar-tn,
de-de,
fr-ca,
ar-xa,
th-th,
ar-eg,
ar-sa,
ar-sy,
pa-in,
zh-cn,
ar-jo,
ru-ru,
bn-in,
uk-ua,
es-us,
ja-jp,
ar-ae,
mr-in,
en-au,
de-ch,
pt-pt,
ar-kw,
ar-qa,
as-in,
hi-in,
fr-be,
fi-fi,
fr-fr,
ar-dz,
fr-ch,
it-it,
de-at,
en-in,
ko-kr,
en-us,
zh-hk,
ar-om,
ar-ma,
pl-pl,
ar-ly,
es-mx,
tr-tr,
ar-iq,
ar-lb,
ml-in,
pt-br,
id-id,
ar-bh,
kn-in,
nl-be,
te-in,
ar-ye,
ta-in
Example:

"en-us"

voice_id
integer
required

Voice profile ID to use for synthesis. Get available IDs from /list-voices.

Required range: x >= 1
Example:

147320
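The constraints on the three required fields can be checked locally before calling the endpoint. The sketch below is illustrative: `validate_required_fields` is a hypothetical helper, and it assumes the 3–3000 limit counts characters the same way Python's `len` does, which is not confirmed here. It does not check `language` against the full list of available options.

```python
def validate_required_fields(text: str, language: str, voice_id: int) -> list:
    """Return a list of problems found in the required fields (empty if none)."""
    problems = []
    # text: required string length 3 - 3000
    if not 3 <= len(text) <= 3000:
        problems.append("text must be 3 to 3000 characters")
    # language: required; membership in the option list is not checked here
    if not language:
        problems.append("language is required")
    # voice_id: integer, required range x >= 1 (bool is excluded explicitly)
    if isinstance(voice_id, bool) or not isinstance(voice_id, int) or voice_id < 1:
        problems.append("voice_id must be an integer >= 1")
    return problems
```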

speech_model
enum<string>
default:mars-pro

Speech model variant to use for synthesis.

Available options:
mars-pro,
mars-flash,
mars-instruct
Example:

"mars-pro"

user_instructions
string | null

Optional guidance for style, tone, pronunciation, or delivery. Only supported when speech_model is set to mars-instruct.

Required string length: 3 - 1000

enhance_named_entities_pronunciation
boolean
default:false

If true, improves pronunciation of names, brands, and other named entities.

Example:

true

output_configuration
StreamTTSOutputConfiguration · object

Controls output format and enhancement options for the stream.

Example:
{
"format": "wav",
"duration": null,
"apply_enhancement": true
}

voice_settings
StreamTTSVoiceSettings · object

Voice behavior preferences such as accent preservation and reference enhancement.

Example:
{
"enhance_reference_audio_quality": false,
"maintain_source_accent": false
}

inference_options
StreamTTSInferenceOptions · object

Model sampling controls that trade off stability, variation, and latency.

Example:
{
"stability": 0.6,
"temperature": 0.8,
"inference_steps": 60,
"speaker_similarity": 0.7
}

Response

Streaming audio response

Binary audio stream in WAV format.