POST /tts-stream
Stream Text-to-Speech Audio
curl --request POST \
  --url https://client.camb.ai/apis/tts-stream \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "text": "[laughter] He plays the [B EY1 S] guitar while catching a [B AE1 S] fish.",
  "language": "en-us",
  "voice_id": 147320,
  "speech_model": "mars-8.1-flash-beta",
  "output_configuration": {
    "format": "wav"
  },
  "voice_settings": {
    "enhance_reference_audio_quality": false,
    "speaking_rate": 1.5
  }
}
'



How the Streaming Process Works

Our streaming service is designed for simplicity and speed. Here’s how it works from request to playback:
1. Submit Your Text & Configuration

Send a POST request containing your text and desired audio configuration, including the voice, language, and output format.
2. Receive the Audio Stream

The server immediately begins processing and sends audio data back in chunks over the same connection. Your application can start playing the audio as soon as the first chunk arrives.
3. Manage Playback & Usage

Continue reading the byte stream until the connection closes, which signals the end of the audio. You can also monitor real-time usage via the X-Credits-Required header included in the response.
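
The three steps above can be sketched in Python without committing to a particular HTTP client. The helper names `drain_stream` and `credits_required` are illustrative and not part of any Camb AI SDK. `chunks` is assumed to be any iterable of byte strings (for example, the result of `response.iter_content(...)` in `requests`), and the header value is returned as a raw string because its format is not described on this page.

```python
import io
from typing import BinaryIO, Iterable, Mapping, Optional


def drain_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write every non-empty chunk to `sink` and return the byte count.

    The loop ends when the iterable is exhausted, which corresponds to
    the server closing the connection at the end of the audio.
    """
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            sink.write(chunk)
            total += len(chunk)
    return total


def credits_required(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Credits-Required header value, if present.

    Plain dicts are case-sensitive, so the name is compared in lowercase.
    """
    for name, value in headers.items():
        if name.lower() == "x-credits-required":
            return value
    return None


# Simulated response: three chunks followed by end of stream.
buffer = io.BytesIO()
written = drain_stream([b"RIFF", b"", b"data"], buffer)
```

In a real client, `sink` could be a file opened in binary mode or the input of an audio player, so playback can begin before the stream ends.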

Language Support

The language field takes a BCP-47 locale code (e.g. en-us, hi-in, zh-cn). It controls the accent and pronunciation of the generated speech — the model does not translate the input text, so the text you supply should already be written in the target language.

Coverage by model

| Speech model | Locales supported |
| --- | --- |
| mars-flash, mars-pro | 33 |
| mars-8.1-flash-beta, mars-8.1-pro-beta | 158 |
| mars-instruct | 141 |
See the full per-model locale list in Language Support.

Choosing a locale

  • Use the most specific regional variant available for the accent you want. For example, prefer es-mx over es-es for a Mexican Spanish accent, or zh-cn-sichuan over zh-cn for a Sichuan-flavored Mandarin.
  • For best results on the MARS 8.1 beta models, supply a reference voice in the same language and accent as the target locale.
  • Codes are case-sensitive and must be written in lowercase (e.g. pt-br, not pt-BR).
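
Because codes must be lowercase, it can help to normalize user-supplied locales before building a request. The following is a minimal sketch. `normalize_locale` is a hypothetical helper, and converting underscores to hyphens is an added convenience rather than documented API behavior.

```python
def normalize_locale(code: str) -> str:
    """Trim whitespace, use hyphens as separators, and lowercase the code."""
    return code.strip().replace("_", "-").lower()


print(normalize_locale("pt-BR"))  # → pt-br
```

Normalizing does not guarantee that the locale is supported by the selected model. It only removes casing as a source of validation errors.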

Validation behavior

If the requested language is not supported by the selected speech_model, the API responds with HTTP 422 and a ValidationError body that lists the allowed locales for that model. Example:
{
  "detail": [{
    "loc": ["body"],
    "msg": "Value error, Language 'zh-tw' is not supported for speech model 'mars-flash'. Allowed languages are: ['en-us', 'en-in', 'zh-cn', ...]"
  }]
}
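
A client may want to recover the allowed locales from such a response, for example to suggest a fallback. The sketch below assumes the message wording matches the example above, which is not a documented contract and may change. `allowed_locales_from_error` is a hypothetical helper.

```python
import re
from typing import List


def allowed_locales_from_error(body: dict) -> List[str]:
    """Collect the locale codes quoted after 'Allowed languages are:'."""
    locales: List[str] = []
    for item in body.get("detail", []):
        msg = item.get("msg", "")
        # partition() returns an empty marker when the phrase is absent.
        _, marker, tail = msg.partition("Allowed languages are:")
        if marker:
            locales.extend(re.findall(r"'([^']+)'", tail))
    return locales


example = {
    "detail": [{
        "loc": ["body"],
        "msg": "Value error, Language 'zh-tw' is not supported for speech model "
               "'mars-flash'. Allowed languages are: ['en-us', 'en-in', 'zh-cn', ...]",
    }]
}
print(allowed_locales_from_error(example))  # → ['en-us', 'en-in', 'zh-cn']
```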

Advanced Customization

Fine-tune the audio with additional parameters to control the performance, style, and quality of the generated speech. These are sent in the same JSON payload.
  • speech_model: Specify the model for synthesis. Available values include mars-8.1-flash-beta, mars-8.1-pro-beta, mars-flash, mars-pro, and mars-instruct.
  • Expressive text tags: With mars-instruct, you can also embed delivery tags directly in the text (for example, emotion tags or SSML-style pauses) to shape pacing and tone.
  • output_configuration: Set the audio format (for example wav or mp3; the supported formats depend on the model), sample rate, and toggle output enhancement.
    • apply_enhancement (boolean, optional): Applies output audio enhancement (loudness, denoising, polish). Defaults to true for most models, false for the speed-oriented mars-flash and mars-8.1-flash-beta models. Set explicitly to override.
  • voice_settings: Enhance reference audio quality, maintain the source accent, or adjust the speaking rate.
  • inference_options: Adjust stability, temperature, and speaker similarity for unique results.
The mars-8.1-flash-beta and mars-8.1-pro-beta models do not support the following parameters:
  • acoustic_quality_boost
  • temperature
  • speaker_similarity
  • maintain_source_accent
  • stability
  • output_enhancement
  • enhance_named_entities_pronunciation
  • localize_speaker_weight
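
One way to keep a shared payload template usable across models is to drop these fields before sending a request to a MARS 8.1 beta model. This is a sketch under assumptions. The page does not say whether the API rejects or ignores the fields, nor where each one sits in the payload, so the hypothetical `prune_payload` helper removes them at any nesting level.

```python
UNSUPPORTED_ON_MARS_8_1_BETA = frozenset({
    "acoustic_quality_boost",
    "temperature",
    "speaker_similarity",
    "maintain_source_accent",
    "stability",
    "output_enhancement",
    "enhance_named_entities_pronunciation",
    "localize_speaker_weight",
})

MARS_8_1_BETA_MODELS = frozenset({"mars-8.1-flash-beta", "mars-8.1-pro-beta"})


def _strip_keys(value, blocked):
    """Return a copy of `value` without any dict keys listed in `blocked`."""
    if isinstance(value, dict):
        return {k: _strip_keys(v, blocked) for k, v in value.items() if k not in blocked}
    if isinstance(value, list):
        return [_strip_keys(v, blocked) for v in value]
    return value


def prune_payload(payload: dict) -> dict:
    """Copy `payload`, removing unsupported fields for MARS 8.1 beta models."""
    # speech_model defaults to mars-8.1-flash-beta when omitted.
    model = payload.get("speech_model", "mars-8.1-flash-beta")
    blocked = UNSUPPORTED_ON_MARS_8_1_BETA if model in MARS_8_1_BETA_MODELS else frozenset()
    return _strip_keys(payload, blocked)
```

The function returns a new dictionary and leaves the input untouched, so the same template can still be sent unchanged to `mars-instruct` or the other models.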

MARS 8.1 Beta Text Controls

The mars-8.1-flash-beta and mars-8.1-pro-beta models support inline controls for English pronunciation and expressive non-verbal sounds. Add these controls directly in the text field.

Pronunciation Control (English)

Use CMU pronunciation dictionary phonemes in uppercase, wrapped in brackets, to override default English pronunciations.
payload = {
    "text": "He plays the [B EY1 S] guitar while catching a [B AE1 S] fish.",
    "language": "en-us",
    "voice_id": 147320,
    "speech_model": "mars-8.1-flash-beta"
}
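
When pronunciations come from a lookup table rather than being typed by hand, a small helper can assemble the bracketed form. The sketch below is illustrative. `phoneme_tag` is a hypothetical name, and the check only verifies the general shape of a CMU symbol (uppercase letters with an optional stress digit 0, 1, or 2), not membership in the CMU phoneme set.

```python
import re

# Shape of one CMU dictionary symbol, e.g. "B", "EY1", "AE1".
_PHONEME_SHAPE = re.compile(r"^[A-Z]+[0-2]?$")


def phoneme_tag(phonemes):
    """Join phoneme symbols into the bracketed form used in the text field."""
    if not phonemes:
        raise ValueError("At least one phoneme is required.")
    for symbol in phonemes:
        if not _PHONEME_SHAPE.match(symbol):
            raise ValueError(f"Not an uppercase CMU-style symbol: {symbol!r}")
    return "[" + " ".join(phonemes) + "]"


text = f"He plays the {phoneme_tag(['B', 'EY1', 'S'])} guitar."
print(text)  # → He plays the [B EY1 S] guitar.
```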

Non-verbal Symbols

Insert supported tags directly in the text to add expressive non-verbal sounds.
payload = {
    "text": "[laughter] You really got me. I didn't see that coming at all.",
    "language": "en-us",
    "voice_id": 147320,
    "speech_model": "mars-8.1-flash-beta"
}
Supported tags: [laughter], [sigh], [confirmation], [question], [surprise], [dissatisfaction].
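
A quick pre-flight check can catch misspelled tags before a request is sent. This sketch assumes that any bracketed, all-lowercase word is intended as a non-verbal tag. Uppercase phoneme brackets such as [B EY1 S] are deliberately left alone. `unsupported_tags` is a hypothetical helper.

```python
import re

SUPPORTED_TAGS = frozenset({
    "laughter", "sigh", "confirmation",
    "question", "surprise", "dissatisfaction",
})

# Matches a single lowercase word in brackets, e.g. "[sigh]".
_TAG = re.compile(r"\[([a-z]+)\]")


def unsupported_tags(text: str) -> list:
    """Return bracketed lowercase tags that are not in the supported list."""
    return [tag for tag in _TAG.findall(text) if tag not in SUPPORTED_TAGS]


print(unsupported_tags("[laughter] He plays the [B EY1 S] guitar. [giggle]"))  # → ['giggle']
```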

Expressive Text Tags (mars-instruct)

You can directly convey expression in the input text by adding short tags for delivery. For a deeper guide to emotion tag intensity, see the Emotion Tag Gradation Guide.
  • [speaking slowly] You need to understand this. It is very important. We should do this the right way.
  • [angry] You need to understand this! It is very important, we should do this the right way!
  • [gentle, reassuring] Take a deep breath. You're doing well. Let's go step by step.
  • Please pause here <break time="500ms"/> then continue in a calm, clear tone.
Keep tags short and place them near the sentence you want to influence.
For comprehensive examples and best practices, see the Emotional Voice Control tutorial.

Tips For Best Results:

  • For text that contains numbers, expand the numbers into words. For example, write “123” as “one hundred twenty three” or “one two three”, depending on how it should be read.
  • For code-switched sentences, transliterate the text into your chosen TTS language. A small LLM can handle both number expansion and transliteration.
  • To adjust pacing or approximate length, use voice_settings.speaking_rate. The streaming TTS endpoint does not support a duration parameter.
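
For simple English cases, number expansion can also be done with a few lines of code instead of an LLM. The sketch below is a simplified illustration that handles whole numbers only. The function names are hypothetical, cardinals are limited to 0 through 999, and longer digit runs fall back to digit-by-digit reading.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Spell out an integer from 0 to 999 as a cardinal number."""
    if not 0 <= n <= 999:
        raise ValueError("cardinal() supports 0 through 999 only")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + (" " + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return _ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")


def digit_by_digit(digits: str) -> str:
    """Read a digit string one digit at a time."""
    return " ".join(_ONES[int(d)] for d in digits)


def expand_numbers(text: str, as_digits: bool = False) -> str:
    """Replace each run of ASCII digits in `text` with words."""
    def repl(match):
        digits = match.group(0)
        if as_digits or len(digits) > 3:
            return digit_by_digit(digits)
        return cardinal(int(digits))
    return re.sub(r"[0-9]+", repl, text)


print(expand_numbers("Room 123"))                  # → Room one hundred twenty three
print(expand_numbers("Room 123", as_digits=True))  # → Room one two three
```

Decimals, ordinals, currency, and dates are not handled here. Those cases are where an LLM or a dedicated text-normalization library is more appropriate.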

Output format support by model

Supported output_configuration.format values depend on the selected speech_model:
| Speech model | Supported output formats |
| --- | --- |
| mars-8.1-flash-beta | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-8.1-pro-beta | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-flash | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-pro | wav, mp3, flac, adts, pcm_s16le, pcm_s16be, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
| mars-instruct | wav, flac, adts, pcm_s16le, pcm_s32be, pcm_s32le, pcm_f32le, pcm_f32be |
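
The format list above can be mirrored in client code to catch an unsupported combination before a request is sent. This is a minimal sketch that copies the listed values as they appear on this page. `check_format` is a hypothetical helper, and the mapping would need updating whenever the supported formats change.

```python
_FULL = frozenset({
    "wav", "mp3", "flac", "adts",
    "pcm_s16le", "pcm_s16be", "pcm_s32be", "pcm_s32le",
    "pcm_f32le", "pcm_f32be",
})

SUPPORTED_FORMATS = {
    "mars-8.1-flash-beta": _FULL,
    "mars-8.1-pro-beta": _FULL,
    "mars-flash": _FULL,
    "mars-pro": _FULL,
    # mars-instruct is listed without mp3 and pcm_s16be.
    "mars-instruct": _FULL - {"mp3", "pcm_s16be"},
}


def check_format(speech_model: str, fmt: str) -> None:
    """Raise ValueError if `fmt` is not listed for `speech_model`."""
    try:
        allowed = SUPPORTED_FORMATS[speech_model]
    except KeyError:
        raise ValueError(f"Unknown speech model: {speech_model!r}") from None
    if fmt not in allowed:
        raise ValueError(
            f"Format {fmt!r} is not listed for {speech_model!r}. "
            f"Listed formats: {sorted(allowed)}"
        )


check_format("mars-flash", "mp3")  # passes silently
```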

Example: Real-time Audio Streaming

This example shows how to call the endpoint and save the incoming audio stream to a file.
import requests

payload = {
    "text": "Jupiter, the largest planet in our solar system, is a gas giant with swirling storms like the iconic Great Red Spot.",
    "language": "en-us",
    "voice_id": 147320,
    "speech_model": "mars-instruct",
    "enhance_named_entities_pronunciation": True,
    "output_configuration": {
        "format": "wav"
    },
    "voice_settings": {
        "enhance_reference_audio_quality": False,
        "maintain_source_accent": False,
        "speaking_rate": 1.0
    },
    "inference_options": {
        "inference_steps": 60,
    }
}

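# Authenticate with your API key (see Authorizations).
headers = {
    "x-api-key": "your-api-key"
}

# stream=True defers the body download so chunks can be read as they arrive.
response = requests.post(
    "https://client.camb.ai/apis/tts-stream",
    json=payload,
    headers=headers,
    stream=True
)

# Fail fast on 4xx/5xx responses (for example, a 422 validation error).
response.raise_for_status()

# Write each chunk to disk as soon as it is received.
with open("output.wav", "wb") as audio_file:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            audio_file.write(chunk)

print("✨ Stream complete. Audio saved to output.wav")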

SDK Example: Async Streaming

import asyncio
from camb.client import AsyncCambAI, save_async_stream_to_file
from camb.types.stream_tts_output_configuration import StreamTtsOutputConfiguration
from camb.types.stream_tts_voice_settings import StreamTtsVoiceSettings

# Initialize the async client
client = AsyncCambAI(api_key="your-api-key")

async def main():
    # Stream the TTS generation
    response = client.text_to_speech.tts(
        text="Experience high quality realistic sounds with Camb AI.",
        language="en-us",
        speech_model="mars-8.1-flash-beta",
        voice_id=147320,  # example voice ID from this page; replace with your own
        voice_settings=StreamTtsVoiceSettings(
            speaking_rate=1.0
        ),
        output_configuration=StreamTtsOutputConfiguration(
            format="wav"
        )
    )
    
    # Save the stream to a file (or process chunks as they arrive)
    await save_async_stream_to_file(response, "async_stream_output.wav")
    print("Audio stream saved to async_stream_output.wav")

if __name__ == "__main__":
    asyncio.run(main())

Streaming vs. Asynchronous: Which to Choose?

Select the right tool for your job by understanding the key differences between our TTS endpoints.

Use Streaming

Ideal for real-time, interactive experiences where immediate audio feedback is crucial.

Use Asynchronous

Perfect for non-real-time tasks, long-form content, or when you need to retrieve a complete audio file later.

Authorizations

x-api-key
string
header
required

The x-api-key is a custom header required for authenticating requests to our API. Include this header in your request with the appropriate API key value to securely access our endpoints. You can find your API key(s) in the 'API' section of our studio website.
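
To avoid hard-coding the key in source files, the header can be built from an environment variable. This is a minimal sketch. The variable name `CAMB_API_KEY` and the `build_headers` helper are assumptions made for illustration, not names defined by the API or SDK.

```python
import os


def build_headers(env_var: str = "CAMB_API_KEY") -> dict:
    """Build request headers using an API key read from the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of an HTTP client call such as `requests.post`.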

Body

application/json

Streaming Text-to-Speech request parameters.

Request body for /tts-stream.

text
string
required

The text to synthesize into speech (3–3000 characters). For mars-8.1-flash-beta and mars-8.1-pro-beta, you can include inline controls such as CMU phonemes ([B EY1 S]) and non-verbal tags ([laughter]).

Required string length: 3 - 3000
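
Text longer than 3000 characters has to be sent as several requests. The sketch below shows one possible way to split it, preferring sentence boundaries. `split_text` is a hypothetical helper, and the sentence detection is a simple punctuation heuristic. Pieces shorter than the 3-character minimum are not merged automatically.

```python
import re
from typing import List

MAX_CHARS = 3000  # upper bound of the documented 3-3000 range


def split_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split `text` into pieces of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # `current` is non-empty here, otherwise candidate would fit.
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each piece can then be sent as its own request, with the resulting audio played or concatenated in order.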
Example:

"[laughter] He plays the [B EY1 S] guitar while catching a [B AE1 S] fish."

language
enum<string>
default:en-us
required

BCP-47 locale for the input text (for example, en-us).

Available options:
ro-ro,
nl-nl,
es-es,
zh-tw,
en-uk,
el-gr,
cs-cz,
vi-vn,
bn-bd,
ar-tn,
de-de,
fr-ca,
ar-xa,
th-th,
ar-eg,
ar-sa,
ar-sy,
pa-in,
zh-cn,
ar-jo,
ru-ru,
bn-in,
uk-ua,
es-us,
ja-jp,
ar-ae,
mr-in,
en-au,
de-ch,
pt-pt,
ar-kw,
ar-qa,
as-in,
hi-in,
fr-be,
fi-fi,
fr-fr,
ar-dz,
fr-ch,
it-it,
de-at,
en-in,
ko-kr,
en-us,
zh-hk,
ar-om,
ar-ma,
pl-pl,
ar-ly,
es-mx,
tr-tr,
ar-iq,
ar-lb,
ml-in,
pt-br,
id-id,
ar-bh,
kn-in,
nl-be,
te-in,
ar-ye,
ta-in,
af-za,
am-et,
az-az,
bg-bg,
bs-ba,
ca-es,
cy-gb,
da-dk,
en-ca,
en-gb,
en-hk,
en-ie,
en-ke,
en-ng,
en-nz,
en-ph,
en-sg,
en-tz,
en-za,
es-ar,
es-bo,
es-cl,
es-co,
es-cr,
es-cu,
es-do,
es-ec,
es-gq,
es-gt,
es-hn,
es-ni,
es-pa,
es-pe,
es-pr,
es-py,
es-sv,
es-uy,
es-ve,
et-ee,
eu-es,
fa-ir,
fil-ph,
ga-ie,
gl-es,
gu-in,
he-il,
hr-hr,
hu-hu,
hy-am,
is-is,
jv-id,
ka-ge,
kk-kz,
km-kh,
lo-la,
lt-lt,
lv-lv,
mk-mk,
mn-mn,
ms-my,
mt-mt,
my-mm,
nb-no,
ps-af,
si-lk,
sk-sk,
sl-si,
so-so,
sq-al,
sr-rs,
sv-se,
sw-ke,
sw-tz,
ta-lk,
ta-my,
ta-sg,
ur-in,
ur-pk,
uz-uz,
zh-cn-henan,
zh-cn-liaoning,
zh-cn-shaanxi,
zh-cn-shandong,
zh-cn-sichuan,
zu-za,
sa-in,
tl-ph,
es-xl,
or-in,
mai-in,
sd-in,
kok-in,
mni-in,
ks-in,
doi-in,
brx-in,
sat-in
Example:

"en-us"

voice_id
integer
default:147320
required

Voice profile ID to use for synthesis. Get available IDs from /list-voices.

Required range: x >= 1
Example:

147320

speech_model
enum<string>
default:mars-8.1-flash-beta

Speech model variant to use for synthesis. Use mars-8.1-flash-beta or mars-8.1-pro-beta to leverage inline pronunciation and non-verbal controls in text.

Available options:
mars-8.1-flash-beta,
mars-8.1-pro-beta,
mars-flash,
mars-pro,
mars-instruct
Example:

"mars-8.1-flash-beta"

enhance_named_entities_pronunciation
boolean
default:false

If true, improves pronunciation of names, brands, and other named entities.

Example:

true

output_configuration
StreamTTSOutputConfiguration · object

Controls output format and enhancement options for the stream.

Example:
{ "format": "wav" }
voice_settings
StreamTTSVoiceSettings · object

Voice behavior preferences such as accent preservation and reference enhancement.

Example:
{
"enhance_reference_audio_quality": false,
"maintain_source_accent": false,
"speaking_rate": 1.5
}

Response

Streaming audio response

Binary audio stream in the format requested via output_configuration.format (WAV in the examples on this page).