The MARS 8 family consists of zero-shot, multilingual Text-to-Speech (TTS) models designed to cover a wide range of production needs. The models differ in latency, quality, controllability, and intended use case, so you can choose the one that fits your application. MARS-Pro and MARS-Flash are also available on all major cloud providers.

MARS-Flash (22.05 kHz / 48 kHz)

Ultra-low latency TTS for real-time agents and assistants
  • Parameters: 600M
  • TTFB (time to first byte): As low as 150 ms on certain GPUs, such as Blackwell
  • Primary Use Cases:
    • Agentic conversations
    • Call center agents
    • Live conversational assistants

MARS-Pro (48 kHz)

Balanced speed and fidelity for expressive real-time speech
  • Parameters: 600M
  • TTFB: 800 ms to 2 s
  • Primary Use Cases:
    • Real-time translation with voice and emotion transfer
    • Expressive dubbing
    • Audiobooks and digital media
  • Notes: Delivers the best overall performance when speed is not the primary constraint, especially with short or challenging reference audio.

MARS-Instruct (22.05 kHz)

Director-level emotional and prosodic control
  • Parameters: 1.2B
  • TTFB: Higher latency (not intended for real-time use)
  • Primary Use Cases:
    • High-end TV and film production
    • Movie dubbing and post-production editing
  • Capabilities:
    • Independent control of speaker identity and prosody
    • Style and emotion can be tuned using both:
      • A reference audio sample
      • A textual description of desired prosody

MARS-Nano

Highly efficient TTS for on-device deployment
  • Parameters: 50M
  • TTFB: 500 ms to 2 s, depending on available compute
  • Primary Use Cases:
    • On-device applications
    • Environments with strict memory and compute constraints
  • Deployment Notes: Currently deployed with partners and providers such as Broadcom.

Tips For Best Results:

  • For text that contains numbers, expand the numbers into words. For example, write "123" as "one hundred twenty three" or "one two three", depending on how it should be read.
  • For code-switched sentences, transliterate the text into your chosen TTS language. We are improving the models to handle these cases better, but in practice most LLM outputs fed to the model already include these conversions, so we have prioritized other quality-related parameters.
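
The number-expansion tip above can be applied as a preprocessing step before text is sent to the model. The following is a minimal sketch for English only, not part of the MARS API: the helper names cardinal, spell_digits, and expand_numbers are hypothetical, only whole numbers below one million are spelled out, and decimals, currencies, ordinals, and dates are not handled.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Spell out an integer in the range 0 <= n < 1,000,000 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def spell_digits(token: str) -> str:
    """Read a digit string one digit at a time, e.g. '123' as 'one two three'."""
    return " ".join(ONES[int(ch)] for ch in token)


def expand_numbers(text: str, digit_by_digit: bool = False) -> str:
    """Replace every run of ASCII digits in text with English words."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Leading zeros and very long runs (IDs, phone numbers) read better
        # digit by digit, so they bypass the cardinal form.
        if digit_by_digit or len(token) > 6 or (len(token) > 1 and token[0] == "0"):
            return spell_digits(token)
        return cardinal(int(token))

    return re.sub(r"[0-9]+", replace, text)


print(expand_numbers("Your order 123 has shipped."))
print(expand_numbers("Your code is 123.", digit_by_digit=True))
```

Choose the cardinal or digit-by-digit reading per use case: quantities usually read best as cardinals, while codes and identifiers usually read best digit by digit.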

Advanced Customization

Fine-tune the audio with additional parameters that control the performance, style, and quality of the generated speech. Send them in the request payload; see the API Reference for details.
  • user_instructions: Guide the voice's delivery (e.g., "Warm, clear, and conversational"). Only supported with mars-instruct.
  • output_configuration: Set the audio format (wav, mp3), and apply enhancements.
  • voice_settings: Enhance reference audio quality or maintain the source accent.
  • inference_options: Adjust stability, temperature, and speaker similarity for unique results.
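
As a rough illustration of how these parameters might be assembled, the sketch below builds a request payload as a plain dictionary. Only the four top-level parameter names and the mars-instruct restriction come from the list above. Everything else is an assumption made for illustration: the build_payload helper, the "model", "text", and "language" fields, the lowercase "mars-pro" identifier, and the nested "format" and "temperature" keys. Check the API Reference for the actual schema.

```python
import json


def build_payload(text, model="mars-pro", language="en-us",
                  user_instructions=None, output_format="wav",
                  temperature=None):
    """Assemble a TTS request body (hypothetical field layout)."""
    # Documented constraint: user_instructions only works with mars-instruct.
    if user_instructions is not None and model != "mars-instruct":
        raise ValueError("user_instructions is only supported with mars-instruct")
    # The documentation lists wav and mp3 as audio formats.
    if output_format not in ("wav", "mp3"):
        raise ValueError(f"unsupported output format: {output_format!r}")

    payload = {
        "model": model,        # assumed field name
        "text": text,          # assumed field name
        "language": language,  # assumed field name
        "output_configuration": {"format": output_format},  # nested key assumed
    }
    if user_instructions is not None:
        payload["user_instructions"] = user_instructions
    if temperature is not None:
        payload["inference_options"] = {"temperature": temperature}  # nested key assumed
    return payload


payload = build_payload(
    "Welcome back. How can I help you today?",
    model="mars-instruct",
    user_instructions="Warm, clear, and conversational",
    output_format="mp3",
)
print(json.dumps(payload, indent=2))
```

Validating the model and parameter combination on the client side surfaces unsupported requests before any network call is made.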

Language Support

MARS-Flash, MARS-Pro, and MARS-Instruct are released across multiple languages, collectively covering 99% of the world's speaking population.
  • en-us: English (United States)
  • hi-in: Hindi (India)
  • fr-fr: French (France)
  • es-es: Spanish (Spain)
  • de-de: German
  • ja-jp: Japanese
  • ar-xa: Modern Standard Arabic
  • ko-kr: Korean
  • zh-cn: Chinese (Simplified)
  • it-it: Italian
  • es-mx: Spanish (Mexico)
  • pt-pt: Portuguese (Portugal)
  • pt-br: Portuguese (Brazil)
  • id-id: Indonesian
  • nl-nl: Dutch
  • ru-ru: Russian
  • ar-sa: Arabic (Saudi Arabia)
  • ta-in: Tamil
  • te-in: Telugu
  • bn-in: Bengali (India)
  • ar-eg: Arabic (Egypt)
  • ar-sy: Arabic (Syria)
  • ar-ma: Arabic (Morocco)
  • mr-in: Marathi
  • kn-in: Kannada
  • bn-bd: Bengali (Bangladesh)
  • as-in: Assamese
  • ml-in: Malayalam
  • fr-ca: French (Canada)
  • pl-pl: Polish
  • tr-tr: Turkish
  • pa-in: Punjabi
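
Because the codes above are lowercase and hyphenated, client code that receives locale tags in other spellings (for example en_US or EN-US) may want to normalize and check them before making a request. The sketch below is one way to do that. The normalize_language helper is hypothetical and not part of the MARS API, and the set simply mirrors the list above.

```python
SUPPORTED_LANGUAGES = frozenset({
    "en-us", "hi-in", "fr-fr", "es-es", "de-de", "ja-jp", "ar-xa", "ko-kr",
    "zh-cn", "it-it", "es-mx", "pt-pt", "pt-br", "id-id", "nl-nl", "ru-ru",
    "ar-sa", "ta-in", "te-in", "bn-in", "ar-eg", "ar-sy", "ar-ma", "mr-in",
    "kn-in", "bn-bd", "as-in", "ml-in", "fr-ca", "pl-pl", "tr-tr", "pa-in",
})


def normalize_language(tag: str) -> str:
    """Convert a locale tag to the lowercase, hyphenated form and validate it."""
    code = tag.strip().replace("_", "-").lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {tag!r}")
    return code


print(normalize_language("en_US"))
print(normalize_language("PT-BR"))
```

Keep the set in sync with the published list, since supported languages may change between releases.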