Overview

Transform text from information to experience. With mars-instruct, you can craft speech that captures subtle emotional states, dramatic pacing, and conversational dynamics, so the model performs the text rather than just reading it. mars-instruct gives you two ways to control expression:
  1. Embedded emotion tags - Add cues like [happy], [sad], or sound effects like [laughing], [sighing]
  2. user_instructions parameter - Provide broader tone guidance like "Speak in an excited, upbeat tone"

Emotion Tags

Emotion Tone Tags

For emotional tone (happy, sad, angry), use tags with user_instructions and match your text content to the emotion:
| Tag | Example Text | user_instructions |
| --- | --- | --- |
| [happy] | "We won the match! This is the best day ever!" | "happy, excited, celebrating" |
| [sad] | "I… I don't know if I can do this anymore…" | "sad, melancholic" |
Important: The text content and punctuation must match the emotion for best results.
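
The pairing in the table above can be sketched in Python. This is a minimal illustration, not part of any SDK: `emotion_kwargs` and `EMOTION_INSTRUCTIONS` are hypothetical names, and the sketch assumes the tag goes directly before the text it colors, as in the combined example later on this page. Only the keyword names `text` and `user_instructions` come from that example.

```python
# Hypothetical lookup: instruction strings copied from the table above.
EMOTION_INSTRUCTIONS = {
    "happy": "happy, excited, celebrating",
    "sad": "sad, melancholic",
}


def emotion_kwargs(emotion: str, text: str) -> dict:
    """Return keyword arguments with the tag prepended and matching instructions."""
    if emotion not in EMOTION_INSTRUCTIONS:
        raise ValueError(f"no instructions defined for emotion {emotion!r}")
    return {
        "text": f"[{emotion}] {text}",
        "user_instructions": EMOTION_INSTRUCTIONS[emotion],
    }


kwargs = emotion_kwargs("happy", "We won the match! This is the best day ever!")
print(kwargs["text"])
print(kwargs["user_instructions"])
```

The resulting dictionary could be unpacked into a request call alongside the other parameters, which keeps the tag and the instructions from drifting apart.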

Sound Effect Tags

Sound effect tags go within your sentence where the action naturally occurs:
| Tag | Example | Notes |
| --- | --- | --- |
| [laughing] | "That's ridiculous! [laughing] I can't believe that!" | Produces laughter sound |
| [sighing] | "I guess we have to start over. [sighing] Alright, let's begin." | Produces sigh sound |
| ahem ahem | "So what I was going to say is… ahem ahem… never mind." | Produces throat-clearing sound |
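
As a rough sketch of the placement described above, the hypothetical helper below puts a bracketed sound-effect tag between two sentences. `join_with_effect` and `KNOWN_EFFECTS` are illustrative names, and the set holds only the bracketed tags listed on this page; the model may accept others.

```python
# Bracketed sound-effect tags named on this page (assumed not exhaustive).
KNOWN_EFFECTS = {"laughing", "sighing"}


def join_with_effect(before: str, effect: str, after: str) -> str:
    """Place a bracketed sound-effect tag between two sentences."""
    if effect not in KNOWN_EFFECTS:
        raise ValueError(f"unknown sound effect {effect!r}")
    return f"{before.strip()} [{effect}] {after.strip()}"


line = join_with_effect("That's ridiculous!", "laughing", "I can't believe that!")
print(line)
```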

user_instructions

The user_instructions parameter provides broader tone guidance for your entire speech.
| Instruction | Effect |
| --- | --- |
| shouting, angry, threatening | Agitated, confrontational delivery |
| whispering, secretive | Quiet, intimate delivery |
| empathetic, helpful | Caring, supportive delivery |
| happy, excited, promotional | Upbeat, promotional delivery |
| patient, teaching | Educational, measured delivery |
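
The instructions in the table are written as comma-separated descriptors. The sketch below assumes that this format is what the parameter expects, which the page shows by example but does not state as a rule. `make_instructions` is a hypothetical helper that builds such a string while dropping blanks and repeats.

```python
def make_instructions(*descriptors: str) -> str:
    """Join descriptors into a comma-separated string, dropping blanks and repeats."""
    seen = []
    for descriptor in descriptors:
        cleaned = descriptor.strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return ", ".join(seen)


instructions = make_instructions("Whispering", " secretive ", "whispering")
print(instructions)
```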

Combining Both Methods

For precise control, combine user_instructions with embedded emotion tags:
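# Assumes `client` is an already-initialized SDK client and that
# StreamTtsOutputConfiguration has been imported from the SDK.
response = client.text_to_speech.tts(
    # Embedded tags mark where the sigh and the shift to a happy tone occur.
    text="[sighing] I have a secret to tell you... [happy] We're going to Paris!",
    voice_id=147320,
    language="en-us",
    speech_model="mars-instruct",
    # Broader guidance that applies to the whole utterance.
    user_instructions="emotional shifts from sad to excited",
    output_configuration=StreamTtsOutputConfiguration(format="wav"),
)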

Pauses

Add SSML-style breaks anywhere in your text for dramatic pauses:
You... must... understand... this. <break time='600ms'/> The future begins NOW.
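
When pause lengths vary, the break tag can be generated rather than typed by hand. A minimal sketch, assuming only the `<break time='…ms'/>` form shown above; `break_tag` is a hypothetical helper, and any other time units or attributes are not covered here.

```python
def break_tag(milliseconds: int) -> str:
    """Return a break tag in the form shown on this page."""
    if milliseconds <= 0:
        raise ValueError("pause length must be positive")
    return f"<break time='{milliseconds}ms'/>"


text = f"You... must... understand... this. {break_tag(600)} The future begins NOW."
print(text)
```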


Best Practices

  1. Combine methods - Use both user_instructions and embedded tags for best results
  2. Match content to emotion - Text and punctuation should reflect the emotional tone
  3. Place sound effects naturally - Tags like [laughing], [sighing] work best within sentences
  4. Use emotions with instructions - Tags like [happy], [sad] need user_instructions to work well
  5. Add pauses - Use <break time='600ms'/> for dramatic effect
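
Practice 4 lends itself to a quick pre-flight check before a request is sent. The sketch below is a hypothetical lint, not an SDK feature: `check_request` and `EMOTION_TAGS` are illustrative names, and the tag set holds only the tone tags shown on this page.

```python
import re
from typing import List

# Tone tags shown on this page; assumed not exhaustive.
EMOTION_TAGS = {"happy", "sad"}

# Matches bracketed lowercase tags such as [happy] or [laughing].
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def check_request(text: str, user_instructions: str = "") -> List[str]:
    """Return warnings for emotion tone tags sent without user_instructions."""
    warnings = []
    tags = TAG_PATTERN.findall(text)
    if any(tag in EMOTION_TAGS for tag in tags) and not user_instructions.strip():
        warnings.append("emotion tag used without user_instructions")
    return warnings


print(check_request("[happy] We won the match!"))
print(check_request("[happy] We won the match!", "happy, excited"))
```

Sound-effect tags such as [laughing] pass without a warning, since this page only says that tone tags need accompanying instructions.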
