Create Voice from Description

This endpoint generates a custom, human-like synthetic voice from a text description. Instead of selecting from predefined voice options, you describe the voice characteristics you want. The request is asynchronous: the endpoint returns a task_id that you use to monitor generation progress and, once generation finishes, retrieve your custom voice.

A description consisting of at least 18 words (100+ characters) is required for voice generation. Requests with shorter prompts will not be processed successfully.

How Voice Generation Works

The voice generation process follows these steps:

1. You submit a detailed description of the voice you want to create, along with sample text for the voice to speak.
2. The system returns a task_id that you can use to track the generation process.
3. The system analyzes your description and generates three sample voices matching those characteristics for you to choose from.
4. You periodically check the status using the /text-to-voice/{task_id} endpoint.
5. Once generation is complete, you can access and use the generated voice in your applications.

Creating Effective Voice Descriptions

The quality and specificity of your voice description directly impacts the resulting voice. When crafting your description, consider including details about:

Gender and age range: “A middle-aged woman” or “An elderly man”
Accent and regional characteristics: “With a mild Scottish accent” or “Speaking American English with Southern inflections”
Emotional qualities: “A warm, nurturing tone” or “An authoritative, confident delivery”
Speaking style: “Who speaks slowly and deliberately” or “With an energetic, rapid-fire delivery”
Cultural context: “A voice that would be at home narrating documentaries” or “Like a friendly teacher explaining concepts”
Vocal characteristics: “With a slightly raspy quality” or “With a deep, resonant tone”

The more detailed and vivid your description, the more precisely the system can match your desired voice characteristics. Remember that your description must contain at least 18 words (100+ characters) to provide sufficient guidance for the voice generation system.
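A client-side check can catch prompts that are too short before a request is sent. The following is a minimal sketch, not the service's own validation: it assumes both minimums (18 words and 100 characters) must hold and that words are separated by whitespace, neither of which is specified above. The function name check_description is hypothetical.

```python
MIN_WORDS = 18   # minimum word count stated in this document
MIN_CHARS = 100  # minimum character count stated in this document


def check_description(description: str) -> list[str]:
    """Return a list of problems; an empty list means both minimums are met.

    Assumption: words are whitespace-separated. The service may count differently.
    """
    text = description.strip()
    problems = []
    word_count = len(text.split())
    if word_count < MIN_WORDS:
        problems.append(f"only {word_count} words, need at least {MIN_WORDS}")
    if len(text) < MIN_CHARS:
        problems.append(f"only {len(text)} characters, need at least {MIN_CHARS}")
    return problems
```

Returning a list of problems rather than a boolean lets the caller show the user exactly which minimum was missed.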

Example Request

{
  "text": "Welcome to our application. I'll be your guide through all the features and capabilities available to you.",
  "voice_description": "A warm and friendly middle-aged woman with a slight British accent. She speaks clearly and articulately, with a soothing tone that conveys expertise and trustworthiness. Her voice has a natural musical quality without being overly dramatic."
}
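The request body above could be assembled and addressed as in the sketch below. Several details are assumptions made for illustration: the base URL is a placeholder, and the POST path /text-to-voice is inferred from the status endpoint rather than stated on this page. Only the field names and the x-api-key header come from this document.

```python
import json
import urllib.request

# Assumptions: placeholder host and an inferred path, not documented values.
BASE_URL = "https://api.example.com"
CREATE_PATH = "/text-to-voice"


def build_create_request(api_key: str, text: str, voice_description: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a voice generation task."""
    payload = json.dumps(
        {"text": text, "voice_description": voice_description}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + CREATE_PATH,
        data=payload,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Building the request separately keeps the payload easy to inspect and test without network access.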

Response

Upon successful submission, the endpoint returns a task_id that you can use to check the status of your voice generation task:

{
  "task_id": "your_task_id",
  "status": "PENDING"
}
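A client will usually want to pull the task_id out of this response and fail early if it is missing. The sketch below assumes the response body is a JSON object with the two fields shown above; the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateVoiceResult:
    task_id: str
    status: str


def parse_create_response(body: str) -> CreateVoiceResult:
    """Extract task_id and status from the JSON body of the create response."""
    data = json.loads(body)
    if not isinstance(data, dict) or not data.get("task_id"):
        raise ValueError("response does not contain a task_id")
    # A missing status is tolerated and stored as an empty string.
    return CreateVoiceResult(
        task_id=str(data["task_id"]),
        status=str(data.get("status", "")),
    )
```

For example, parse_create_response('{"task_id": "abc", "status": "PENDING"}').task_id evaluates to "abc".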

Monitoring Generation Progress

Voice generation is computationally intensive, so results are not available immediately. To check the status of your generation task, periodically poll the /text-to-voice/{task_id} endpoint using the task_id received in the initial response.
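One way to structure the polling is a loop with a fixed interval and an overall timeout. This is a sketch under stated assumptions: "PENDING" is the only status value shown in this document, so the loop treats any other status as a signal to stop and hand the result back to the caller. The interval and timeout defaults are arbitrary, and fetch_status is a hypothetical callable that performs the GET on /text-to-voice/{task_id} and returns the decoded JSON.

```python
import time
from typing import Callable

# Assumption: only "PENDING" is documented; add other in-progress values if the API has them.
IN_PROGRESS = {"PENDING"}


def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the task leaves the in-progress states or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch_status(task_id)
        if result.get("status") not in IN_PROGRESS:
            return result  # caller decides whether this is success or failure
        if clock() + interval > deadline:
            raise TimeoutError(f"task {task_id} still pending after {timeout} seconds")
        sleep(interval)
```

Injecting fetch_status, sleep, and clock keeps the loop independent of any HTTP library and lets it be exercised without real delays.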

Best Practices

Be specific in your descriptions: The more detailed your voice description, the better the system can match your expectations.
Consider the context: Tailor your voice to match the content and audience of your application.
Start with longer descriptions: While 18 words is the minimum, starting with more detailed descriptions (30-50 words) often yields better results.
Test variations: If your first voice isn’t exactly what you need, try adjusting specific aspects of your description to refine the results.
Include emotional context: Describing the emotional quality of the voice significantly improves the naturalness of the generated speech.

Limitations

Voice descriptions must be at least 18 words (100+ characters) long.
Very unusual or contradictory voice descriptions may yield unpredictable results.

Used effectively, this endpoint lets you create custom voices that fit your brand identity, content needs, and user expectations, without the need for professional voice talent or recording studios.

Authorizations

x-api-key (string, header, required)

The x-api-key is a custom header required for authenticating requests to our API. Include this header in your request with the appropriate API key value to securely access our endpoints. You can find your API key(s) in the 'API' section of our studio website.
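To avoid hard-coding the key in source files, the header can be built from an environment variable. This is a minimal sketch: the variable name VOICE_API_KEY and the helper auth_headers are hypothetical choices, and only the header name x-api-key comes from this page.

```python
import os


def auth_headers(env_var: str = "VOICE_API_KEY") -> dict[str, str]:
    """Return the authentication header, reading the key from the environment.

    Assumption: the key is stored in an environment variable chosen by you;
    the name VOICE_API_KEY is illustrative only.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"set the {env_var} environment variable to your API key")
    return {"x-api-key": api_key}
```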

Body

application/json

Response

200

application/json

Successful response

A JSON object containing the unique identifier for the task. It is returned when a text-to-voice task is created, and you use it to query the status of that task while it runs.
