One-person companies, especially the bootstrapped ones, often struggle with video ads. One of the most nagging problems is generating speech for an ad. The problem isn’t specific to bootstrapped startups, either. Marketers in general have to deal with speech for their ads, especially when the characters in the ad are generated with AI. Speech is one of the most powerful ways marketers persuade people to buy their stuff.
AI is taking over more and more of our work, and speech is no exception. Cartesia is a company that focuses on AI for speech.
In this tutorial, we’ll explore the new Sonic 3 model, which turns written text into realistic, emotionally expressive speech. We’ll show you how to access the Sonic 3 model in Cartesia, write text for speech with emotional sounds, clone a voice, and adjust the speed and volume in real time.
By the end of this tutorial, you’ll be able to:
- Access the Sonic 3 model
- Write the text for a speech with emotional sounds
- Clone anyone’s voice
- Adjust the speed and volume of the speech in real time
Let’s dive right into it!
Step 1 - Access the Sonic 3 model
Sonic 3 is Cartesia’s newest model. It generates speech that carries emotion, much like a human speaker.
Visit Cartesia and log in to an account or sign up for a new one.

After signing in, you land on the main page where the real action happens: the AI playground.

In the controls side pane, check which model is selected and change it to ‘Sonic 3’. Select a voice for the speech. Keep the speed and volume at their defaults and select an emotion to go with the speech.

Step 2 - Write the text for a speech with emotional sounds
Here comes the exciting part. Write your script and add emotional tags to it. Sonic 3 converts the text into speech and voices the emotional cues, such as laughter or a weeping voice. Here’s an example of text with emotional tags. Enter it and click ‘Speak.’
Text prompt:
Why do you never see elephants hiding in trees? <break time="600ms"/> because they are really good at it.[laughter]
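If you write many prompts like this, typing the tags by hand gets error-prone. Here is a minimal Python sketch that assembles a prompt shaped like the one above. The helper names (pause, emotion, build_prompt) are made up for illustration and are not part of Cartesia; the code only builds the text you would paste into the playground.

```python
# Hypothetical helpers: they only build strings, they do not call Cartesia.

LAUGHTER = "[laughter]"  # sound tag as written in the prompt above


def pause(ms: int) -> str:
    """Return a break tag in the same form the tutorial prompt uses."""
    if ms < 0:
        raise ValueError("pause length must not be negative")
    return f'<break time="{ms}ms"/>'


def emotion(value: str) -> str:
    """Return an emotion tag, for example emotion('happy')."""
    return f'<emotion value="{value}"/>'


def build_prompt(setup: str, punchline: str, pause_ms: int = 600) -> str:
    """Join a setup and a punchline with a pause, then end on laughter."""
    return f"{setup} {pause(pause_ms)} {punchline} {LAUGHTER}"


if __name__ == "__main__":
    print(build_prompt(
        "Why do you never see elephants hiding in trees?",
        "Because they are really good at it.",
    ))
```

Keeping the tags in small functions means a typo in a tag only has to be fixed in one place.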

Let’s try something else: a block of text that actually reads like a narration.
Text prompt:
Friends, I feel both excitement and caution about AI <emotion value="happy" />. It writes our emails faster, spots disease earlier, and gives lonely nights a voice <emotion value="happy"/>. Yet it studies us too closely, steers our choices, and sometimes speaks nonsense with perfect confidence [laughter]. I have seen a founder reclaim dinners with family, and a worker carry a final box while a dashboard cheered productivity <break time="600ms"/>. The promise is real, the peril is real, and the difference will be our choices, not the code <break time="800ms"/>. Let us demand transparency, protect dignity, and measure progress by people lifted, not just profits counted. If we do this well, our tools will make us more human. If we fail, we wake to a world that is very smart and strangely empty <break time="600ms"/>. Let us choose wisely together <emotion value="excited"/>.
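With a longer script it helps to see at a glance how many tags you used and how much silence the breaks add. The sketch below is one way to do that in Python. It assumes the tags follow the exact forms used in the prompts above (break times in milliseconds, emotion tags with a value attribute, sound cues in square brackets), and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Patterns for the tag forms seen in the prompts above (an assumption:
# other tag spellings will simply not be counted).
BREAK_RE = re.compile(r'<break\s+time="(\d+)ms"\s*/>')
EMOTION_RE = re.compile(r'<emotion\s+value="([^"]+)"\s*/>')
SOUND_RE = re.compile(r"\[(\w+)\]")
ANY_TAG_RE = re.compile(r"<[^>]+>|\[\w+\]")


def summarize(script: str) -> dict:
    """Count spoken words, total pause time, and the tags in a script."""
    spoken = ANY_TAG_RE.sub(" ", script)  # drop tags, keep the words
    return {
        "words": len(re.findall(r"[\w']+", spoken)),
        "pause_ms": sum(int(ms) for ms in BREAK_RE.findall(script)),
        "emotions": dict(Counter(EMOTION_RE.findall(script))),
        "sounds": dict(Counter(SOUND_RE.findall(script))),
    }


if __name__ == "__main__":
    sample = (
        'Hello there <emotion value="happy" />. '
        'Wait <break time="600ms"/> for it [laughter].'
    )
    print(summarize(sample))
```

The word count is only a rough guide to length, since the actual duration depends on the voice and the speed you pick.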

Step 3 - Clone anyone’s voice
Another great feature of Cartesia Sonic 3 is voice cloning. You can record your own voice or upload a voice sample. Cartesia analyzes the sample and creates a voice that you can use for text-to-speech (TTS) voiceovers.
Click ‘Instant clone’ in the navigation bar on the left.

Next, click the link under ‘Input’ to upload a voice sample or record a voice.

Select a 10-second snippet from the uploaded voice sample. Fill in the details, description, and language, and click ‘Clone.’
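If your recording is long, you may prefer to cut the 10-second snippet yourself before uploading. Here is a minimal sketch using Python’s built-in wave module. It assumes the recording is an uncompressed WAV file and that the upload dialog accepts WAV; the function name trim_wav and the file names are hypothetical.

```python
import wave


def trim_wav(src_path: str, dst_path: str,
             start_s: float = 0.0, length_s: float = 10.0) -> float:
    """Copy up to length_s seconds of audio, starting at start_s, into
    dst_path. Returns the duration actually written, in seconds."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp the window so it never runs past the end of the file.
        start = min(int(start_s * rate), total)
        count = min(int(length_s * rate), total - start)
        src.setpos(start)
        frames = src.readframes(count)
        params = src.getparams()

    with wave.open(dst_path, "wb") as dst:
        # Reuse channels, sample width, and rate from the source; the
        # frame count in the header is corrected when the file closes.
        dst.setparams(params)
        dst.writeframes(frames)

    return count / rate


if __name__ == "__main__":
    # Hypothetical file names: replace them with your own recording.
    seconds = trim_wav("my_voice.wav", "my_voice_10s.wav", start_s=2.0)
    print(f"Wrote {seconds:.1f} seconds")
```

Pick a start time where you speak clearly and without background noise, since the clone is built from just this short clip.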

Cartesia will clone the voice instantly. To check the result, type some text and have it spoken in the voice you just cloned.

Step 4 - Adjust the speed and volume of the speech in real time
You can also adjust the speed and volume of the speech while the text is being spoken. Click ‘Speak’ and change the speed and volume as the audio plays.

That’s it for this tutorial, voice nerds! It’s all about finding the right tone, voice, volume, and speed. Getting there takes a bit of time, but once you have your signature voice, you can use it in your ads or any other marketing material.
