Amazing demo — would you mind sharing how you achieved that level of expressiveness?
Hi Nihal,
First of all, your TTS demo is incredible — the expressiveness and natural flow are really impressive.
I was curious: did you build a custom SSML layer or prompt formatting to control emotion/style, or is it handled natively by the API you’re using?
Also, I saw your mention of gpt-4o-mini-tts — is that an internal endpoint, or just a naming convention for your setup?
Would love to understand how close this is to being usable in real-time dialogue apps.
Thanks again for the amazing work!
Wandel
Hey there, @wandelrocha, thank you so much for the kind words! I really appreciate it!
And sure! I would love to explain!
OpenAI has their own model, named openai-audio, which is essentially a multi-modal, multi-lingual, stylized speech model combined with a conversational LM. It's not really a text-to-speech model but an audio-to-audio conversational LM, and the expressiveness and flow you noticed are inherent to its speech synthesis.
What I did is prompt-engineer it to repeat whatever the user types, spoken in the emotional style they specify. Again, it's not built from scratch, and I didn't train the model at all. I just steered it zero-shot with prompt engineering / prompt formatting.
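For anyone who wants to experiment with the same idea, here is a minimal sketch of that "repeat the user's text in a given emotional style" prompt formatting, using OpenAI's public audio-capable chat endpoint (gpt-4o-audio-preview) as a stand-in for the openai-audio backend described above. The system prompt wording, voice choice, and helper name `speak_with_emotion` are illustrative assumptions, not Nihal's exact setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak_with_emotion(text: str, emotion: str, out_path: str = "reply.wav") -> None:
    """Prompt a conversational audio model to repeat `text` in the requested style."""
    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",          # public audio-out chat model (stand-in)
        modalities=["text", "audio"],          # ask for spoken output alongside text
        audio={"voice": "alloy", "format": "wav"},
        messages=[
            {
                "role": "system",
                # The "prompt formatting" part: pin the model to verbatim repetition
                # while letting the delivery style vary with the requested emotion.
                "content": (
                    "You are a speech synthesizer. Repeat the user's text verbatim, "
                    f"spoken in a {emotion} style. Do not add or omit any words."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    # The audio comes back base64-encoded on the assistant message.
    wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
    with open(out_path, "wb") as f:
        f.write(wav_bytes)

speak_with_emotion("I can't believe the demo turned out this well!", "excited, slightly breathless")
```

The same pattern should carry over to other OpenAI-compatible audio endpoints; only the model name and client configuration would change.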
Hi @NihalGazi,
Thank you for the detailed explanation — that clarifies a lot!
I find your approach brilliant: using prompt-engineering alone to steer emotional tone and flow in openai-audio is both elegant and practical. The fact that you achieved such expressiveness without a dedicated SSML layer shows just how powerful this model is when guided skillfully.
It also opens up exciting possibilities for dialogue-based applications, especially when aiming for real-time interactions that feel emotionally authentic.
If you ever share more about your formatting strategies or structure for zero-shot prompting, I’d be thrilled to dive deeper.
Congrats again on the outstanding work — truly inspiring!
Best,
Wandel