Amazing demo — would you mind sharing how you achieved that level of expressiveness?
Hi Nihal,
First of all, your TTS demo is incredible — the expressiveness and natural flow are really impressive.
I was curious: did you build a custom SSML layer or prompt formatting to control emotion/style, or is it handled natively by the API you’re using?
Also, I saw your mention of gpt-4o-mini-tts — is that an internal endpoint, or just a naming convention for your setup?
Would love to understand how close this is to being usable in real-time dialogue apps.
Thanks again for the amazing work!
Wandel
Hey there, @wandelrocha, thank you so much for the kind words! I really appreciate it!
And sure! I would love to explain!
OpenAI has their own model, named openai-audio, which is essentially a multi-modal, multi-lingual, stylized speech model combined with a conversational LM. It's not really a text-to-speech model but an audio-to-audio conversational LM, and the expressiveness and flow you noticed are inherent to its speech synthesis.
What I did is prompt-engineer it to repeat whatever the user types, spoken in the emotional style they specify. Again, it's not built from scratch, and I didn't train the model at all. I just steered it zero-shot with prompt engineering / prompt formatting.
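For anyone who wants to experiment with the same idea, here is a minimal sketch of that "repeat the user's text in a given emotional style" prompt formatting, using OpenAI's public audio-capable chat endpoint (gpt-4o-audio-preview) as a stand-in for the openai-audio backend described above. The system prompt wording, voice choice, and helper name `speak_with_emotion` are illustrative assumptions, not Nihal's exact setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak_with_emotion(text: str, emotion: str, out_path: str = "reply.wav") -> None:
    """Prompt a conversational audio model to repeat `text` in the requested style."""
    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",          # public audio-out chat model (stand-in)
        modalities=["text", "audio"],          # ask for spoken output alongside text
        audio={"voice": "alloy", "format": "wav"},
        messages=[
            {
                "role": "system",
                # The "prompt formatting" part: pin the model to verbatim repetition
                # while letting the delivery style vary with the requested emotion.
                "content": (
                    "You are a speech synthesizer. Repeat the user's text verbatim, "
                    f"spoken in a {emotion} style. Do not add or omit any words."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    # The audio comes back base64-encoded on the assistant message.
    wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
    with open(out_path, "wb") as f:
        f.write(wav_bytes)

speak_with_emotion("I can't believe the demo turned out this well!", "excited, slightly breathless")
```

The same pattern should carry over to other OpenAI-compatible audio endpoints; only the model name and client configuration would change.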
Hi @NihalGazi,
Thank you for the detailed explanation — that clarifies a lot!
I find your approach brilliant: using prompt-engineering alone to steer emotional tone and flow in openai-audio is both elegant and practical. The fact that you achieved such expressiveness without a dedicated SSML layer shows just how powerful this model is when guided skillfully.
It also opens up exciting possibilities for dialogue-based applications, especially when aiming for real-time interactions that feel emotionally authentic.
If you ever share more about your formatting strategies or structure for zero-shot prompting, I’d be thrilled to dive deeper.
Congrats again on the outstanding work — truly inspiring!
Best,
Wandel