πŸŽ™οΈ Whisper-SER (IEMOCAP + MEACorpus)

Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques


🧠 Overview

This repository hosts the model whisper-ser-iemocap_meacorpus.ckpt, released as part of the paper:

Bellver-Soler, J., Guragain, A., Ramos-Varela, S., CΓ³rdoba, R., & D’Haro, L.F. (2025) Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques. Universidad PolitΓ©cnica de Madrid (UPM) πŸ“„ SSRN Preprint

This model combines Whisper-large-v3 as a frozen acoustic encoder with single-head attentive pooling and a frozen Bloomz-7B1 LLM classification head, trained for emotion recognition on the IEMOCAP and MEACorpus datasets.


πŸš€ Model Description

| Component | Description |
|---|---|
| Audio Encoder | `openai/whisper-large-v3`, frozen during training |
| Pooling | Single-head attentive pooling layer |
| Classifier | Frozen `bigscience/bloomz-7b1` LLM used as classification head |
| Projection Layer | Linear adapter mapping acoustic embeddings to the LLM hidden size |
| Loss Function | Weighted cross-entropy |
| Optimizer | AdamW with cosine LR schedule and warm-up |
| Sampling Rate | 16 kHz |
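The pooling component above can be sketched as follows. This is a minimal, hypothetical implementation of single-head attentive pooling over Whisper frame embeddings; the layer names and the use of a single linear scoring head are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Collapse (batch, time, dim) frame embeddings into one utterance vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # single attention head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level embeddings
        weights = torch.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)                # (batch, dim)

pool = AttentivePooling(dim=1280)  # Whisper-large-v3 encoder width
frames = torch.randn(2, 100, 1280)  # dummy batch of 100 frames
utterance = pool(frames)
print(utterance.shape)  # torch.Size([2, 1280])
```

The attention weights sum to one over time, so the pooled vector is a convex combination of frames, letting the model emphasize emotionally salient segments.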

🎯 Task

Speech Emotion Recognition (SER): predict the emotion expressed in an audio waveform, choosing among:

['neutral', 'happy', 'sad', 'angry', 'fear', 'surprise', 'disgust']


🧩 Training Details

| Dataset | Language | Type | Hours | Notes |
|---|---|---|---|---|
| IEMOCAP | English | Acted | 7.0 | Standard benchmark |
| MEACorpus | Spanish | Natural | 13.2 | In-the-wild TV data |
  • Stratified speaker-independent splits (train/val/test)
  • 5 random seeds for robustness
  • Validation metric: F1-macro (95% CI)
  • Training in bfloat16 on 2Γ— NVIDIA A100 (40GB)
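The objective and schedule listed above (weighted cross-entropy, AdamW with warm-up into cosine decay) can be sketched as below. The class weights, learning rate, and step counts are illustrative placeholders, not the paper's hyperparameters.

```python
import math
import torch
import torch.nn as nn

num_classes, warmup_steps, total_steps = 7, 100, 1000

# Illustrative per-class weights to counter class imbalance
class_weights = torch.tensor([1.0, 1.2, 1.5, 1.3, 2.0, 2.0, 2.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = nn.Linear(1280, num_classes)  # stand-in for the trainable adapter
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    # Linear warm-up, then cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One dummy optimization step
logits = model(torch.randn(4, 1280))
loss = criterion(logits, torch.tensor([0, 3, 1, 6]))
loss.backward()
optimizer.step()
scheduler.step()
```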

πŸ“ˆ Performance

| Dataset | F1 (macro) | Augmentation |
|---|---|---|
| IEMOCAP | 0.719 | Spectrogram masking |
| MEACorpus | 0.786 | Mix-up |
| EMS (Spanish) | 0.783 | TTS augmentation |
| VERBO (Portuguese) | 0.765 | Mix-up |
| AhoEmo3 (Basque) | 0.994 | Mix-up |

Frozen Bloomz-7B1 improves F1 by +4.9% over a linear MLP head.


🧬 Architecture

Audio β†’ Whisper Encoder β†’ Attentive Pooling β†’ Linear Projection β†’ Bloomz-7B1 (frozen) β†’ Emotion logits
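The projection step in this pipeline can be sketched as below: a trainable linear adapter maps the pooled acoustic vector into the LLM hidden size, while the downstream classifier stays frozen. The module names, the 4096-dimensional LLM width, and the linear stand-in for the frozen Bloomz-7B1 head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ProjectionToFrozenHead(nn.Module):
    """Trainable adapter feeding a frozen classification head."""

    def __init__(self, audio_dim=1280, llm_dim=4096, num_classes=7):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)          # trainable adapter
        self.head = nn.Linear(llm_dim, num_classes)        # stand-in for frozen Bloomz-7B1
        for p in self.head.parameters():
            p.requires_grad = False                        # frozen, as in the table above

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.head(self.proj(pooled))

module = ProjectionToFrozenHead()
logits = module(torch.randn(2, 1280))  # pooled utterance embeddings
print(logits.shape)  # torch.Size([2, 7])
```

Freezing the head means gradients only update the adapter, which keeps the trainable parameter count small relative to the 7B-parameter LLM.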

πŸ§ͺ Usage Example

```python
import torch
import torchaudio

EMOTIONS = ['neutral', 'happy', 'sad', 'angry', 'fear', 'surprise', 'disgust']

# Load the model checkpoint. Depending on how it was saved, torch.load may
# return the full module or only a state_dict; a state_dict must be loaded
# into an instantiated model before use.
model = torch.load("whisper-ser-iemocap_meacorpus.ckpt", map_location="cpu")
model.eval()

# Example audio; the model expects 16 kHz input, so resample if needed
waveform, sr = torchaudio.load("example.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Forward pass
with torch.no_grad():
    logits = model(waveform)
    probs = torch.nn.functional.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)

print("Predicted emotion:", EMOTIONS[pred.item()])
```

🧾 Citation

If you use this model, please cite:

@article{bellversoler2025multilingual,
  title={Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques},
  author={Bellver-Soler, Jaime and Guragain, Anmol and Ramos-Varela, Samuel and CΓ³rdoba, Ricardo and D’Haro, Luis Fernando},
  journal={Computer Speech & Language},
  year={2025},
  note={Preprint, SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5244228}
}

πŸ“‚ Related Resources

| Resource | Description |
|---|---|
| 🧠 SpeechFactory (GitHub) | Codebase to generate synthetic emotional datasets |
| 🎧 SER-MSPMEA-Spanish (Hugging Face) | Synthetic Spanish emotional dataset generated via FishSpeech-TTS |
| MSP-MEA | Spanish extension of MSP-Podcast generated with voice cloning |
| Upcoming: IEMOCAP-MEA dataset | A cross-lingual dataset combining IEMOCAP and MEACorpus recordings, to be released soon, extending the current Whisper-SER model for multilingual benchmarking |

πŸ“œ License

Released under CC BY 4.0. Attribution required for derivative works. Note: MEACorpus includes YouTube-sourced content β€” additional rights may apply.


πŸ™Œ Acknowledgements

Supported by:

  • European Commission – ASTOUND3 (101071191, Horizon Europe)
  • MCIN/AEI/ERDF – Project BEWORD (PID2021-126061OB-C43)
  • INNOVATRAD-CM – Comunidad de Madrid (PHS-2024/PH-HUM-52)

Authors: Jaime Bellver-Soler, Anmol Guragain, Samuel Ramos-Varela, Ricardo CΓ³rdoba, Luis Fernando D’Haro Speech Technology and Machine Learning Group, Universidad PolitΓ©cnica de Madrid
