# Whisper-SER (IEMOCAP + MEACorpus)

Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques
## Overview
This repository hosts the model `whisper-ser-iemocap_meacorpus.ckpt`, released as part of the paper:
Bellver-Soler, J., Guragain, A., Ramos-Varela, S., Córdoba, R., & D'Haro, L.F. (2025). *Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques.* Universidad Politécnica de Madrid (UPM). SSRN preprint.
This model combines Whisper-large-v3 as a frozen pretrained acoustic encoder with single-head attentive pooling and a frozen Bloomz-7B1 LLM used as the classification head, trained for emotion recognition on the IEMOCAP and MEACorpus datasets.
## Model Description
| Component | Description |
|---|---|
| Audio Encoder | `openai/whisper-large-v3`, frozen during training |
| Pooling | Single-head attentive pooling layer |
| Classifier | Frozen `bigscience/bloomz-7b1` LLM used as classification head |
| Projection Layer | Linear adapter mapping acoustic embeddings to the LLM hidden size |
| Loss Function | Weighted cross-entropy |
| Optimizer | AdamW with cosine LR schedule and warm-up |
| Sampling Rate | 16 kHz |
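A minimal sketch of how the two frozen components above could be instantiated with Hugging Face `transformers`. This is illustrative only, not the released training code, and the variable names are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, WhisperModel

# Frozen acoustic encoder: only Whisper's encoder stack is used
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
for p in encoder.parameters():
    p.requires_grad = False

# Frozen Bloomz-7B1 LLM serving as the classification head
llm = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1", torch_dtype=torch.bfloat16
)
for p in llm.parameters():
    p.requires_grad = False
```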
## Task
Speech Emotion Recognition (SER): predict the emotion expressed in an audio waveform as one of:
`['neutral', 'happy', 'sad', 'angry', 'fear', 'surprise', 'disgust']`
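For convenience, predicted class indices can be mapped back to these labels. The ordering below is an assumption and may differ from the one used inside the checkpoint:

```python
# Assumed index-to-label mapping for the seven emotion classes
ID2LABEL = {
    0: "neutral", 1: "happy", 2: "sad", 3: "angry",
    4: "fear", 5: "surprise", 6: "disgust",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```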
## Training Details
| Dataset | Language | Type | Hours | Notes |
|---|---|---|---|---|
| IEMOCAP | English | Acted | 7.0 | Standard benchmark |
| MEACorpus | Spanish | Natural | 13.2 | In-the-wild TV data |
- Stratified speaker-independent splits (train/val/test)
- 5 random seeds for robustness
- Validation metric: F1-macro (95% CI)
- Training in bfloat16 on 2× NVIDIA A100 (40 GB); a sketch of the optimization setup is shown below
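A hedged sketch of the optimization setup listed above (weighted cross-entropy, AdamW with a cosine schedule and warm-up). The tiny placeholder model, class weights, learning rate, and step counts are illustrative rather than the paper's actual values, and bfloat16 mixed precision is omitted for brevity:

```python
import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(1280, 7)             # placeholder standing in for the full SER model
class_weights = torch.ones(7)          # placeholder per-class weights
criterion = nn.CrossEntropyLoss(weight=class_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# One illustrative optimization step on a random batch
features = torch.randn(8, 1280)
labels = torch.randint(0, 7, (8,))
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```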
## Performance
| Dataset | F1 (macro) | Augmentation |
|---|---|---|
| IEMOCAP | 0.719 | Spectrogram masking |
| MEACorpus | 0.786 | Mix-up |
| EMS (Spanish) | 0.783 | TTS Augmentation |
| VERBO (Portuguese) | 0.765 | Mix-up |
| AhoEmo3 (Basque) | 0.994 | Mix-up |
Using the frozen Bloomz-7B1 head improves macro-F1 by +4.9% over a linear MLP head.
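The augmentations named in the table follow standard recipes. Below is a generic sketch of SpecAugment-style spectrogram masking and mix-up, not necessarily the exact variants used in the paper:

```python
import torch
import torchaudio

# Spectrogram masking: mask random frequency and time bands of a mel spectrogram
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000)
mel = to_mel(torch.randn(1, 16000))                          # 1 s of dummy audio
mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(mel)
mel = torchaudio.transforms.TimeMasking(time_mask_param=35)(mel)

# Mix-up: convex combination of two examples and their one-hot labels
def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```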
## Architecture
Audio → Whisper Encoder → Attentive Pooling → Linear Projection → Bloomz-7B1 (frozen) → Emotion logits
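A minimal PyTorch sketch of this forward path. The `AttentivePooling` module, the stand-in linear head used here in place of the frozen Bloomz-7B1 classifier, and the exact dimensions are illustrative assumptions:

```python
import torch
from torch import nn

class AttentivePooling(nn.Module):
    """Single-head attentive pooling over encoder frames."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                    # frames: (batch, time, dim)
        weights = self.score(frames).softmax(dim=1)
        return (weights * frames).sum(dim=1)      # (batch, dim)

encoder_dim, llm_dim, num_classes = 1280, 4096, 7     # Whisper-large-v3 / Bloomz-7B1 sizes
pool = AttentivePooling(encoder_dim)
project = nn.Linear(encoder_dim, llm_dim)             # adapter into the LLM hidden size
classify = nn.Linear(llm_dim, num_classes)            # stand-in for the frozen Bloomz head

frames = torch.randn(2, 1500, encoder_dim)            # dummy Whisper encoder output
logits = classify(project(pool(frames)))              # (2, 7) emotion logits
```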
## Usage Example
```python
import torch
import torchaudio

# Load the checkpoint (assumed to store the full model object, not just a state dict)
model = torch.load(
    "whisper-ser-iemocap_meacorpus.ckpt", map_location="cpu", weights_only=False
)
model.eval()

# Load an example utterance; the model expects 16 kHz audio
waveform, sr = torchaudio.load("example.wav")
assert sr == 16000, "Audio must be 16 kHz"

# Forward pass
with torch.no_grad():
    logits = model(waveform)
    probs = torch.nn.functional.softmax(logits, dim=-1)
    emotion = probs.argmax(dim=-1)

print("Predicted emotion:", emotion.item())
```
## Citation
If you use this model, please cite:
```bibtex
@article{bellversoler2025multilingual,
  title={Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques},
  author={Bellver-Soler, Jaime and Guragain, Anmol and Ramos-Varela, Samuel and C{\'o}rdoba, Ricardo and D'Haro, Luis Fernando},
  journal={Computer Speech \& Language},
  year={2025},
  note={Preprint, SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5244228}
}
```
## Related Resources
| Resource | Description |
|---|---|
| SpeechFactory (GitHub) | Codebase to generate synthetic emotional datasets |
| SER-MSPMEA-Spanish (Hugging Face) | Synthetic Spanish emotional dataset generated via FishSpeech-TTS |
| MSP-MEA | Spanish extension of MSP-Podcast generated with voice cloning |
| Upcoming: IEMOCAP-MEA dataset | A new cross-lingual dataset combining IEMOCAP and MEACorpus recordings will be released soon, extending the current Whisper-SER model for multilingual benchmarking. |
## License
Released under CC BY 4.0. Attribution required for derivative works. Note: MEACorpus includes YouTube-sourced content; additional rights may apply.
## Acknowledgements
Supported by:
- European Commission – ASTOUND3 (101071191, Horizon Europe)
- MCIN/AEI/ERDF – Project BEWORD (PID2021-126061OB-C43)
- INNOVATRAD-CM – Comunidad de Madrid (PHS-2024/PH-HUM-52)
Authors: Jaime Bellver-Soler, Anmol Guragain, Samuel Ramos-Varela, Ricardo Córdoba, Luis Fernando D'Haro
Speech Technology and Machine Learning Group, Universidad Politécnica de Madrid