# Whisper-SER (IEMOCAP + MEACorpus)

Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques
## Overview
This repository hosts the model `whisper-ser-iemocap_meacorpus.ckpt`, released as part of the paper:
Bellver-Soler, J., Guragain, A., Ramos-Varela, S., Córdoba, R., & D'Haro, L.F. (2025). *Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques.* Universidad Politécnica de Madrid (UPM). SSRN preprint.
This model combines Whisper-large-v3 as a frozen pretrained acoustic encoder with single-head attentive pooling and a frozen Bloomz-7B1 LLM used as the classification head, trained for emotion recognition on the IEMOCAP and MEACorpus datasets.
## Model Description
| Component | Description |
|---|---|
| Audio Encoder | `openai/whisper-large-v3`, frozen during training |
| Pooling | Single-head attentive pooling layer |
| Classifier | Frozen `bigscience/bloomz-7b1` LLM used as classification head |
| Projection Layer | Linear adapter mapping acoustic embeddings to the LLM hidden size |
| Loss Function | Weighted cross-entropy |
| Optimizer | AdamW with cosine LR schedule and warm-up |
| Sampling Rate | 16 kHz |
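A minimal sketch of how the two frozen components above could be instantiated with Hugging Face `transformers`. This is illustrative only, not the released training code, and the variable names are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, WhisperModel

# Frozen acoustic encoder: only Whisper's encoder stack is used
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
for p in encoder.parameters():
    p.requires_grad = False

# Frozen Bloomz-7B1 LLM serving as the classification head
llm = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1", torch_dtype=torch.bfloat16
)
for p in llm.parameters():
    p.requires_grad = False
```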
## Task
Speech Emotion Recognition (SER): predict the emotion expressed in an audio waveform as one of:
`['neutral', 'happy', 'sad', 'angry', 'fear', 'surprise', 'disgust']`
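For convenience, predicted class indices can be mapped back to these labels. The ordering below is an assumption and may differ from the one used inside the checkpoint:

```python
# Assumed index-to-label mapping for the seven emotion classes
ID2LABEL = {
    0: "neutral", 1: "happy", 2: "sad", 3: "angry",
    4: "fear", 5: "surprise", 6: "disgust",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```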
## Training Details
| Dataset | Language | Type | Hours | Notes |
|---|---|---|---|---|
| IEMOCAP | English | Acted | 7.0 | Standard benchmark |
| MEACorpus | Spanish | Natural | 13.2 | In-the-wild TV data |
- Stratified speaker-independent splits (train/val/test)
- 5 random seeds for robustness
- Validation metric: F1-macro (95% CI)
- Training in bfloat16 on 2× NVIDIA A100 (40 GB); a sketch of the optimization setup is shown below
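A hedged sketch of the optimization setup listed above (weighted cross-entropy, AdamW with a cosine schedule and warm-up). The tiny placeholder model, class weights, learning rate, and step counts are illustrative rather than the paper's actual values, and bfloat16 mixed precision is omitted for brevity:

```python
import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(1280, 7)             # placeholder standing in for the full SER model
class_weights = torch.ones(7)          # placeholder per-class weights
criterion = nn.CrossEntropyLoss(weight=class_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# One illustrative optimization step on a random batch
features = torch.randn(8, 1280)
labels = torch.randint(0, 7, (8,))
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```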
## Performance
| Dataset | F1 (macro) | Augmentation |
|---|---|---|
| IEMOCAP | 0.719 | Spectrogram masking |
| MEACorpus | 0.786 | Mix-up |
| EMS (Spanish) | 0.783 | TTS Augmentation |
| VERBO (Portuguese) | 0.765 | Mix-up |
| AhoEmo3 (Basque) | 0.994 | Mix-up |
Using the frozen Bloomz-7B1 head improves macro-F1 by +4.9% over a linear MLP head.
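The augmentations named in the table follow standard recipes. Below is a generic sketch of SpecAugment-style spectrogram masking and mix-up, not necessarily the exact variants used in the paper:

```python
import torch
import torchaudio

# Spectrogram masking: mask random frequency and time bands of a mel spectrogram
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000)
mel = to_mel(torch.randn(1, 16000))                          # 1 s of dummy audio
mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(mel)
mel = torchaudio.transforms.TimeMasking(time_mask_param=35)(mel)

# Mix-up: convex combination of two examples and their one-hot labels
def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```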
## Architecture
Audio → Whisper Encoder → Attentive Pooling → Linear Projection → Bloomz-7B1 (frozen) → Emotion logits
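A minimal PyTorch sketch of this forward path. The `AttentivePooling` module, the stand-in linear head used here in place of the frozen Bloomz-7B1 classifier, and the exact dimensions are illustrative assumptions:

```python
import torch
from torch import nn

class AttentivePooling(nn.Module):
    """Single-head attentive pooling over encoder frames."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                    # frames: (batch, time, dim)
        weights = self.score(frames).softmax(dim=1)
        return (weights * frames).sum(dim=1)      # (batch, dim)

encoder_dim, llm_dim, num_classes = 1280, 4096, 7     # Whisper-large-v3 / Bloomz-7B1 sizes
pool = AttentivePooling(encoder_dim)
project = nn.Linear(encoder_dim, llm_dim)             # adapter into the LLM hidden size
classify = nn.Linear(llm_dim, num_classes)            # stand-in for the frozen Bloomz head

frames = torch.randn(2, 1500, encoder_dim)            # dummy Whisper encoder output
logits = classify(project(pool(frames)))              # (2, 7) emotion logits
```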
## Usage Example
```python
import torch
import torchaudio

# Load the checkpoint (assumed to store the full model object, not just a state dict)
model = torch.load(
    "whisper-ser-iemocap_meacorpus.ckpt", map_location="cpu", weights_only=False
)
model.eval()

# Load an example utterance; the model expects 16 kHz audio
waveform, sr = torchaudio.load("example.wav")
assert sr == 16000, "Audio must be 16 kHz"

# Forward pass
with torch.no_grad():
    logits = model(waveform)
    probs = torch.nn.functional.softmax(logits, dim=-1)
    emotion = probs.argmax(dim=-1)

print("Predicted emotion:", emotion.item())
```
## Citation
If you use this model, please cite:
```bibtex
@article{bellversoler2025multilingual,
  title={Multilingual Speech Emotion Recognition for Iberian Languages: A Generative AI Approach with LLMs and Data Augmentation Techniques},
  author={Bellver-Soler, Jaime and Guragain, Anmol and Ramos-Varela, Samuel and C{\'o}rdoba, Ricardo and D'Haro, Luis Fernando},
  journal={Computer Speech \& Language},
  year={2025},
  note={Preprint, SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5244228}
}
```
## Related Resources
| Resource | Description |
|---|---|
| SpeechFactory (GitHub) | Codebase to generate synthetic emotional datasets |
| SER-MSPMEA-Spanish (Hugging Face) | Synthetic Spanish emotional dataset generated via FishSpeech-TTS |
| MSP-MEA | Spanish extension of MSP-Podcast generated with voice cloning |
| Upcoming: IEMOCAP-MEA dataset | A new cross-lingual dataset combining IEMOCAP and MEACorpus recordings will be released soon, extending the current Whisper-SER model for multilingual benchmarking. |
## License
Released under CC BY 4.0. Attribution required for derivative works. Note: MEACorpus includes YouTube-sourced content; additional rights may apply.
## Acknowledgements
Supported by:
- European Commission – ASTOUND3 (101071191, Horizon Europe)
- MCIN/AEI/ERDF – Project BEWORD (PID2021-126061OB-C43)
- INNOVATRAD-CM – Comunidad de Madrid (PHS-2024/PH-HUM-52)
Authors: Jaime Bellver-Soler, Anmol Guragain, Samuel Ramos-Varela, Ricardo Córdoba, Luis Fernando D'Haro
Speech Technology and Machine Learning Group, Universidad Politécnica de Madrid