---
language: en
license: mit
tags:
- audio
- emotion-recognition
- valence-arousal
- vision-transformer
- pytorch
- music-emotion-recognition
datasets:
- custom
metrics:
- mse
- mae
pipeline_tag: audio-classification
---

# ViT for Audio Emotion Recognition (Valence-Arousal)

This model is a fine-tuned Vision Transformer (ViT) for audio emotion recognition. It predicts valence and arousal values in the continuous range [-1, 1].

## Model Description

- **Base Model**: google/vit-base-patch16-224-in21k
- **Task**: Audio emotion recognition (regression)
- **Output**: Valence and arousal predictions (2D continuous emotion space)
- **Range**: [-1, 1] for both dimensions
- **Input**: Mel spectrogram images (224x224 RGB)

## Architecture

```
ViT Base (86M parameters)
        ↓
CLS Token Output (768-dim)
        ↓
LayerNorm + Dropout
        ↓
Linear (768 → 512) + GELU + Dropout
        ↓
Linear (512 → 128) + GELU + Dropout
        ↓
Linear (128 → 2) + Tanh
        ↓
[Valence, Arousal] ∈ [-1, 1]²
```

## Usage

### Prerequisites

```bash
pip install torch torchvision transformers librosa numpy pillow
```

### Loading the Model

```python
import torch
import torch.nn as nn
from transformers import ViTModel


class ViTForEmotionRegression(nn.Module):
    def __init__(self, model_name='google/vit-base-patch16-224-in21k', num_emotions=2, dropout=0.1):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        hidden_size = self.vit.config.hidden_size
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, 128),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_emotions),
            nn.Tanh()
        )

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values)
        cls_output = outputs.last_hidden_state[:, 0]
        return self.head(cls_output)


# Load the model
model = ViTForEmotionRegression()
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
```

### Audio Preprocessing

```python
import librosa
import numpy as np
from PIL import Image
import torch
from torchvision import transforms


def preprocess_audio(audio_path):
    # Load audio (first 30 seconds at 22.05 kHz)
    y, sr = librosa.load(audio_path, sr=22050, duration=30)

    # Generate mel spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, hop_length=512, n_fft=2048
    )
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Normalize to 0-255 for RGB conversion
    mel_normalized = ((mel_db - mel_db.min()) / (mel_db.max() - mel_db.min()) * 255).astype(np.uint8)

    # Convert to RGB image
    image = Image.fromarray(mel_normalized).convert('RGB')
    image = image.resize((224, 224))

    # Apply ImageNet normalization
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    return transform(image).unsqueeze(0)


# Process audio
audio_tensor = preprocess_audio('your_audio.mp3')

# Predict emotions
with torch.no_grad():
    predictions = model(audio_tensor)
    valence, arousal = predictions[0].tolist()

print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f}")
```

### Emotion Quadrant Mapping

```python
def classify_emotion(valence, arousal):
    if valence >= 0 and arousal >= 0:
        return "HAPPY" if valence > arousal else "EXCITED"
    elif valence >= 0 and arousal < 0:
        return "CALM" if abs(arousal) > valence else "CONTENT"
    elif valence < 0 and arousal < 0:
        return "SAD" if abs(valence) > abs(arousal) else "BORED"
    else:  # valence < 0 and arousal >= 0
        return "TENSE" if arousal > abs(valence) else "ANGRY"
```
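
As a quick check, the mapping can be applied directly to the `valence` and `arousal` values predicted in the preprocessing example above (the sample pairs in the comment are illustrative inputs, not actual model outputs):

```python
# Convert the continuous prediction into a discrete quadrant label
label = classify_emotion(valence, arousal)
print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f} -> {label}")

# Illustrative mappings: (0.72, 0.35) -> "HAPPY", (-0.40, -0.10) -> "SAD"
```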
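
### Visualizing Predictions

The continuous outputs can also be placed on a valence-arousal quadrant plot (as referenced under Performance below). The following is a minimal sketch using `matplotlib`, which is not listed in the prerequisites and must be installed separately; the helper `plot_valence_arousal` is our own illustration, not part of the model:

```python
import matplotlib.pyplot as plt


def plot_valence_arousal(points, labels=None):
    """Scatter (valence, arousal) pairs on the 2D emotion quadrant."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.axhline(0, color='gray', linewidth=0.8)  # horizontal line: valence (x) axis
    ax.axvline(0, color='gray', linewidth=0.8)  # vertical line: arousal (y) axis
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel('Valence')
    ax.set_ylabel('Arousal')

    for i, (v, a) in enumerate(points):
        ax.scatter(v, a)
        if labels:
            ax.annotate(labels[i], (v, a), textcoords='offset points', xytext=(5, 5))

    plt.tight_layout()
    plt.show()


# Plot the single prediction from the examples above with its quadrant label
plot_valence_arousal([(valence, arousal)], [classify_emotion(valence, arousal)])
```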
## Model Details

- **Parameters**: ~86.8M
- **Model Size**: ~331 MB
- **Framework**: PyTorch
- **Base Architecture**: ViT-Base (12 layers, 768 hidden size, 12 attention heads)
- **Custom Head**: 3-layer MLP with GELU activations
- **Training Data**: Custom audio emotion dataset
- **Training**: Fine-tuned with MSE loss on valence-arousal targets

## Emotion Space

The model predicts emotions in the 2D circumplex model:

```
              High Arousal
                   |
   Angry   Tense   |   Excited
                   |
 Sad ------------- + ------------- Happy
                   |
           Bored   |   Calm   Content
                   |
               Low Arousal
```

- **Valence**: Negative (unpleasant) ↔ Positive (pleasant)
- **Arousal**: Low (calm) ↔ High (energetic)

## Performance

The model outputs continuous predictions that can be:

- Used directly for emotion intensity analysis
- Mapped to discrete emotion categories
- Visualized on emotion quadrant plots (see the plotting sketch under Usage)

## Limitations

- Trained on music/audio; performance may vary on speech
- Requires mel spectrogram preprocessing
- Fixed 30-second input window (longer audio is truncated to the first 30 seconds)
- Possible cultural bias inherited from the training data

## Citation

```bibtex
@misc{sentio-vit-emotion,
  title={Vision Transformer for Audio Emotion Recognition},
  author={SentioApp Team},
  year={2025},
  publisher={HuggingFace}
}
```

## License

MIT License