---
language: en
license: mit
tags:
- audio
- emotion-recognition
- valence-arousal
- vision-transformer
- pytorch
- music-emotion-recognition
datasets:
- custom
metrics:
- mse
- mae
pipeline_tag: audio-classification
---

# ViT for Audio Emotion Recognition (Valence-Arousal)

This model is a fine-tuned Vision Transformer (ViT) for audio emotion recognition. It predicts valence and arousal values in the continuous range [-1, 1].

## Model Description

- **Base Model**: google/vit-base-patch16-224-in21k
- **Task**: Audio emotion recognition (regression)
- **Output**: Valence and arousal predictions (2D continuous emotion space)
- **Range**: [-1, 1] for both dimensions
- **Input**: Mel spectrogram images (224x224 RGB)

## Architecture

```
ViT Base (86M parameters)
        ↓
CLS Token Output (768-dim)
        ↓
LayerNorm + Dropout
        ↓
Linear (768 → 512) + GELU + Dropout
        ↓
Linear (512 → 128) + GELU + Dropout
        ↓
Linear (128 → 2) + Tanh
        ↓
[Valence, Arousal] ∈ [-1, 1]²
```

## Usage

### Prerequisites

```bash
pip install torch torchvision transformers librosa numpy pillow
```

### Loading the Model

```python
import torch
import torch.nn as nn
from transformers import ViTModel


class ViTForEmotionRegression(nn.Module):
    def __init__(self, model_name='google/vit-base-patch16-224-in21k', num_emotions=2, dropout=0.1):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        hidden_size = self.vit.config.hidden_size
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, 128),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(128, num_emotions),
            nn.Tanh()
        )

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values)
        cls_output = outputs.last_hidden_state[:, 0]
        return self.head(cls_output)


# Load the model
model = ViTForEmotionRegression()
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
```

### Audio Preprocessing

```python
import librosa
import numpy as np
from PIL import Image
import torch
from torchvision import transforms


def preprocess_audio(audio_path):
    # Load audio (first 30 seconds at 22.05 kHz)
    y, sr = librosa.load(audio_path, sr=22050, duration=30)

    # Generate mel spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, hop_length=512, n_fft=2048
    )
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Normalize to 0-255 for RGB conversion
    mel_normalized = ((mel_db - mel_db.min()) / (mel_db.max() - mel_db.min()) * 255).astype(np.uint8)

    # Convert to RGB image
    image = Image.fromarray(mel_normalized).convert('RGB')
    image = image.resize((224, 224))

    # Apply ImageNet normalization
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    return transform(image).unsqueeze(0)


# Process audio
audio_tensor = preprocess_audio('your_audio.mp3')

# Predict emotions
with torch.no_grad():
    predictions = model(audio_tensor)
    valence, arousal = predictions[0].tolist()

print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f}")
```

### Emotion Quadrant Mapping

```python
def classify_emotion(valence, arousal):
    if valence >= 0 and arousal >= 0:
        return "HAPPY" if valence > arousal else "EXCITED"
    elif valence >= 0 and arousal < 0:
        return "CALM" if abs(arousal) > valence else "CONTENT"
    elif valence < 0 and arousal < 0:
        return "SAD" if abs(valence) > abs(arousal) else "BORED"
    else:  # valence < 0 and arousal >= 0
        return "TENSE" if arousal > abs(valence) else "ANGRY"
```
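
As a quick check, the mapping can be applied directly to the `valence` and `arousal` values predicted in the preprocessing example above (the sample pairs in the comment are illustrative inputs, not actual model outputs):

```python
# Convert the continuous prediction into a discrete quadrant label
label = classify_emotion(valence, arousal)
print(f"Valence: {valence:.3f}, Arousal: {arousal:.3f} -> {label}")

# Illustrative mappings: (0.72, 0.35) -> "HAPPY", (-0.40, -0.10) -> "SAD"
```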
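
### Visualizing Predictions

The continuous outputs can also be placed on a valence-arousal quadrant plot (as referenced under Performance below). The following is a minimal sketch using `matplotlib`, which is not listed in the prerequisites and must be installed separately; the helper `plot_valence_arousal` is our own illustration, not part of the model:

```python
import matplotlib.pyplot as plt


def plot_valence_arousal(points, labels=None):
    """Scatter (valence, arousal) pairs on the 2D emotion quadrant."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.axhline(0, color='gray', linewidth=0.8)  # horizontal line: valence (x) axis
    ax.axvline(0, color='gray', linewidth=0.8)  # vertical line: arousal (y) axis
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel('Valence')
    ax.set_ylabel('Arousal')

    for i, (v, a) in enumerate(points):
        ax.scatter(v, a)
        if labels:
            ax.annotate(labels[i], (v, a), textcoords='offset points', xytext=(5, 5))

    plt.tight_layout()
    plt.show()


# Plot the single prediction from the examples above with its quadrant label
plot_valence_arousal([(valence, arousal)], [classify_emotion(valence, arousal)])
```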
## Model Details

- **Parameters**: ~86.8M
- **Model Size**: ~331 MB
- **Framework**: PyTorch
- **Base Architecture**: ViT-Base (12 layers, 768 hidden size, 12 attention heads)
- **Custom Head**: 3-layer MLP with GELU activations
- **Training Data**: Custom audio emotion dataset
- **Training**: Fine-tuned with MSE loss on valence-arousal targets

## Emotion Space

The model predicts emotions in the 2D circumplex model:

```
              High Arousal
                   |
   Angry   Tense   |   Excited
                   |
 Sad ------------- + ------------- Happy
                   |
           Bored   |   Calm   Content
                   |
               Low Arousal
```

- **Valence**: Negative (unpleasant) ↔ Positive (pleasant)
- **Arousal**: Low (calm) ↔ High (energetic)

## Performance

The model outputs continuous predictions that can be:

- Used directly for emotion intensity analysis
- Mapped to discrete emotion categories
- Visualized on emotion quadrant plots (see the plotting sketch under Usage)

## Limitations

- Trained on music/audio; performance may vary on speech
- Requires mel spectrogram preprocessing
- Fixed 30-second input window (longer audio is truncated to the first 30 seconds)
- Possible cultural bias inherited from the training data

## Citation

```bibtex
@misc{sentio-vit-emotion,
  title={Vision Transformer for Audio Emotion Recognition},
  author={SentioApp Team},
  year={2025},
  publisher={HuggingFace}
}
```

## License

MIT License