Emotion Sequence Transformer + BiLSTM MP478 Seq256
Overview
This repository provides a Transformer + BiLSTM emotion recognition model trained on sequences of facial landmarks extracted with MediaPipe. The model classifies human emotions into six categories: Angry, Disgust, Fear, Happy, Neutral, and Sad.
It processes temporal sequences of 256 frames per clip with 478 landmarks per frame, learning the dynamic patterns of human expression. The model is optimized for real-time emotion inference and can be used in applications such as sign language understanding and emotion-aware human-computer interaction.
Model Architecture
The model is built using Transformer layers for sequence modeling:
Input Layer:
- Accepts sequences of shape `(256, 478*3)`, corresponding to the 3D coordinates of 478 landmarks over 256 frames.
Transformer Encoder Layers:
- Capture temporal dependencies and dynamic patterns of human motion using self-attention mechanisms.
Fully Connected Layers:
- Transform the encoder outputs into probabilities for six emotion classes.
Output Layer:
- Softmax activation for multi-class emotion classification.
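The exact layer configuration is not published here; the following is a minimal PyTorch sketch of the described stack (input projection, Transformer encoder, BiLSTM, fully connected classifier). Hidden sizes such as `d_model`, `num_heads`, and `lstm_hidden` are illustrative assumptions, not the released model's configuration.

```python
import torch
import torch.nn as nn

class EmotionSequenceModel(nn.Module):
    """Illustrative Transformer encoder + BiLSTM emotion classifier.
    Hidden sizes are assumptions, not the released model's exact config."""

    def __init__(self, num_landmarks=478, coords=3,
                 d_model=256, num_heads=4, num_layers=4,
                 lstm_hidden=128, num_classes=6):
        super().__init__()
        in_features = num_landmarks * coords                  # 478 * 3 = 1434
        self.input_proj = nn.Linear(in_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.bilstm = nn.LSTM(d_model, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):                      # x: (batch, 256, 1434)
        h = self.input_proj(x)
        h = self.encoder(h)                    # self-attention over the 256 frames
        h, _ = self.bilstm(h)
        h = h.mean(dim=1)                      # temporal pooling
        return self.classifier(h)              # logits; apply softmax at inference

model = EmotionSequenceModel()
print(model(torch.randn(1, 256, 478 * 3)).shape)   # torch.Size([1, 6])
```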
Dataset
Custom MediaPipe Landmark Dataset
- Extracted from labeled video clips representing six emotions.
- Preprocessing includes normalization, sequence grouping (256 frames per clip), and balanced augmentation (a minimal grouping sketch follows this list).
- Total dataset is split into training, validation, and test sets.
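Sequence grouping means every clip is reduced or padded to exactly 256 frames. Below is a minimal sketch of that step; the repeat-last-frame padding strategy is an assumption about the pipeline, not a documented detail.

```python
import numpy as np

SEQ_LEN = 256  # frames per clip expected by the model

def to_fixed_length(frames: np.ndarray, seq_len: int = SEQ_LEN) -> np.ndarray:
    """Truncate or pad a (num_frames, 478*3) landmark array to seq_len frames.
    Padding repeats the last frame (assumed strategy)."""
    if len(frames) >= seq_len:
        return frames[:seq_len]
    pad = np.repeat(frames[-1:], seq_len - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)

clip = np.random.rand(300, 478 * 3)        # placeholder for extracted landmarks
print(to_fixed_length(clip).shape)         # (256, 1434)
```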
Training Configuration
| Parameter | Description |
|---|---|
| Architecture | Transformer |
| Sequence Length | 256 frames |
| Input Features | 478 landmarks Γ 3 coordinates |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Loss Function | CrossEntropyLoss |
| Batch Size | 32 |
| Epochs | 60 |
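A condensed training-loop sketch using the hyperparameters above. It reuses the illustrative `EmotionSequenceModel` from the architecture section and dummy tensors in place of the real dataset.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors stand in for the real landmark sequences and labels.
X = torch.randn(64, 256, 478 * 3)           # (clips, frames, landmark features)
y = torch.randint(0, 6, (64,))              # six emotion classes
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = EmotionSequenceModel()               # illustrative model sketched above
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(60):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```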
Performance Summary
| Metric | Score |
|---|---|
| Accuracy | 0.71 |
| Macro F1 | 0.70 |
| Weighted F1 | 0.71 |
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Angry | 0.78 | 0.63 | 0.70 | 139 |
| Disgust | 0.77 | 0.79 | 0.78 | 128 |
| Fear | 0.50 | 0.58 | 0.54 | 114 |
| Happy | 0.95 | 0.92 | 0.94 | 129 |
| Neutral | 0.61 | 0.81 | 0.69 | 101 |
| Sad | 0.64 | 0.52 | 0.58 | 134 |
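The report above (and the per-class AUC values shown in the next section) can be reproduced with scikit-learn from test-set predictions. A self-contained sketch using random stand-in predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad']

# Random stand-ins for the real test-set labels and model scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=200)
y_prob = rng.dirichlet(np.ones(6), size=200)        # per-class probabilities
y_pred = y_prob.argmax(axis=1)

print(classification_report(y_true, y_pred,
                            labels=list(range(6)), target_names=classes))
print(confusion_matrix(y_true, y_pred, labels=list(range(6))))

# One-vs-rest AUC per class, as plotted in the ROC figure.
y_true_bin = label_binarize(y_true, classes=list(range(6)))
print(roc_auc_score(y_true_bin, y_prob, average=None))
```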
Visualizations
Multi-Class ROC Curves (AUC per class)
Multi-class ROC curves with AUC values.
Confusion Matrix (Heatmap)
Confusion matrix heatmap on test set.
Model Files
| File | Description |
|---|---|
| `emotion_sequence_transformer_mp478_seq256.pt` | Original PyTorch Transformer model |
| `emotion_sequence_transformer_mp478_seq256_weights.pt` | Original PyTorch Transformer model weights |
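A hedged loading sketch for the files above. Whether the `.pt` file deserializes as a full module or only a state dict depends on how it was saved, so both options are shown; recent PyTorch versions may require `weights_only=False` for full-module pickles.

```python
import torch

# Option 1: the file contains a full pickled module (assumed).
model = torch.load("emotion_sequence_transformer_mp478_seq256.pt",
                   map_location="cpu", weights_only=False)
model.eval()

# Option 2: the *_weights.pt file contains a state dict to load into a
# compatible architecture that you define yourself.
# state = torch.load("emotion_sequence_transformer_mp478_seq256_weights.pt",
#                    map_location="cpu")
# my_model.load_state_dict(state)
```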
Usage and Preprocessing
To correctly use this model for prediction, you must first preprocess your video data using the provided assets for standardization and label encoding.
1. Preprocessing Assets
The necessary files for video preprocessing are stored in the assets/ folder of this repository:
| File Name | Purpose | Required for Step |
|---|---|---|
| `emotion_label_encoder.joblib` | Maps predicted indices back to human-readable emotion labels (e.g., 0 -> 'Happy'). | Post-Inference |
| `global_mean_tensor.pt` | Global mean tensor used to normalize the extracted MediaPipe features. | Preprocessing |
| `global_std_tensor.pt` | Global standard deviation tensor used to normalize the extracted MediaPipe features. | Preprocessing |
You must load the mean and standard deviation tensors to standardize your input feature sequences before feeding them into the Transformer + BiLSTM model.
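A condensed standardization-and-inference sketch, assuming the mean/std tensors broadcast over the feature dimension and that the label encoder is a scikit-learn `LabelEncoder`; the usage notebook in the next subsection remains the complete reference.

```python
import joblib
import torch

# Load the preprocessing assets from the assets/ folder.
mean = torch.load("assets/global_mean_tensor.pt")
std = torch.load("assets/global_std_tensor.pt")
label_encoder = joblib.load("assets/emotion_label_encoder.joblib")

# features: (256, 478*3) landmark sequence extracted with MediaPipe
features = torch.randn(256, 478 * 3)            # placeholder for real landmarks
features = (features - mean) / std              # standardize with global stats

model.eval()                                    # model loaded as shown earlier
with torch.no_grad():
    logits = model(features.unsqueeze(0))       # add batch dimension
    pred = logits.softmax(dim=-1).argmax(dim=-1)

print(label_encoder.inverse_transform(pred.numpy()))
```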
2. Complete Example
For a full, runnable demonstration showing how to load the model, use the assets for standardization, and run inference on a video, please refer to the usage notebook:
- Notebook: `emotion-sequence-transformer-bilstm-usage.ipynb`
This file provides the complete code necessary to replicate the deployment environment.
Key Features
- Real-time emotion recognition from MediaPipe landmarks
- Transformer-based sequence modeling for dynamic human motion
- Handles six primary emotion classes
Tags
emotion-recognition transformer sequential-data mediapipe human-emotion deep-learning pytorch torchscript affective-computing fine-tuning real-time
Author & Model Info
Author: P.S. Abewickrama Singhe
Developed with: PyTorch
License: Apache-2.0
Date: October 2025
Evaluation results
- Accuracy on the Optimized 478-Point 3D Facial Landmark Dataset (self-reported): 0.710
