Emotion Sequence Transformer + BiLSTM MP478 Seq256
Overview
This repository provides a Transformer + BiLSTM emotion recognition model trained on sequences of facial landmarks extracted with MediaPipe. The model classifies human emotions into six categories: Angry, Disgust, Fear, Happy, Neutral, and Sad.
It processes temporal sequences of 256 frames per clip with 478 landmarks per frame, learning the dynamic patterns of human expression. The model is optimized for real-time emotion inference and can be used in applications such as sign language understanding and emotion-aware human-computer interaction.
Model Architecture
The model is built using Transformer layers for sequence modeling:
Input Layer:
- Accepts sequences of shape `(256, 478*3)`, corresponding to the 3D coordinates of 478 landmarks over 256 frames.
Transformer Encoder Layers:
- Capture temporal dependencies and dynamic patterns of human motion using self-attention mechanisms.
Fully Connected Layers:
- Transform the encoder outputs into probabilities for six emotion classes.
Output Layer:
- Softmax activation for multi-class emotion classification.
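The exact layer configuration is not published here; the following is a minimal PyTorch sketch of the described stack (input projection, Transformer encoder, BiLSTM, fully connected classifier). Hidden sizes such as `d_model`, `num_heads`, and `lstm_hidden` are illustrative assumptions, not the released model's configuration.

```python
import torch
import torch.nn as nn

class EmotionSequenceModel(nn.Module):
    """Illustrative Transformer encoder + BiLSTM emotion classifier.
    Hidden sizes are assumptions, not the released model's exact config."""

    def __init__(self, num_landmarks=478, coords=3,
                 d_model=256, num_heads=4, num_layers=4,
                 lstm_hidden=128, num_classes=6):
        super().__init__()
        in_features = num_landmarks * coords                  # 478 * 3 = 1434
        self.input_proj = nn.Linear(in_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.bilstm = nn.LSTM(d_model, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):                      # x: (batch, 256, 1434)
        h = self.input_proj(x)
        h = self.encoder(h)                    # self-attention over the 256 frames
        h, _ = self.bilstm(h)
        h = h.mean(dim=1)                      # temporal pooling
        return self.classifier(h)              # logits; apply softmax at inference

model = EmotionSequenceModel()
print(model(torch.randn(1, 256, 478 * 3)).shape)   # torch.Size([1, 6])
```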
Dataset
Custom MediaPipe Landmark Dataset
- Extracted from labeled video clips representing six emotions.
- Preprocessing includes normalization, sequence grouping (256 frames per clip), and balanced augmentation (a minimal grouping sketch follows this list).
- Total dataset is split into training, validation, and test sets.
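Sequence grouping means every clip is reduced or padded to exactly 256 frames. Below is a minimal sketch of that step; the repeat-last-frame padding strategy is an assumption about the pipeline, not a documented detail.

```python
import numpy as np

SEQ_LEN = 256  # frames per clip expected by the model

def to_fixed_length(frames: np.ndarray, seq_len: int = SEQ_LEN) -> np.ndarray:
    """Truncate or pad a (num_frames, 478*3) landmark array to seq_len frames.
    Padding repeats the last frame (assumed strategy)."""
    if len(frames) >= seq_len:
        return frames[:seq_len]
    pad = np.repeat(frames[-1:], seq_len - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)

clip = np.random.rand(300, 478 * 3)        # placeholder for extracted landmarks
print(to_fixed_length(clip).shape)         # (256, 1434)
```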
Training Configuration
| Parameter | Description |
|---|---|
| Architecture | Transformer |
| Sequence Length | 256 frames |
| Input Features | 478 landmarks Γ 3 coordinates |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Loss Function | CrossEntropyLoss |
| Batch Size | 32 |
| Epochs | 60 |
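A condensed training-loop sketch using the hyperparameters above. It reuses the illustrative `EmotionSequenceModel` from the architecture section and dummy tensors in place of the real dataset.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors stand in for the real landmark sequences and labels.
X = torch.randn(64, 256, 478 * 3)           # (clips, frames, landmark features)
y = torch.randint(0, 6, (64,))              # six emotion classes
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = EmotionSequenceModel()               # illustrative model sketched above
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(60):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```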
Performance Summary
| Metric | Score |
|---|---|
| Accuracy | 0.71 |
| Macro F1 | 0.70 |
| Weighted F1 | 0.71 |
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Angry | 0.78 | 0.63 | 0.70 | 139 |
| Disgust | 0.77 | 0.79 | 0.78 | 128 |
| Fear | 0.50 | 0.58 | 0.54 | 114 |
| Happy | 0.95 | 0.92 | 0.94 | 129 |
| Neutral | 0.61 | 0.81 | 0.69 | 101 |
| Sad | 0.64 | 0.52 | 0.58 | 134 |
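The report above (and the per-class AUC values shown in the next section) can be reproduced with scikit-learn from test-set predictions. A self-contained sketch using random stand-in predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad']

# Random stand-ins for the real test-set labels and model scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=200)
y_prob = rng.dirichlet(np.ones(6), size=200)        # per-class probabilities
y_pred = y_prob.argmax(axis=1)

print(classification_report(y_true, y_pred,
                            labels=list(range(6)), target_names=classes))
print(confusion_matrix(y_true, y_pred, labels=list(range(6))))

# One-vs-rest AUC per class, as plotted in the ROC figure.
y_true_bin = label_binarize(y_true, classes=list(range(6)))
print(roc_auc_score(y_true_bin, y_prob, average=None))
```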
Visualizations
Multi-Class ROC Curves (AUC per class)
Multi-class ROC curves with AUC values.
Confusion Matrix (Heatmap)
Confusion matrix heatmap on test set.
Model Files
| File | Description |
|---|---|
| `emotion_sequence_transformer_mp478_seq256.pt` | Original PyTorch Transformer model |
| `emotion_sequence_transformer_mp478_seq256_weights.pt` | Original PyTorch Transformer model weights |
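A hedged loading sketch for the files above. Whether the `.pt` file deserializes as a full module or only a state dict depends on how it was saved, so both options are shown; recent PyTorch versions may require `weights_only=False` for full-module pickles.

```python
import torch

# Option 1: the file contains a full pickled module (assumed).
model = torch.load("emotion_sequence_transformer_mp478_seq256.pt",
                   map_location="cpu", weights_only=False)
model.eval()

# Option 2: the *_weights.pt file contains a state dict to load into a
# compatible architecture that you define yourself.
# state = torch.load("emotion_sequence_transformer_mp478_seq256_weights.pt",
#                    map_location="cpu")
# my_model.load_state_dict(state)
```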
Usage and Preprocessing
To correctly use this model for prediction, you must first preprocess your video data using the provided assets for standardization and label encoding.
1. Preprocessing Assets
The necessary files for video preprocessing are stored in the assets/ folder of this repository:
| File Name | Purpose | Required for Step |
|---|---|---|
| `emotion_label_encoder.joblib` | Maps predicted indices back to human-readable emotion labels (e.g., 0 -> 'Happy'). | Post-Inference |
| `global_mean_tensor.pt` | Global mean tensor used to normalize the extracted MediaPipe features. | Preprocessing |
| `global_std_tensor.pt` | Global standard deviation tensor used to normalize the extracted MediaPipe features. | Preprocessing |
You must load the mean and standard deviation tensors to standardize your input feature sequences before feeding them into the Transformer + BiLSTM model.
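A condensed standardization-and-inference sketch, assuming the mean/std tensors broadcast over the feature dimension and that the label encoder is a scikit-learn `LabelEncoder`; the usage notebook in the next subsection remains the complete reference.

```python
import joblib
import torch

# Load the preprocessing assets from the assets/ folder.
mean = torch.load("assets/global_mean_tensor.pt")
std = torch.load("assets/global_std_tensor.pt")
label_encoder = joblib.load("assets/emotion_label_encoder.joblib")

# features: (256, 478*3) landmark sequence extracted with MediaPipe
features = torch.randn(256, 478 * 3)            # placeholder for real landmarks
features = (features - mean) / std              # standardize with global stats

model.eval()                                    # model loaded as shown earlier
with torch.no_grad():
    logits = model(features.unsqueeze(0))       # add batch dimension
    pred = logits.softmax(dim=-1).argmax(dim=-1)

print(label_encoder.inverse_transform(pred.numpy()))
```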
2. Complete Example
For a full, runnable demonstration showing how to load the model, use the assets for standardization, and run inference on a video, please refer to the usage notebook:
- Notebook: `emotion-sequence-transformer-bilstm-usage.ipynb`
This file provides the complete code necessary to replicate the deployment environment.
Key Features
- Real-time emotion recognition from MediaPipe landmarks
- Transformer-based sequence modeling for dynamic human motion
- Handles six primary emotion classes
Tags
emotion-recognition transformer sequential-data mediapipe human-emotion deep-learning pytorch torchscript affective-computing fine-tuning real-time
Author & Model Info
Author: P.S. Abewickrama Singhe
Developed with: PyTorch
License: Apache-2.0
Date: October 2025
Evaluation results
- Accuracy on the Optimized 478-Point 3D Facial Landmark Dataset (self-reported): 0.710
