---
license: mit
tags:
- video-classification
- I3D
- action-recognition
- anomaly-detection
datasets:
- kinetics-400
- ucf-crime
model-index:
- name: i3d_ucf_finetuned
results:
- task:
type: video-classification
dataset:
name: UCF-Crime
type: ucf-crime
metrics:
- name: Validation Accuracy
type: accuracy
value: 0.6667
---
# I3D UCF Finetuned
## Model Description
This is a finetuned I3D (Inflated 3D ConvNet) model for video classification, based on the `i3d_r50` architecture from [PyTorchVideo](https://pytorchvideo.org/). The I3D model uses a ResNet-50 backbone inflated to 3D convolutions to capture both spatial and temporal features from videos. It was originally pretrained on the **Kinetics-400** dataset, which contains ~306,245 short videos across 400 human action classes (e.g., running, dancing, cooking).
The model was finetuned on the **UCF-Crime** dataset to classify videos into 8 specific categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. During finetuning, the final fully connected layer was modified to output 8 classes, and a Dropout layer (p=0.3) was added to reduce overfitting. The finetuned weights are stored in `i3d_ucf_finetuned.pth` (109 MB) and can be downloaded from this repository.
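For orientation, here is a minimal sketch of that modification: it loads the base `i3d_r50` backbone via `torch.hub`, swaps the Kinetics-400 head for an 8-class projection, and runs a dummy clip through it. The clip shape follows the preprocessing used in the Usage section below (32 RGB frames at 224×224); this is illustrative only, and the full loadable model lives in the Usage section.
```python
import torch
import torch.nn as nn

# Sketch only: mirrors the head modification described above.
# Assumes torch.hub can reach the PyTorchVideo repo.
backbone = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)

# Swap the 400-class Kinetics head for an 8-class projection.
backbone.blocks[6].proj = nn.Linear(2048, 8)
backbone.eval()

# I3D consumes clips shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 32, 224, 224)
with torch.no_grad():
    logits = backbone(clip)
print(logits.shape)  # torch.Size([1, 8])
```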
## Dataset
### Pretraining Dataset
- **Kinetics-400**: A large-scale dataset with ~306,245 videos covering 400 human action classes. It provides robust general features for video understanding, making it an excellent starting point for finetuning.
### Finetuning Dataset
- **UCF-Crime**: A dataset for anomaly detection in videos, containing ~1,900 videos (~1,610 for training, 290 for testing). The model was finetuned on a subset of UCF-Crime to classify videos into 8 categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`.
## Performance
The model was finetuned for 30 epochs. Below are the training and validation performance plots:
### Training and Validation Accuracy
![Training and Validation Accuracy](train_val_accuracy.jpg)
- **Best Validation Accuracy**: ~66.67% (achieved after finetuning on UCF-Crime).
- **Training Accuracy**: Reached ~81.03%.
### Training and Validation Loss
![Training and Validation Loss](train_val_loss.jpg)
- The training loss decreases steadily, while the validation loss fluctuates, suggesting there is still room to improve generalization.
## Usage
To use the model for video classification, you can load the weights from this repository using the following code:
```python
import torch
import torch.nn as nn
import cv2
import numpy as np
from huggingface_hub import hf_hub_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the model
def load_i3d_ucf_finetuned(repo_id="Ahmeddawood0001/i3d_ucf_finetuned",
                           filename="i3d_ucf_finetuned.pth"):
    class I3DClassifier(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            # i3d_r50 backbone from PyTorchVideo, pretrained on Kinetics-400
            self.i3d = torch.hub.load('facebookresearch/pytorchvideo',
                                      'i3d_r50', pretrained=True)
            self.dropout = nn.Dropout(0.3)
            # Replace the 400-class Kinetics head with an 8-class projection
            self.i3d.blocks[6].proj = nn.Linear(2048, num_classes)

        def forward(self, x):
            x = self.i3d(x)
            x = self.dropout(x)
            return x

    model = I3DClassifier(num_classes=8).to(device)
    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    # map_location lets the checkpoint load on CPU-only machines too
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()
    return model

# Define frame extraction function
def extract_frames(video_path, max_frames=32, frame_size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = cv2.resize(frame, frame_size)
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"No frames could be read from {video_path}")
    # Pad short videos by repeating the last frame
    while len(frames) < max_frames:
        frames.append(frames[-1])
    frames = np.stack(frames)  # (T, H, W, C)
    frames = torch.from_numpy(frames).float() / 255.0
    # (T, H, W, C) -> (C, T, H, W): the channel-first clip layout I3D expects
    frames = frames.permute(3, 0, 1, 2)
    return frames

# Define classification function
def classify_video(video_path, model, labels):
    frames = extract_frames(video_path)
    frames = frames.unsqueeze(0).to(device)  # add batch dim: (1, C, T, H, W)
    with torch.no_grad():
        outputs = model(frames)
        probabilities = torch.softmax(outputs, dim=1)
        predicted_idx = torch.argmax(probabilities, dim=1).item()
    predicted_label = labels[predicted_idx]
    confidence = probabilities[0, predicted_idx].item()
    return predicted_label, confidence

# Example usage
labels = ["arrest", "Explosion", "Fight", "normal", "roadaccidents", "shooting", "Stealing", "vandalism"]
model = load_i3d_ucf_finetuned()
video_path = "path/to/your/video.mp4"  # Replace with your video path
predicted_label, confidence = classify_video(video_path, model, labels)
print(f"Video: {video_path}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```