|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- video-classification |
|
|
- I3D |
|
|
- action-recognition |
|
|
- anomaly-detection |
|
|
datasets: |
|
|
- kinetics-400 |
|
|
- ucf-crime |
|
|
model-index:
- name: i3d_ucf_finetuned
  results:
  - task:
      type: video-classification
    dataset:
      name: UCF-Crime
      type: ucf-crime
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.6667
|
|
--- |
|
|
|
|
|
# I3D UCF Finetuned |
|
|
|
|
|
## Model Description |
|
|
This is a finetuned I3D (Inflated 3D ConvNet) model for video classification, based on the `i3d_r50` architecture from [PyTorchVideo](https://pytorchvideo.org/). The I3D model uses a ResNet-50 backbone inflated to 3D convolutions to capture both spatial and temporal features from videos. It was originally pretrained on the **Kinetics-400** dataset, which contains ~306,245 short videos across 400 human action classes (e.g., running, dancing, cooking). |
|
|
|
|
|
The model was finetuned on the **UCF-Crime** dataset to classify videos into 8 specific categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. During finetuning, the final fully connected layer was modified to output 8 classes, and a Dropout layer (p=0.3) was added to reduce overfitting. The finetuned weights are stored in `i3d_ucf_finetuned.pth` (109 MB) and can be downloaded from this repository. |
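For reference, the head replacement described above takes only a few lines. The sketch below is a minimal illustration of the finetuning setup, not the exact training script; the optimizer, learning rate, and loss function are assumptions, not the card's reported hyperparameters.

```python
import torch
import torch.nn as nn

# Load the Kinetics-400 pretrained backbone and swap in an 8-class head.
i3d = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)
i3d.blocks[6].proj = nn.Linear(2048, 8)  # final projection: 2048 -> 8 classes
dropout = nn.Dropout(0.3)                # applied to the logits, as in this card

# Assumed training settings -- not the card's reported hyperparameters.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(i3d.parameters(), lr=1e-4)

def train_step(clips, labels):
    """One finetuning step. `clips` is (B, C, T, H, W); `labels` is (B,)."""
    optimizer.zero_grad()
    logits = dropout(i3d(clips))
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```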
|
|
|
|
|
## Dataset |
|
|
### Pretraining Dataset |
|
|
- **Kinetics-400**: A large-scale dataset with ~306,245 videos covering 400 human action classes. It provides robust general features for video understanding, making it an excellent starting point for finetuning. |
|
|
|
|
|
### Finetuning Dataset |
|
|
- **UCF-Crime**: A dataset for anomaly detection in videos, containing ~1,900 videos (~1,610 for training, 290 for testing). The model was finetuned on a subset of UCF-Crime to classify videos into 8 categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. |
|
|
|
|
|
## Performance |
|
|
The model was finetuned for 30 epochs. Below are the training and validation performance plots: |
|
|
|
|
|
### Training and Validation Accuracy |
|
|
 |
|
|
|
|
|
- **Best Validation Accuracy**: ~66.67% (achieved after finetuning on UCF-Crime). |
|
|
- **Training Accuracy**: Reached ~81.03%.
|
|
|
|
|
### Training and Validation Loss |
|
|
 |
|
|
|
|
|
- The training loss decreases steadily, while the validation loss fluctuates, suggesting room to improve generalization (for example, with stronger regularization or data augmentation).
|
|
|
|
|
## Usage |
|
|
To use the model for video classification, load the weights from this repository with the code below. It requires `torch`, `opencv-python`, `numpy`, and `huggingface_hub`.
|
|
|
|
|
```python
import torch
import torch.nn as nn
import cv2
import numpy as np
from huggingface_hub import hf_hub_download


def load_i3d_ucf_finetuned(repo_id="Ahmeddawood0001/i3d_ucf_finetuned", filename="i3d_ucf_finetuned.pth"):
    """Download the finetuned weights from the Hub and build the model."""

    class I3DClassifier(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            # Kinetics-400 pretrained I3D with a ResNet-50 backbone.
            self.i3d = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)
            self.dropout = nn.Dropout(0.3)
            # Replace the final projection to output the 8 UCF-Crime classes.
            self.i3d.blocks[6].proj = nn.Linear(2048, num_classes)

        def forward(self, x):
            x = self.i3d(x)
            x = self.dropout(x)
            return x

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = I3DClassifier(num_classes=8).to(device)
    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    # map_location lets the CUDA-trained checkpoint load on CPU-only machines.
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()
    return model


def extract_frames(video_path, max_frames=32, frame_size=(224, 224)):
    """Read up to `max_frames` RGB frames and return a (C, T, H, W) float tensor."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = cv2.resize(frame, frame_size)
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # Pad short videos by repeating the last frame.
    while len(frames) < max_frames:
        frames.append(frames[-1])
    frames = np.stack(frames)
    # (T, H, W, C) -> (T, C, H, W), scaled to [0, 1].
    frames = torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0
    # (T, C, H, W) -> (C, T, H, W), the layout I3D expects.
    frames = frames.permute(1, 0, 2, 3)
    return frames


def classify_video(video_path, model, labels):
    """Return the predicted label and softmax confidence for one video."""
    device = next(model.parameters()).device
    frames = extract_frames(video_path)
    frames = frames.unsqueeze(0).to(device)  # add batch dimension: (1, C, T, H, W)
    with torch.no_grad():
        outputs = model(frames)
        probabilities = torch.softmax(outputs, dim=1)
        predicted_idx = torch.argmax(probabilities, dim=1).item()
    predicted_label = labels[predicted_idx]
    confidence = probabilities[0, predicted_idx].item()
    return predicted_label, confidence


# Example usage
labels = ["arrest", "Explosion", "Fight", "normal", "roadaccidents", "shooting", "Stealing", "vandalism"]
model = load_i3d_ucf_finetuned()

video_path = "path/to/your/video.mp4"  # Replace with your video path
predicted_label, confidence = classify_video(video_path, model, labels)
print(f"Video: {video_path}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```