SPDX-License-Identifier: Apache-2.0
""" SOTA Deepfake Detector - DFD Model card + inference utilities for Hugging Face repository: Arko007/deepfake-detector-dfd-sota
This file contains:
- A model card in Hugging Face model README format (YAML frontmatter + sections).
- Inference helper functions to:
  - load frames or an MP4 video,
  - extract 12 evenly-spaced frames,
  - preprocess using the CLIP image processor,
  - run a forward pass through the provided checkpoint and return logits,
  - return softmax probabilities and predicted class.
Notes:
- Architecture described: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input: 12 frames, 3x384x384, dtype=torch.float32 / BF16 where supported
- Labels: 0 = real, 1 = fake
- License: Apache-2.0
- This file is intended as a convenience reference for model consumers and deployers. """
SOTA Deepfake Detector - DFD
Model: SOTA Deepfake Detector - DFD
Repository: https://huggingface.co/Arko007/deepfake-detector-dfd-sota
Model Description
The "SOTA Deepfake Detector - DFD" is a spatiotemporal adaptation built on a frozen CLIP-ViT-Large backbone with lightweight spatiotemporal adapters inserted to learn temporal relationships across frames. The backbone parameters are frozen; only the adapters and final classification head are trainable.
- Architecture: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input resolution: 384×384 pixels
- Temporal frames: 12 frames per video
- Trainable parameters: 5,255,938 (≈1.7% of the 308M total; see the parameter-count sketch after this list)
- Precision used for training: BF16 (mixed precision)
- Framework: PyTorch + Transformers (Hugging Face)
- HF Hub repo: Arko007/deepfake-detector-dfd-sota
- Model files:
- pytorch_model.bin (1.28GB)
- config.json
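The trainable-parameter figure above can be sanity-checked by counting parameters with requires_grad set, as in the minimal sketch below (a sketch only, assuming a model object built like the DeepfakeDetector placeholder defined later in this file; the exact count depends on the adapter implementation shipped with the checkpoint):

from typing import Tuple
import torch

def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:
    # Return (trainable, total) parameter counts; with the CLIP backbone frozen,
    # only the adapters and the classification head contribute to "trainable".
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Hypothetical usage:
#   trainable, total = count_parameters(model)
#   print(f"trainable: {trainable:,} ({100 * trainable / total:.2f}% of {total:,})")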
Intended Use
This model is intended to classify short video clips (or frame sequences) as either:
- Label 0: Real videos
- Label 1: Deepfake/manipulated videos
Primary use-cases:
- Research on deepfake detection
- Benchmarking against other detectors
- Integration into content-moderation pipelines with caution
Not intended:
- Medical, legal, or other high-stakes decisions without human review.
- Use on domains/styles substantially different from the DFD dataset without further fine-tuning.
Training Data
Dataset: DFD (Deep Fake Detection)
- Total videos: 3,431
- Real videos: 363 (10.6%)
- Fake videos: 3,068 (89.4%)
- Preprocessing:
- 12 evenly-spaced frames extracted per video at 384×384 resolution (total frames: 41,172)
- Train/validation split: 90/10 stratified
- Class balancing during training: WeightedRandomSampler with inverse-frequency weights (sketched below)
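A minimal sketch of the inverse-frequency sampling mentioned above (labels follow the card's convention, 0 = real and 1 = fake; the toy label tensor below simply mirrors the 363/3,068 class counts and is not the actual training split):

import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 363 + [1] * 3068)      # illustrative labels matching the DFD class counts
class_counts = torch.bincount(labels)              # [num_real, num_fake]
class_weights = 1.0 / class_counts.float()         # inverse frequency per class
sample_weights = class_weights[labels]             # one weight per sample
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
# The sampler replaces shuffle=True in the DataLoader, e.g.:
#   loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)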
Training Configuration
- Batch size: 16 (effective 32 with gradient accumulation)
- Optimizer: AdamW (weight_decay=0.05)
- Learning rate: 5e-6 with cosine decay, 10% warmup
- Epochs: 12
- Loss: Cross-entropy with label smoothing = 0.1
- Gradient clipping: max_grad_norm = 1.0
- Sampling: WeightedRandomSampler to mitigate class imbalance
- Augmentation (training only): random horizontal flip, brightness jitter
- Hardware: NVIDIA L4 GPU (24GB VRAM)
- Random seed: 42 (all RNGs fixed for reproducibility)
Training speed: ~1.4 seconds per batch (batch_size=16) on the reported hardware.
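The optimizer, schedule, and loss above can be assembled roughly as follows (a sketch, not the original training script; get_cosine_schedule_with_warmup from transformers stands in for the cosine decay with 10% warmup, and the stand-in model and steps_per_epoch values are illustrative only):

import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(1024, 2)                   # stand-in for the trainable adapters + head
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-6, weight_decay=0.05)

steps_per_epoch = 193                              # e.g. ~3,088 training videos / batch_size 16
num_training_steps = 12 * steps_per_epoch          # 12 epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),    # 10% warmup
    num_training_steps=num_training_steps,
)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# Inside the training loop (accumulating 2 micro-batches gives an effective batch of 32):
#   loss = criterion(model(x), y) / 2
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(trainable_params, max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()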
Evaluation / Metrics
- Best validation accuracy: 84.88%
- Validation detection rates (approximate ranges due to the small real class):
  - Real class detection: ~47–55% (few real samples in validation)
  - Fake class detection: ~64–93%
- Training loss convergence (cross-entropy w/ label smoothing): 0.7097 → 0.6921 (12 epochs)
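If the per-class detection figures above are read as per-class recall (an interpretation, not stated explicitly in the card), they can be reproduced from validation predictions with a small helper such as this sketch:

import torch

def per_class_recall(preds: torch.Tensor, labels: torch.Tensor, num_classes: int = 2):
    # Fraction of samples of each class that the model labels correctly
    # (class 0 = real, class 1 = fake).
    rates = []
    for c in range(num_classes):
        mask = labels == c
        correct = (preds[mask] == c).sum().item()
        rates.append(correct / max(int(mask.sum()), 1))
    return rates

# Illustrative values only (not actual model outputs):
preds = torch.tensor([1, 1, 0, 1, 0, 1])
labels = torch.tensor([1, 1, 0, 0, 0, 1])
print(per_class_recall(preds, labels))             # [real detection rate, fake detection rate]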
Known Limitations
- Validation set is highly imbalanced (89% fake), which affects stability of metrics.
- Small number of real videos (363) limits generalization to unseen real samples.
- Model optimized for the DFD dataset; transfer to other deepfake types may require fine-tuning.
- Temporal context limited to 12 frames (approx. 0.4–1 s depending on FPS), so long-term artifacts may be missed.
Usage
Quick inference instructions (a compact end-to-end sketch follows this list):
- Load checkpoint: checkpoint = torch.load("best_model.pt", map_location="cpu")
- Extract model state: model_state = checkpoint["model_state_dict"]
- Initialize model: model = DeepfakeDetector(config)
- Load state dict: model.load_state_dict(model_state)
- Set to eval: model.eval()
- Inference: Pass a tensor of shape (batch_size, 12, 3, 384, 384) with dtype float32 (or bf16 where supported).
- Output:
- logits: (batch_size, 2)
- probabilities: softmax(logits)
- prediction: argmax(logits) -> 0=real, 1=fake
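Putting these steps together (a minimal sketch; DeepfakeDetector and its config must match the class used to produce the checkpoint, and the random tensor below stands in for a preprocessed 12-frame clip):

import torch
import torch.nn.functional as F

checkpoint = torch.load("best_model.pt", map_location="cpu")
model = DeepfakeDetector(config)                   # must be the exact training-time model class
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

frames = torch.randn(1, 12, 3, 384, 384, dtype=torch.float32)   # stand-in for a preprocessed clip
with torch.no_grad():
    logits = model(frames)                         # (1, 2)
probs = F.softmax(logits, dim=-1)
pred = int(torch.argmax(logits, dim=-1))           # 0 = real, 1 = fake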
How to Cite
If you use this model in your work, please cite the repository and include details about the DFD dataset and this specific model configuration.
"""
----------------------------
Inference code
----------------------------
The following is a compact, self-contained inference utility. It assumes:
- checkpoint is saved as 'best_model.pt'
- a config object or config.json for the model is available
- CLIP image processor is available via transformers
Important: This implementation is a minimal example to run inference. For
production, wrap in a robust server, add batching, async IO, error handling,
and pre-warming for BF16 on supported accelerators.
import os
import math
from typing import List, Tuple, Union, Dict

import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.io import read_video  # optional, requires torchvision installation
from transformers import CLIPImageProcessor, CLIPModel
----------------------------------------------------------------------
Model definition stub
----------------------------------------------------------------------
The real model used for training is a CLIP-ViT-Large backbone (frozen) with
spatiotemporal adapters and a small classification head. For inference the
exact architecture must match the checkpoint. Below is a minimal class
to demonstrate expected load / inference semantics. Replace this with the
model class used during training (and the one saved to the checkpoint).
class DeepfakeDetector(torch.nn.Module):
    """
    Minimal wrapper around CLIP backbone + temporal adapters + classification head.
NOTE: This is a lightweight placeholder for demonstration. Replace with the
exact model definition used during training to successfully load the
provided checkpoint (pytorch_model.bin / best_model.pt).
"""
def __init__(self, clip_model_name: str = "openai/clip-vit-large-patch14", num_frames: int = 12):
super().__init__()
self.num_frames = num_frames
# Load CLIP and freeze it (backbone frozen)
self.clip = CLIPModel.from_pretrained(clip_model_name, torch_dtype=torch.float32)
for p in self.clip.parameters():
p.requires_grad = False
# Spatiotemporal adapters & head (trainable)
# NOTE: The real training used custom adapters; here we provide a representative head.
        # pooler_output of the vision encoder has dimensionality vision_config.hidden_size
        embed_dim = self.clip.config.vision_config.hidden_size
self.adapter_pool = torch.nn.AdaptiveAvgPool1d(1) # placeholder
# small trainable head consistent with ~5.26M params total
self.classifier = torch.nn.Sequential(
torch.nn.Linear(embed_dim, 1024),
torch.nn.ReLU(),
torch.nn.Dropout(p=0.1),
torch.nn.Linear(1024, 2)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass.
x: Tensor of shape (batch, num_frames, 3, H, W)
Returns logits: (batch, 2)
"""
b, t, c, h, w = x.shape
# reshape to (b*t, c, h, w) to run through CLIP's visual encoder
xt = x.view(b * t, c, h, w)
        # CLIP's image encoder expects preprocessed pixel values (image processing is done externally).
        # Use clip.vision_model to obtain per-frame image embeddings.
        # Note: CLIP-ViT-Large/14 is pretrained at 224x224; feeding 384x384 frames requires
        # position-embedding interpolation (recent transformers versions expose an
        # interpolate_pos_encoding argument) or a backbone already adapted to 384x384.
        outputs = self.clip.vision_model(pixel_values=xt)
pooled = outputs.pooler_output # (b*t, embed_dim) or adjust per CLIP impl
# reshape back to (b, t, embed_dim) and aggregate temporally
embed_dim = pooled.shape[-1]
pooled = pooled.view(b, t, embed_dim) # (b, t, embed_dim)
# Simple temporal aggregation (real model uses adapters). For inference placeholder:
# mean across temporal dimension
video_repr = pooled.mean(dim=1) # (b, embed_dim)
logits = self.classifier(video_repr) # (b, 2)
return logits
----------------------------------------------------------------------
Helper utilities for frame extraction, preprocessing and inference
----------------------------------------------------------------------
def extract_evenly_spaced_frames_from_video(
video_path: str,
num_frames: int = 12,
target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
"""
Extract num_frames evenly spaced frames from a video file.
Returns a list of PIL Image objects resized to target_size.
    Requires torchvision's read_video; a fallback decoder (e.g. imageio-ffmpeg) is not
    implemented here and would need to be added if torchvision is unavailable.
"""
if not os.path.exists(video_path):
raise FileNotFoundError(f"Video file not found: {video_path}")
try:
# torchvision's read_video returns (frames, audio, info)
frames, _, info = read_video(video_path, pts_unit="sec")
# frames: (num_total_frames, H, W, C) uint8 tensor
total = frames.shape[0]
if total == 0:
raise RuntimeError("No frames extracted from video.")
indices = np.linspace(0, total - 1, num_frames, dtype=int)
pil_frames = []
for i in indices:
frame = frames[i].numpy()
img = Image.fromarray(frame)
img = img.convert("RGB").resize(target_size, resample=Image.BILINEAR)
pil_frames.append(img)
return pil_frames
    except Exception as exc:
        # Fallback decoding (e.g. via imageio-ffmpeg) is not implemented here.
        raise RuntimeError(
            "Video reading failed. Ensure torchvision is installed and supports read_video."
        ) from exc
def load_frames_from_folder(
    folder: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Load frames (PNG/JPG) from a folder. Picks num_frames evenly across available images.
    """
    files = sorted(
        [
            os.path.join(folder, f)
            for f in os.listdir(folder)
            if f.lower().endswith((".png", ".jpg", ".jpeg"))
        ]
    )
    if not files:
        raise FileNotFoundError(f"No image files found in folder: {folder}")
    total = len(files)
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        img = Image.open(files[i]).convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames
def preprocess_frames(
    pil_frames: List[Image.Image],
    processor: CLIPImageProcessor,
    device: Union[str, torch.device] = "cpu",
) -> torch.Tensor:
    """
    Given a list of PIL images (len == 12), apply the CLIP image processor and
    return a tensor shaped (1, 12, 3, 384, 384) suitable for the model.
    """
    # The processor batch-processes a list of images and returns a dict whose
    # pixel_values has shape (num_images, 3, H, W). Note: the default
    # CLIPImageProcessor for clip-vit-large-patch14 resizes to 224x224; if the
    # checkpoint expects 384x384 inputs, override the processor's size accordingly.
    proc = processor(images=pil_frames, return_tensors="pt")
    pixel_values = proc["pixel_values"]  # (num_images, 3, H, W)
    pixel_values = pixel_values.to(device)
    # reshape to (1, num_frames, 3, H, W)
    pixel_values = pixel_values.unsqueeze(0) if pixel_values.ndim == 4 else pixel_values
    pixel_values = pixel_values.view(
        1, len(pil_frames), 3, pixel_values.shape[-2], pixel_values.shape[-1]
    )
    return pixel_values
def apply_augmentations_train(pil_frames: List[Image.Image]) -> List[Image.Image]:
    """
    Apply training augmentations: horizontal flip (random) and brightness jitter.
    Called only during the training data pipeline; included here for completeness.
    """
    aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2),
    ])
    return [aug(img) for img in pil_frames]
----------------------------------------------------------------------
High level inference API
----------------------------------------------------------------------
def load_model_from_checkpoint(
    checkpoint_path: str,
    clip_model_name: str = "openai/clip-vit-large-patch14",
    device: Union[str, torch.device] = "cpu",
) -> Tuple[torch.nn.Module, CLIPImageProcessor]:
    """
    Load model and CLIP image processor.
checkpoint should be a dict with 'model_state_dict' key (per model card instructions).
"""
device = torch.device(device)
# Instantiate model skeleton
model = DeepfakeDetector(clip_model_name=clip_model_name, num_frames=12)
# Load checkpoint
ckpt = torch.load(checkpoint_path, map_location="cpu")
if "model_state_dict" in ckpt:
state_dict = ckpt["model_state_dict"]
else:
# assume full state dict saved directly
state_dict = ckpt
# Try to load, allowing for missing keys if adapter names differ
try:
model.load_state_dict(state_dict)
except Exception as e:
# Provide a clearer error for mismatch: user must use exact model implementation
raise RuntimeError(f"Failed to load state dict into DeepfakeDetector: {e}")
model.to(device)
model.eval()
# Load CLIP image processor (matches model's backbone preprocessing)
processor = CLIPImageProcessor.from_pretrained(clip_model_name)
return model, processor
def predict_from_frames(
    model: torch.nn.Module,
    processor: CLIPImageProcessor,
    pil_frames: List[Image.Image],
    device: Union[str, torch.device] = "cpu",
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Perform inference on a single sample (list of 12 PIL frames).
    Returns a dict with:
    - logits (torch.Tensor shape (1, 2))
    - probabilities (list of 2 floats)
    - prediction (int, 0 or 1)
    - confidence (float, probability of the predicted class)
    """
    device = torch.device(device)
    x = preprocess_frames(pil_frames, processor, device=device)  # (1, 12, 3, 384, 384)
    # optionally run in bf16 on supported hardware (cast both input and model to avoid a dtype mismatch)
    if use_bf16 and device.type != "cpu":
        x = x.to(dtype=torch.bfloat16)
        model = model.to(device=device, dtype=torch.bfloat16)
    else:
        x = x.to(dtype=torch.float32)
        model = model.to(device)
with torch.no_grad():
logits = model(x) # (1,2)
probs = F.softmax(logits, dim=-1)
probs_list = probs.squeeze(0).cpu().tolist()
pred = int(torch.argmax(logits, dim=-1).squeeze().item())
confidence = float(probs.squeeze(0)[pred].cpu().item())
return {
"logits": logits.cpu(),
"probabilities": probs_list,
"prediction": pred,
"confidence": confidence,
}
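# Example (a sketch, not part of the original API above): running inference on a folder of
# pre-extracted frames instead of an MP4, reusing the helpers defined in this file.
# The folder path is hypothetical.
#
#   model, processor = load_model_from_checkpoint("best_model.pt", device="cuda")
#   frames = load_frames_from_folder("frames/clip_0001", num_frames=12, target_size=(384, 384))
#   result = predict_from_frames(model, processor, frames, device="cuda", use_bf16=True)
#   print(result["prediction"], result["confidence"])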
----------------------------------------------------------------------
Convenience function: video file -> prediction
----------------------------------------------------------------------
def predict_from_video_file(
    checkpoint_path: str,
    video_path: str,
    device: Union[str, torch.device] = "cpu",
    clip_model_name: str = "openai/clip-vit-large-patch14",
    num_frames: int = 12,
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Load model from checkpoint, extract frames from video, and return prediction.
    """
    model, processor = load_model_from_checkpoint(
        checkpoint_path, clip_model_name=clip_model_name, device=device
    )
    pil_frames = extract_evenly_spaced_frames_from_video(
        video_path, num_frames=num_frames, target_size=(384, 384)
    )
    return predict_from_frames(model, processor, pil_frames, device=device, use_bf16=use_bf16)
----------------------------------------------------------------------
Example usage (commented)
----------------------------------------------------------------------
if __name__ == "__main__":
# Example: run inference on a single MP4
checkpoint = "best_model.pt"
video = "example.mp4"
result = predict_from_video_file(checkpoint, video, device="cuda", use_bf16=True)
print("Logits:", result["logits"])
print("Probabilities:", result["probabilities"])
print("Prediction (0=real,1=fake):", result["prediction"], "confidence:", result["confidence"])
----------------------------------------------------------------------
End of file
----------------------------------------------------------------------