SPDX-License-Identifier: Apache-2.0

""" SOTA Deepfake Detector - DFD Model card + inference utilities for Hugging Face repository: Arko007/deepfake-detector-dfd-sota

This file contains:

  • A model card in Hugging Face model README format (YAML frontmatter + sections).
  • Inference helper functions to:
    • load frames or an MP4 video,
    • extract 12 evenly-spaced frames,
    • preprocess using the CLIP image processor,
    • run a forward pass through the provided checkpoint and return logits,
    • return softmax probabilities and predicted class.

Notes:

  • Architecture described: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
  • Input: 12 frames, 3x384x384, dtype=torch.float32 / BF16 where supported
  • Labels: 0 = real, 1 = fake
  • License: Apache-2.0
  • This file is intended as a convenience reference for model consumers and deployers. """

SOTA Deepfake Detector - DFD

Model: SOTA Deepfake Detector - DFD
Repository: https://huggingface.co/Arko007/deepfake-detector-dfd-sota

Model Description

The "SOTA Deepfake Detector - DFD" is a spatiotemporal adaptation built on a frozen CLIP-ViT-Large backbone with lightweight spatiotemporal adapters inserted to learn temporal relationships across frames. The backbone parameters are frozen; only the adapters and final classification head are trainable.

  • Architecture: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
  • Input resolution: 384×384 pixels
  • Temporal frames: 12 frames per video
  • Trainable parameters: 5,255,938 (≈1.7% of ~308M total; see the parameter-count check after this list)
  • Precision used for training: BF16 (mixed precision)
  • Framework: PyTorch + Transformers (Hugging Face)
  • HF Hub repo: Arko007/deepfake-detector-dfd-sota
  • Model files:
    • pytorch_model.bin (1.28GB)
    • config.json
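
For reference, the trainable fraction above can be verified once the model is instantiated. The helper below is a hedged sketch (count_parameters is an illustrative utility, not part of the released code):

import torch

def count_parameters(model: torch.nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Expected for this checkpoint: about 5.26M trainable out of ~308M total (≈1.7%).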

Intended Use

This model is intended to classify short video clips (or frame sequences) as either:

  • Label 0: Real videos
  • Label 1: Deepfake/manipulated videos

Primary use-cases:

  • Research on deepfake detection
  • Benchmarking against other detectors
  • Integration into content-moderation pipelines with caution

Not intended:

  • Medical, legal, or other high-stakes decisions without human review.
  • Use on domains/styles substantially different from the DFD dataset without further fine-tuning.

Training Data

Dataset: DFD (Deep Fake Detection)

  • Total videos: 3,431
    • Real videos: 363 (10.6%)
    • Fake videos: 3,068 (89.4%)
  • Preprocessing:
    • 12 evenly-spaced frames extracted per video at 384×384 resolution (total frames: 41,172)
  • Train/validation split: 90/10 stratified
  • Class balancing during training: WeightedRandomSampler (inverse frequency weights)
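
A minimal sketch of the class-balancing step, assuming a list of integer labels (0 = real, 1 = fake); the released training script may differ in details:

from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels: list[int]) -> WeightedRandomSampler:
    """Inverse-frequency sampling so the minority (real) class is drawn as often as the fake class."""
    counts = Counter(labels)                      # e.g. {0: 363, 1: 3068} for the DFD split
    class_weights = {c: 1.0 / n for c, n in counts.items()}
    sample_weights = torch.tensor([class_weights[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)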

Training Configuration

  • Batch size: 16 (effective 32 with gradient accumulation)
  • Optimizer: AdamW (weight_decay=0.05)
  • Learning rate: 5e-6 with cosine decay, 10% warmup
  • Epochs: 12
  • Loss: Cross-entropy with label smoothing = 0.1
  • Gradient clipping: max_grad_norm = 1.0
  • Sampling: WeightedRandomSampler to mitigate class imbalance
  • Augmentation (training only): random horizontal flip, brightness jitter
  • Hardware: NVIDIA L4 GPU (24GB VRAM)
  • Random seed: 42 (all RNGs fixed for reproducibility)

Training speed: ~1.4 seconds per batch (batch_size=16) on the reported hardware.
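
As a hedged sketch of how the configuration above maps to PyTorch primitives (the released training script is not included here; build_optimizer_and_scheduler is an illustrative name):

import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model: torch.nn.Module, total_steps: int):
    """AdamW over trainable params, cosine decay with 10% linear warmup, CE with label smoothing."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=5e-6, weight_decay=0.05)
    warmup_steps = int(0.1 * total_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion

# In the training loop, gradients would also be clipped:
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)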

Evaluation / Metrics

  • Best validation accuracy: 84.88%
  • Validation detection (approx. ranges due to small real class):
    • Real class detection: ~47–55% (low number of real samples)
    • Fake class detection: ~64–93%
  • Training loss convergence (cross-entropy w/ label smoothing): 0.7097 → 0.6921 (12 epochs)

Known Limitations

  • Validation set is highly imbalanced (89% fake), which affects stability of metrics.
  • Small number of real videos (363) limits generalization to unseen real samples.
  • Model optimized for the DFD dataset; transfer to other deepfake types may require fine-tuning.
  • Temporal context limited to 12 frames (approx. 0.4–1 s depending on FPS), so long-term artifacts may be missed.

Usage

Quick inference instructions:

  1. Load checkpoint: checkpoint = torch.load("best_model.pt", map_location="cpu")
  2. Extract model state: model_state = checkpoint["model_state_dict"]
  3. Initialize model: model = DeepfakeDetector(config)
  4. Load state dict: model.load_state_dict(model_state)
  5. Set to eval: model.eval()
  6. Inference: Pass a tensor of shape (batch_size, 12, 3, 384, 384) with dtype float32 (or bf16 where supported).
  7. Output:
    • logits: (batch_size, 2)
    • probabilities: softmax(logits)
    • prediction: argmax(logits) -> 0=real, 1=fake
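
Put together, the steps above look roughly like the following (assuming the training-time DeepfakeDetector class and its config are available; see the inference utilities further down for a runnable variant):

import torch
import torch.nn.functional as F

checkpoint = torch.load("best_model.pt", map_location="cpu")
model = DeepfakeDetector(config)                      # the exact training-time class/config
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

frames = torch.randn(1, 12, 3, 384, 384)              # (batch, frames, C, H, W), float32
with torch.no_grad():
    logits = model(frames)                            # (1, 2)
probs = F.softmax(logits, dim=-1)
pred = int(probs.argmax(dim=-1).item())               # 0 = real, 1 = fake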

How to Cite

If you use this model in your work, please cite the repository and include details about the DFD dataset and this specific model configuration.

"""

----------------------------

Inference code

----------------------------

The following is a compact, self-contained inference utility. It assumes:

- the checkpoint is saved as 'best_model.pt'
- a config object or config.json for the model is available
- the CLIP image processor is available via transformers

Important: this implementation is a minimal example for running inference. For production, wrap it in a robust server and add batching, async IO, error handling, and pre-warming for BF16 on supported accelerators.

import os
import math
from typing import Dict, List, Tuple, Union

import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.io import read_video  # optional; requires torchvision with video support
from transformers import CLIPImageProcessor, CLIPModel

----------------------------------------------------------------------

Model definition stub

----------------------------------------------------------------------

The real model used for training is a CLIP-ViT-Large backbone (frozen) with spatiotemporal adapters and a small classification head. For inference, the exact architecture must match the checkpoint. Below is a minimal class to demonstrate the expected load/inference semantics. Replace it with the model class used during training (the one saved to the checkpoint).

class DeepfakeDetector(torch.nn.Module):
    """
    Minimal wrapper around a CLIP backbone + temporal adapters + classification head.

    NOTE: This is a lightweight placeholder for demonstration. Replace with the
    exact model definition used during training to successfully load the
    provided checkpoint (pytorch_model.bin / best_model.pt).
    """

    def __init__(self, clip_model_name: str = "openai/clip-vit-large-patch14", num_frames: int = 12):
        super().__init__()
        self.num_frames = num_frames
        # Load CLIP and freeze it (backbone frozen)
        self.clip = CLIPModel.from_pretrained(clip_model_name, torch_dtype=torch.float32)
        for p in self.clip.parameters():
            p.requires_grad = False

        # Spatiotemporal adapters & head (trainable)
        # NOTE: The real training used custom adapters; here we provide a representative head.
        # The classifier operates on the vision tower's pooled output, so its width is the
        # vision hidden size (1024 for CLIP-ViT-Large), not the projection dim.
        embed_dim = self.clip.config.vision_config.hidden_size
        self.adapter_pool = torch.nn.AdaptiveAvgPool1d(1)  # placeholder
        # small trainable head roughly consistent with ~5.26M trainable params
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, 1024),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.1),
            torch.nn.Linear(1024, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        x: Tensor of shape (batch, num_frames, 3, H, W)
        Returns logits: (batch, 2)
        """
        b, t, c, h, w = x.shape
        # reshape to (b*t, c, h, w) to run through CLIP's visual encoder
        xt = x.view(b * t, c, h, w)

        # CLIP expects pixel values that were preprocessed externally (CLIPImageProcessor).
        # Use clip.vision_model to obtain per-frame image embeddings.
        outputs = self.clip.vision_model(pixel_values=xt)
        pooled = outputs.pooler_output  # (b*t, embed_dim)

        # reshape back to (b, t, embed_dim) and aggregate temporally
        embed_dim = pooled.shape[-1]
        pooled = pooled.view(b, t, embed_dim)  # (b, t, embed_dim)

        # Simple temporal aggregation (the real model uses adapters). Placeholder: mean over time.
        video_repr = pooled.mean(dim=1)  # (b, embed_dim)

        logits = self.classifier(video_repr)  # (b, 2)
        return logits
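
For a quick sanity check of the placeholder class above, the following sketch (a hypothetical helper, not part of the released code) runs random frames through the forward pass and verifies the output shape. It uses 224×224 inputs, the native resolution of openai/clip-vit-large-patch14; the released checkpoint expects 12 frames at 384×384 per its own configuration, so this only verifies tensor plumbing.

def _smoke_test_placeholder() -> None:
    """Shape check for the placeholder DeepfakeDetector (not the released checkpoint)."""
    model = DeepfakeDetector()
    model.eval()
    # 224x224 matches the public CLIP weights; the real checkpoint uses 384x384.
    dummy = torch.randn(2, 12, 3, 224, 224)
    with torch.no_grad():
        logits = model(dummy)
    assert logits.shape == (2, 2)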

----------------------------------------------------------------------

Helper utilities for frame extraction, preprocessing and inference

----------------------------------------------------------------------

def extract_evenly_spaced_frames_from_video(
    video_path: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Extract num_frames evenly spaced frames from a video file.

    Returns a list of PIL Image objects resized to target_size.
    Requires torchvision's read_video; if decoding fails, a RuntimeError is raised
    (a fallback via imageio-ffmpeg is left to the caller).
    """
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")

    try:
        # torchvision's read_video returns (frames, audio, info)
        frames, _, info = read_video(video_path, pts_unit="sec")
        # frames: (num_total_frames, H, W, C) uint8 tensor
        total = frames.shape[0]
        if total == 0:
            raise RuntimeError("No frames extracted from video.")
        indices = np.linspace(0, total - 1, num_frames, dtype=int)
        pil_frames = []
        for i in indices:
            frame = frames[i].numpy()
            img = Image.fromarray(frame)
            img = img.convert("RGB").resize(target_size, resample=Image.BILINEAR)
            pil_frames.append(img)
        return pil_frames
    except Exception as exc:
        # Fallback: use imageio-ffmpeg or another decoder (not implemented here)
        raise RuntimeError(
            "Video reading failed. Ensure torchvision is installed with video support (read_video)."
        ) from exc

def load_frames_from_folder(
    folder: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Load frames (PNG/JPG) from a folder, picking num_frames evenly across the available images.
    """
    files = sorted(
        os.path.join(folder, f)
        for f in os.listdir(folder)
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    )
    if not files:
        raise FileNotFoundError(f"No image files found in folder: {folder}")
    total = len(files)
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        img = Image.open(files[i]).convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames

def preprocess_frames(
    pil_frames: List[Image.Image],
    processor: CLIPImageProcessor,
    device: Union[str, torch.device] = "cpu",
) -> torch.Tensor:
    """
    Given a list of PIL images (len == num_frames), apply the CLIP image processor and
    return a tensor shaped (1, num_frames, 3, H, W) suitable for the model.

    Note: the processor's configured resolution must match the model's expected input
    (384x384 for this checkpoint); adjust the processor's size if it differs.
    """
    # The processor batch-processes a list of images and returns a dict whose
    # pixel_values has shape (num_images, 3, H, W).
    proc = processor(images=pil_frames, return_tensors="pt")
    pixel_values = proc["pixel_values"].to(device)  # (num_images, 3, H, W)
    # reshape to (1, num_frames, 3, H, W)
    pixel_values = pixel_values.view(
        1, len(pil_frames), 3, pixel_values.shape[-2], pixel_values.shape[-1]
    )
    return pixel_values

def apply_augmentations_train(pil_frames: List[Image.Image]) -> List[Image.Image]:
    """
    Apply training augmentations: random horizontal flip and brightness jitter.
    Used only in the training data pipeline; included here for completeness.
    """
    aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2),
    ])
    return [aug(img) for img in pil_frames]

----------------------------------------------------------------------

High level inference API

----------------------------------------------------------------------

def load_model_from_checkpoint(
    checkpoint_path: str,
    clip_model_name: str = "openai/clip-vit-large-patch14",
    device: Union[str, torch.device] = "cpu",
) -> Tuple[torch.nn.Module, CLIPImageProcessor]:
    """
    Load the model and the CLIP image processor.

    The checkpoint should be a dict with a 'model_state_dict' key (per the model card
    instructions); a bare state dict is also accepted.
    """
    device = torch.device(device)
    # Instantiate model skeleton
    model = DeepfakeDetector(clip_model_name=clip_model_name, num_frames=12)

    # Load checkpoint
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    if "model_state_dict" in ckpt:
        state_dict = ckpt["model_state_dict"]
    else:
        # assume the full state dict was saved directly
        state_dict = ckpt

    # Try to load; adapter names must match the training-time model definition
    try:
        model.load_state_dict(state_dict)
    except Exception as e:
        # Provide a clearer error on mismatch: the exact model implementation is required
        raise RuntimeError(f"Failed to load state dict into DeepfakeDetector: {e}") from e

    model.to(device)
    model.eval()

    # Load the CLIP image processor (matches the backbone's preprocessing)
    processor = CLIPImageProcessor.from_pretrained(clip_model_name)
    return model, processor

def predict_from_frames(
    model: torch.nn.Module,
    processor: CLIPImageProcessor,
    pil_frames: List[Image.Image],
    device: Union[str, torch.device] = "cpu",
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Perform inference on a single sample (a list of 12 PIL frames).

    Returns a dict with:
      - logits (torch.Tensor of shape (1, 2))
      - probabilities (list of 2 floats)
      - prediction (int, 0 or 1)
      - confidence (float, probability of the predicted class)
    """
    device = torch.device(device)
    x = preprocess_frames(pil_frames, processor, device=device)  # (1, 12, 3, H, W)
    # optionally run in bf16 on supported accelerators; cast inputs and model together
    # so dtypes match inside the backbone
    if use_bf16 and device.type != "cpu":
        x = x.to(dtype=torch.bfloat16)
        model = model.to(device=device, dtype=torch.bfloat16)
    else:
        x = x.to(dtype=torch.float32)
        model = model.to(device)

    with torch.no_grad():
        logits = model(x)  # (1, 2)
        probs = F.softmax(logits, dim=-1)
        probs_list = probs.squeeze(0).cpu().tolist()
        pred = int(torch.argmax(logits, dim=-1).squeeze().item())
        confidence = float(probs.squeeze(0)[pred].cpu().item())

    return {
        "logits": logits.cpu(),
        "probabilities": probs_list,
        "prediction": pred,
        "confidence": confidence,
    }
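
As a usage sketch for pre-extracted frames (the paths and the CUDA device are placeholder assumptions):

model, processor = load_model_from_checkpoint("best_model.pt", device="cuda")
frames = load_frames_from_folder("frames/clip_0001", num_frames=12)
result = predict_from_frames(model, processor, frames, device="cuda")
print(result["prediction"], result["confidence"])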

----------------------------------------------------------------------

Convenience function: video file -> prediction

----------------------------------------------------------------------

def predict_from_video_file(
    checkpoint_path: str,
    video_path: str,
    device: Union[str, torch.device] = "cpu",
    clip_model_name: str = "openai/clip-vit-large-patch14",
    num_frames: int = 12,
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Load the model from a checkpoint, extract frames from the video, and return a prediction.
    """
    model, processor = load_model_from_checkpoint(
        checkpoint_path, clip_model_name=clip_model_name, device=device
    )
    pil_frames = extract_evenly_spaced_frames_from_video(
        video_path, num_frames=num_frames, target_size=(384, 384)
    )
    return predict_from_frames(model, processor, pil_frames, device=device, use_bf16=use_bf16)

----------------------------------------------------------------------

Example usage (commented)

----------------------------------------------------------------------

if name == "main":

# Example: run inference on a single MP4

checkpoint = "best_model.pt"

video = "example.mp4"

result = predict_from_video_file(checkpoint, video, device="cuda", use_bf16=True)

print("Logits:", result["logits"])

print("Probabilities:", result["probabilities"])

print("Prediction (0=real,1=fake):", result["prediction"], "confidence:", result["confidence"])

----------------------------------------------------------------------

End of file

----------------------------------------------------------------------
