WAN 2.5 FP16 - Image-to-Video Generation Model

Version: v1.4
Precision: FP16 (16-bit floating point)
Model Family: WAN (Video Generation)
Task: Image-to-Video Generation

Model Description

WAN 2.5 Image-to-Video (I2V) is a state-of-the-art diffusion model capable of generating high-quality video sequences from static images. This FP16 version provides a balance between model quality and computational efficiency, making it suitable for systems with moderate GPU resources.

Key Capabilities

  • Image-to-Video Generation: Animate static images into coherent video sequences
  • Temporal Coherence: Produces smooth, temporally consistent video frames
  • Motion Control: Advanced control over motion dynamics and camera movements
  • Lighting Preservation: Maintains lighting consistency from source image
  • Quality Enhancement: Support for LoRA adapters for improved output quality
  • Efficient Inference: FP16 precision reduces memory footprint while maintaining quality

Model Architecture

  • Diffusion Framework: Latent diffusion-based video generation
  • Conditioning: Image-conditioned video synthesis
  • Precision: FP16 (half-precision floating point)
  • Format: SafeTensors (secure, efficient format)
  • VAE: Variational Autoencoder for latent space encoding/decoding

Repository Contents

Status: Repository structure prepared for model files (currently empty).

Current Directory Structure

wan25-fp16-i2v/
├── diffusion_models/
│   └── wan/                    # Empty - awaiting model download
├── README.md                   # This file (15 KB)
└── (model files to be added)

Expected Model Files (After Download)

The repository is organized to store WAN 2.5 FP16 I2V model files once downloaded from Hugging Face:

Core Model Files (to be placed in diffusion_models/wan/):

  • wan_2.5_i2v_fp16.safetensors - Main UNet diffusion model for video generation (~8-12 GB)
  • wan_vae_fp16.safetensors - VAE for encoding/decoding video frames (~1-2 GB)
  • image_encoder.safetensors - CLIP/VAE image encoder for conditioning (~1-2 GB)
  • config.json - Model architecture configuration and hyperparameters (~5-10 KB)

Optional LoRA Adapters (to be placed in loras/ directory if downloaded):

  • motion_control_lora.safetensors - Fine-grained motion dynamics control (~100-500 MB)
  • camera_control_lora.safetensors - Camera movement and perspective control (~100-500 MB)
  • quality_enhancement_lora.safetensors - Output quality improvements (~100-500 MB)

Total Repository Size:

  • Current: ~15 KB (documentation only)
  • After Model Download: 10-15 GB (core model) + 0.3-1.5 GB (optional LoRAs)
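
Once the files have been downloaded, a quick check can confirm that the core files listed above are present and roughly the expected size. This is a minimal sketch; the filenames follow the expected names above and may differ in the actual release.

import os

# Expected core model files (names taken from the list above; the
# published release may use different filenames)
model_dir = "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan"
expected_files = [
    "wan_2.5_i2v_fp16.safetensors",
    "wan_vae_fp16.safetensors",
    "image_encoder.safetensors",
    "config.json",
]

for name in expected_files:
    path = os.path.join(model_dir, name)
    if os.path.exists(path):
        size_gb = os.path.getsize(path) / (1024 ** 3)
        print(f"Found {name} ({size_gb:.2f} GB)")
    else:
        print(f"Missing {name}")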

Download Instructions

To populate this repository with model files:

# Install Hugging Face CLI
pip install huggingface-hub

# Download WAN 2.5 FP16 I2V model (requires HF authentication)
huggingface-cli login
huggingface-cli download Wan/WAN-2.5-I2V --revision fp16 --local-dir "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan"
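
Alternatively, the same download can be scripted with the huggingface_hub Python API. The repository id and revision below are the same assumed values used in the CLI command above.

from huggingface_hub import snapshot_download

# Download the assumed repo/revision into the local model directory
snapshot_download(
    repo_id="Wan/WAN-2.5-I2V",      # assumed repository id (see CLI example above)
    revision="fp16",                 # assumed FP16 revision
    local_dir="E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
)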

Hardware Requirements

Minimum Requirements (FP16)

  • GPU: NVIDIA RTX 3090 (24 GB VRAM) or AMD equivalent
  • System RAM: 32 GB
  • Disk Space: 20 GB free space
  • CUDA: 11.8 or higher (for NVIDIA GPUs)

Recommended Requirements

  • GPU: NVIDIA RTX 4090 (24 GB VRAM) or A5000/A6000
  • System RAM: 64 GB
  • Disk Space: 30 GB free space (for model + output cache)
  • CUDA: 12.1 or higher

Performance Expectations

  • Short Videos (2-4 seconds): ~30-60 seconds generation time
  • Medium Videos (5-10 seconds): ~1-3 minutes generation time
  • Long Videos (10-15 seconds): ~3-5 minutes generation time

Generation times vary with resolution, frame rate, and the number of sampling steps.
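
Before starting a long generation run, it can help to confirm that the GPU meets the 24 GB VRAM minimum listed above. A minimal check using PyTorch:

import torch

# Report total VRAM on the default CUDA device and compare it against
# the 24 GB minimum recommended for FP16 inference
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 24:
        print("Warning: below the recommended 24 GB minimum for FP16 I2V")
else:
    print("No CUDA device detected")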

Usage Examples

Installation

# Install dependencies
pip install diffusers transformers accelerate safetensors torch torchvision pillow

Basic Image-to-Video Generation

import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the model
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipeline.to("cuda")

# Enable memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Load source image
image = Image.open("input_image.jpg")

# Generate video from image
prompt = "Add gentle camera pan and natural motion"
video = pipeline(
    image=image,                    # Source image
    prompt=prompt,                  # Optional motion guidance
    num_frames=64,                  # Number of frames to generate
    height=512,                     # Video height
    width=512,                      # Video width
    num_inference_steps=50,         # Sampling steps (higher = better quality)
    guidance_scale=7.5,             # Prompt adherence (higher = closer to prompt)
    image_guidance_scale=1.0        # Image fidelity (higher = closer to source)
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)

Advanced Generation with Motion Control

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Load model with LoRA support
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Load LoRA adapters for enhanced control
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras",
                           weight_name="motion_control_lora.safetensors",
                           adapter_name="motion_control")
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras",
                           weight_name="camera_control_lora.safetensors",
                           adapter_name="camera_control")

# Enable adapters with specific weights
pipeline.set_adapters(["motion_control", "camera_control"], adapter_weights=[0.8, 0.7])

# Load source image
image = Image.open("landscape.jpg")

# Generate with enhanced control
prompt = "Smooth dolly forward, subtle parallax, cinematic motion"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=96,                  # More frames for longer video
    height=768,                     # Higher resolution
    width=768,
    num_inference_steps=75,         # More steps for quality
    guidance_scale=8.0,
    image_guidance_scale=1.2        # Strong image fidelity
).frames

export_to_video(video, "enhanced_output.mp4", fps=12)

Memory-Efficient Generation

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Enable all memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()
pipeline.enable_sequential_cpu_offload()  # Offload to CPU when not in use

# Load and resize image for efficiency
image = Image.open("photo.jpg")
image = image.resize((512, 512))

# Generate with reduced memory footprint
prompt = "Subtle natural motion and breathing life into the scene"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=48,                  # Fewer frames for memory efficiency
    height=512,
    width=512,
    num_inference_steps=30,         # Fewer steps for faster generation
    guidance_scale=7.0
).frames

export_to_video(video, "memory_efficient_output.mp4", fps=8)

Batch Processing Multiple Images

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image
import os

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Process multiple images
input_dir = "E:/input_images"
output_dir = "E:/output_videos"
os.makedirs(output_dir, exist_ok=True)

for img_file in os.listdir(input_dir):
    if img_file.endswith(('.jpg', '.png', '.jpeg')):
        # Load image
        image = Image.open(os.path.join(input_dir, img_file))

        # Generate video
        video = pipeline(
            image=image,
            prompt="Cinematic motion, natural dynamics",
            num_frames=64,
            height=512,
            width=512,
            num_inference_steps=40
        ).frames

        # Save with matching name
        output_path = os.path.join(output_dir, f"{os.path.splitext(img_file)[0]}.mp4")
        export_to_video(video, output_path, fps=8)
        print(f"Generated: {output_path}")

Model Specifications

Technical Details

  • Model Type: Latent Diffusion (Image-to-Video)
  • Precision: FP16 (16-bit)
  • Format: SafeTensors
  • Max Frames: 96-128
  • Resolution: 512x512 to 1024x1024
  • Image Encoder: CLIP/VAE-based
  • VAE Channels: 4 (latent)
  • Sampling Schedulers: DDPM, DDIM, DPM-Solver++

Supported Features

  • ✅ Image-to-video generation
  • ✅ Motion dynamics control
  • ✅ Camera movement control
  • ✅ Prompt-guided motion
  • ✅ Image fidelity preservation
  • ✅ LoRA adapter support
  • ✅ Memory optimization techniques
  • ✅ Batch processing
  • ✅ Custom sampling schedulers
  • ✅ Frame interpolation support

Limitations

  • ⚠️ Video length limited by VRAM (typically 2-15 seconds)
  • ⚠️ Requires significant GPU memory (24 GB minimum recommended)
  • ⚠️ Generation time increases with frame count and resolution
  • ⚠️ Complex motions may require higher sampling steps for coherence
  • ⚠️ Source image quality directly affects output quality
  • ⚠️ Very high contrast or unusual images may produce artifacts

Performance Tips and Optimization

Memory Optimization

  1. Enable Attention Slicing: Reduces VRAM usage at slight speed cost

    pipeline.enable_attention_slicing()
    
  2. Enable VAE Slicing: Processes VAE in smaller chunks

    pipeline.enable_vae_slicing()
    
  3. CPU Offloading: Move model components to CPU when not in use

    pipeline.enable_sequential_cpu_offload()
    
  4. Reduce Resolution: Start with 512x512 for testing, upscale later

  5. Resize Source Images: Preprocess images to target resolution

    image = image.resize((512, 512), Image.LANCZOS)
    

Quality Optimization

  1. Increase Inference Steps: 50-100 steps for higher quality (slower)
  2. Adjust Guidance Scales:
    • guidance_scale: 7.0-9.0 for prompt adherence
    • image_guidance_scale: 1.0-1.5 for image fidelity
  3. Use LoRA Adapters: Enhance motion, camera, and quality aspects
  4. Frame Interpolation: Generate fewer frames, interpolate with RIFE/FILM
  5. High-Quality Source Images: Use clean, well-lit source images

Speed Optimization

  1. Reduce Inference Steps: 20-30 steps for faster generation (lower quality)
  2. Lower Resolution: 512x512 generates 4x faster than 1024x1024
  3. Fewer Frames: Generate 48-64 frames instead of 96-128
  4. Use DPM-Solver++: Faster sampling scheduler

    from diffusers import DPMSolverMultistepScheduler
    pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
    

Prompt Engineering Tips

  • Describe Motion: "gentle pan", "slow zoom", "subtle motion"
  • Camera Movements: "dolly in", "crane up", "orbit around"
  • Motion Quality: "smooth", "cinematic", "natural dynamics"
  • Avoid Contradictions: Keep motion descriptions coherent
  • Optional Prompts: Prompts guide motion; can be empty for automatic motion
  • Scene Context: Reference elements in the source image
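
For reference, a few illustrative prompts that combine these descriptors (not tuned or validated against the model):

# Illustrative prompts built from the descriptors above
prompts = [
    "gentle camera pan, subtle motion, natural dynamics",
    "slow zoom in, smooth cinematic motion",
    "dolly in, subtle parallax, consistent lighting",
]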

Source Image Best Practices

  • Resolution: Use images at or near target video resolution
  • Quality: High-quality, well-exposed images work best
  • Composition: Well-composed images produce better results
  • Lighting: Consistent lighting makes animation more coherent
  • Subject Matter: Clear subjects with defined edges animate better
  • Avoid: Very blurry, low-resolution, or extremely dark images
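
A small preprocessing helper can apply several of these recommendations before generation. This is one possible approach (not part of the model API), resizing to the target resolution and converting to RGB:

from PIL import Image

def prepare_source_image(path, size=(512, 512)):
    # Convert to RGB (drops alpha/palette modes) and resize with a
    # high-quality filter to the target video resolution
    image = Image.open(path).convert("RGB")
    return image.resize(size, Image.LANCZOS)

image = prepare_source_image("photo.jpg")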

License

This model is released under a custom WAN license. Please review the license terms before use.

Usage Restrictions

  • ✅ Research and non-commercial use permitted
  • ✅ Educational and academic use permitted
  • ⚠️ Commercial use may require separate licensing
  • ❌ Do not use for generating harmful, misleading, or illegal content
  • ❌ Do not use for deepfakes or impersonation without consent
  • ❌ Respect copyright and intellectual property rights of source images

Please refer to the official WAN model documentation for complete license terms.

Citation

If you use this model in your research or projects, please cite:

@misc{wan25-i2v-fp16,
  title={WAN 2.5: Image-to-Video Diffusion Model},
  author={WAN Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan/WAN-2.5-I2V}},
  note={FP16 variant}
}

Resources and Links

Related Models

  • WAN 2.5 Text-to-Video: Text-conditioned video generation variant
  • WAN 2.5 FP8: More memory-efficient variant (lower precision)
  • WAN 2.5 Full: Full precision variant (higher quality, more VRAM)
  • FLUX.1: Alternative text-to-image models in this repository

Version History

v1.4 (2025-10-28)

  • Verified YAML frontmatter compliance with HuggingFace requirements
  • Confirmed repository structure documentation accuracy
  • Validated metadata fields (license, library_name, pipeline_tag, tags)
  • Repository remains prepared for model file downloads

v1.3 (2025-10-14)

  • CRITICAL FIX: Corrected pipeline_tag from text-to-video to image-to-video
  • Updated all documentation to reflect Image-to-Video (I2V) functionality
  • Revised usage examples for image-conditioned generation
  • Added image_guidance_scale parameter documentation
  • Updated tags to include image-to-video
  • Added source image best practices section
  • Corrected model file naming conventions for I2V variant

v1.2 (2025-10-14)

  • Simplified YAML frontmatter to essential fields only per requirements
  • Removed base_model and base_model_relation (base model, not derived)
  • Streamlined tags for better discoverability
  • Verified directory structure (still awaiting model download)

v1.1 (2025-10-14)

  • Updated YAML frontmatter to be first in file
  • Corrected repository contents to reflect actual directory state
  • Added download instructions for model files
  • Clarified that model files are pending download
  • Moved version comment after YAML frontmatter per HuggingFace standards

v1.0 (2025-10-13)

  • Created repository structure
  • Documented expected model files and usage
  • Provided comprehensive usage examples
  • Included hardware requirements and optimization tips

Contact and Contributions

For questions, issues, or contributions related to this repository organization:

  • Local repository maintained for personal use
  • See official WAN model repository for model-specific issues
  • Refer to Hugging Face documentation for diffusers library support

Repository Maintained By: Local User
Last Updated: 2025-10-28
README Version: v1.4
