WAN 2.5 FP16 - Image-to-Video Generation Model

Version: v1.4
Precision: FP16 (16-bit floating point)
Model Family: WAN (Video Generation)
Task: Image-to-Video Generation

Model Description

WAN 2.5 Image-to-Video (I2V) is a state-of-the-art diffusion model capable of generating high-quality video sequences from static images. This FP16 version provides a balance between model quality and computational efficiency, making it suitable for systems with moderate GPU resources.

Key Capabilities

  • Image-to-Video Generation: Animate static images into coherent video sequences
  • Temporal Coherence: Produces smooth, temporally consistent video frames
  • Motion Control: Advanced control over motion dynamics and camera movements
  • Lighting Preservation: Maintains lighting consistency from source image
  • Quality Enhancement: Support for LoRA adapters for improved output quality
  • Efficient Inference: FP16 precision reduces memory footprint while maintaining quality

Model Architecture

  • Diffusion Framework: Latent diffusion-based video generation
  • Conditioning: Image-conditioned video synthesis
  • Precision: FP16 (half-precision floating point)
  • Format: SafeTensors (secure, efficient format)
  • VAE: Variational Autoencoder for latent space encoding/decoding

Repository Contents

Status: Repository structure prepared for model files (currently empty).

Current Directory Structure

wan25-fp16-i2v/
├── diffusion_models/
│   └── wan/                    # Empty - awaiting model download
├── README.md                   # This file (15 KB)
└── (model files to be added)

Expected Model Files (After Download)

The repository is organized to store WAN 2.5 FP16 I2V model files once downloaded from Hugging Face:

Core Model Files (to be placed in diffusion_models/wan/):

  • wan_2.5_i2v_fp16.safetensors - Main UNet diffusion model for video generation (~8-12 GB)
  • wan_vae_fp16.safetensors - VAE for encoding/decoding video frames (~1-2 GB)
  • image_encoder.safetensors - CLIP/VAE image encoder for conditioning (~1-2 GB)
  • config.json - Model architecture configuration and hyperparameters (~5-10 KB)

Optional LoRA Adapters (to be placed in loras/ directory if downloaded):

  • motion_control_lora.safetensors - Fine-grained motion dynamics control (~100-500 MB)
  • camera_control_lora.safetensors - Camera movement and perspective control (~100-500 MB)
  • quality_enhancement_lora.safetensors - Output quality improvements (~100-500 MB)

Total Repository Size:

  • Current: ~15 KB (documentation only)
  • After Model Download: 10-15 GB (core model) + 0.3-1.5 GB (optional LoRAs)
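
Once the files have been downloaded, a quick check can confirm that the core files listed above are present and roughly the expected size. This is a minimal sketch; the filenames follow the expected names above and may differ in the actual release.

import os

# Expected core model files (names taken from the list above; the
# published release may use different filenames)
model_dir = "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan"
expected_files = [
    "wan_2.5_i2v_fp16.safetensors",
    "wan_vae_fp16.safetensors",
    "image_encoder.safetensors",
    "config.json",
]

for name in expected_files:
    path = os.path.join(model_dir, name)
    if os.path.exists(path):
        size_gb = os.path.getsize(path) / (1024 ** 3)
        print(f"Found {name} ({size_gb:.2f} GB)")
    else:
        print(f"Missing {name}")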

Download Instructions

To populate this repository with model files:

# Install Hugging Face CLI
pip install huggingface-hub

# Download WAN 2.5 FP16 I2V model (requires HF authentication)
huggingface-cli login
huggingface-cli download Wan/WAN-2.5-I2V --revision fp16 --local-dir "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan"
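
Alternatively, the same download can be scripted with the huggingface_hub Python API. The repository id and revision below are the same assumed values used in the CLI command above.

from huggingface_hub import snapshot_download

# Download the assumed repo/revision into the local model directory
snapshot_download(
    repo_id="Wan/WAN-2.5-I2V",      # assumed repository id (see CLI example above)
    revision="fp16",                 # assumed FP16 revision
    local_dir="E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
)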

Hardware Requirements

Minimum Requirements (FP16)

  • GPU: NVIDIA RTX 3090 (24 GB VRAM) or AMD equivalent
  • System RAM: 32 GB
  • Disk Space: 20 GB free space
  • CUDA: 11.8 or higher (for NVIDIA GPUs)

Recommended Requirements

  • GPU: NVIDIA RTX 4090 (24 GB VRAM) or A5000/A6000
  • System RAM: 64 GB
  • Disk Space: 30 GB free space (for model + output cache)
  • CUDA: 12.1 or higher

Performance Expectations

  • Short Videos (2-4 seconds): ~30-60 seconds generation time
  • Medium Videos (5-10 seconds): ~1-3 minutes generation time
  • Long Videos (10-15 seconds): ~3-5 minutes generation time

Generation times vary with resolution, frame rate, and the number of sampling steps.
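
Before starting a long generation run, it can help to confirm that the GPU meets the 24 GB VRAM minimum listed above. A minimal check using PyTorch:

import torch

# Report total VRAM on the default CUDA device and compare it against
# the 24 GB minimum recommended for FP16 inference
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 24:
        print("Warning: below the recommended 24 GB minimum for FP16 I2V")
else:
    print("No CUDA device detected")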

Usage Examples

Installation

# Install dependencies
pip install diffusers transformers accelerate safetensors torch torchvision pillow

Basic Image-to-Video Generation

import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the model
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipeline.to("cuda")

# Enable memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Load source image
image = Image.open("input_image.jpg")

# Generate video from image
prompt = "Add gentle camera pan and natural motion"
video = pipeline(
    image=image,                    # Source image
    prompt=prompt,                  # Optional motion guidance
    num_frames=64,                  # Number of frames to generate
    height=512,                     # Video height
    width=512,                      # Video width
    num_inference_steps=50,         # Sampling steps (higher = better quality)
    guidance_scale=7.5,             # Prompt adherence (higher = closer to prompt)
    image_guidance_scale=1.0        # Image fidelity (higher = closer to source)
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)

Advanced Generation with Motion Control

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Load model with LoRA support
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Load LoRA adapters for enhanced control
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras",
                           weight_name="motion_control_lora.safetensors",
                           adapter_name="motion_control")
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras",
                           weight_name="camera_control_lora.safetensors",
                           adapter_name="camera_control")

# Enable adapters with specific weights
pipeline.set_adapters(["motion_control", "camera_control"], adapter_weights=[0.8, 0.7])

# Load source image
image = Image.open("landscape.jpg")

# Generate with enhanced control
prompt = "Smooth dolly forward, subtle parallax, cinematic motion"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=96,                  # More frames for longer video
    height=768,                     # Higher resolution
    width=768,
    num_inference_steps=75,         # More steps for quality
    guidance_scale=8.0,
    image_guidance_scale=1.2        # Strong image fidelity
).frames

export_to_video(video, "enhanced_output.mp4", fps=12)

Memory-Efficient Generation

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Enable all memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()
pipeline.enable_sequential_cpu_offload()  # Offload to CPU when not in use

# Load and resize image for efficiency
image = Image.open("photo.jpg")
image = image.resize((512, 512))

# Generate with reduced memory footprint
prompt = "Subtle natural motion and breathing life into the scene"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=48,                  # Fewer frames for memory efficiency
    height=512,
    width=512,
    num_inference_steps=30,         # Fewer steps for faster generation
    guidance_scale=7.0
).frames

export_to_video(video, "memory_efficient_output.mp4", fps=8)

Batch Processing Multiple Images

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image
import os

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Process multiple images
input_dir = "E:/input_images"
output_dir = "E:/output_videos"
os.makedirs(output_dir, exist_ok=True)

for img_file in os.listdir(input_dir):
    if img_file.endswith(('.jpg', '.png', '.jpeg')):
        # Load image
        image = Image.open(os.path.join(input_dir, img_file))

        # Generate video
        video = pipeline(
            image=image,
            prompt="Cinematic motion, natural dynamics",
            num_frames=64,
            height=512,
            width=512,
            num_inference_steps=40
        ).frames

        # Save with matching name
        output_path = os.path.join(output_dir, f"{os.path.splitext(img_file)[0]}.mp4")
        export_to_video(video, output_path, fps=8)
        print(f"Generated: {output_path}")

Model Specifications

Technical Details

  • Model Type: Latent Diffusion (Image-to-Video)
  • Precision: FP16 (16-bit)
  • Format: SafeTensors
  • Max Frames: 96-128
  • Resolution: 512x512 to 1024x1024
  • Image Encoder: CLIP/VAE-based
  • VAE Channels: 4 (latent)
  • Sampling Schedulers: DDPM, DDIM, DPM-Solver++

Supported Features

  • ✅ Image-to-video generation
  • ✅ Motion dynamics control
  • ✅ Camera movement control
  • ✅ Prompt-guided motion
  • ✅ Image fidelity preservation
  • ✅ LoRA adapter support
  • ✅ Memory optimization techniques
  • ✅ Batch processing
  • ✅ Custom sampling schedulers
  • ✅ Frame interpolation support

Limitations

  • ⚠️ Video length limited by VRAM (typically 2-15 seconds)
  • ⚠️ Requires significant GPU memory (24 GB minimum recommended)
  • ⚠️ Generation time increases with frame count and resolution
  • ⚠️ Complex motions may require higher sampling steps for coherence
  • ⚠️ Source image quality directly affects output quality
  • ⚠️ Very high contrast or unusual images may produce artifacts

Performance Tips and Optimization

Memory Optimization

  1. Enable Attention Slicing: Reduces VRAM usage at slight speed cost

    pipeline.enable_attention_slicing()
    
  2. Enable VAE Slicing: Processes VAE in smaller chunks

    pipeline.enable_vae_slicing()
    
  3. CPU Offloading: Move model components to CPU when not in use

    pipeline.enable_sequential_cpu_offload()
    
  4. Reduce Resolution: Start with 512x512 for testing, upscale later

  5. Resize Source Images: Preprocess images to target resolution

    image = image.resize((512, 512), Image.LANCZOS)
    

Quality Optimization

  1. Increase Inference Steps: 50-100 steps for higher quality (slower)
  2. Adjust Guidance Scales:
    • guidance_scale: 7.0-9.0 for prompt adherence
    • image_guidance_scale: 1.0-1.5 for image fidelity
  3. Use LoRA Adapters: Enhance motion, camera, and quality aspects
  4. Frame Interpolation: Generate fewer frames, interpolate with RIFE/FILM
  5. High-Quality Source Images: Use clean, well-lit source images

Speed Optimization

  1. Reduce Inference Steps: 20-30 steps for faster generation (lower quality)
  2. Lower Resolution: 512x512 generates 4x faster than 1024x1024
  3. Fewer Frames: Generate 48-64 frames instead of 96-128
  4. Use DPM-Solver++: Faster sampling scheduler

    from diffusers import DPMSolverMultistepScheduler
    pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
    

Prompt Engineering Tips

  • Describe Motion: "gentle pan", "slow zoom", "subtle motion"
  • Camera Movements: "dolly in", "crane up", "orbit around"
  • Motion Quality: "smooth", "cinematic", "natural dynamics"
  • Avoid Contradictions: Keep motion descriptions coherent
  • Optional Prompts: Prompts guide motion; can be empty for automatic motion
  • Scene Context: Reference elements in the source image
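
For reference, a few illustrative prompts that combine these descriptors (not tuned or validated against the model):

# Illustrative prompts built from the descriptors above
prompts = [
    "gentle camera pan, subtle motion, natural dynamics",
    "slow zoom in, smooth cinematic motion",
    "dolly in, subtle parallax, consistent lighting",
]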

Source Image Best Practices

  • Resolution: Use images at or near target video resolution
  • Quality: High-quality, well-exposed images work best
  • Composition: Well-composed images produce better results
  • Lighting: Consistent lighting makes animation more coherent
  • Subject Matter: Clear subjects with defined edges animate better
  • Avoid: Very blurry, low-resolution, or extremely dark images
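
A small preprocessing helper can apply several of these recommendations before generation. This is one possible approach (not part of the model API), resizing to the target resolution and converting to RGB:

from PIL import Image

def prepare_source_image(path, size=(512, 512)):
    # Convert to RGB (drops alpha/palette modes) and resize with a
    # high-quality filter to the target video resolution
    image = Image.open(path).convert("RGB")
    return image.resize(size, Image.LANCZOS)

image = prepare_source_image("photo.jpg")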

License

This model is released under a custom WAN license. Please review the license terms before use.

Usage Restrictions

  • ✅ Research and non-commercial use permitted
  • ✅ Educational and academic use permitted
  • ⚠️ Commercial use may require separate licensing
  • ❌ Do not use for generating harmful, misleading, or illegal content
  • ❌ Do not use for deepfakes or impersonation without consent
  • ❌ Respect copyright and intellectual property rights of source images

Please refer to the official WAN model documentation for complete license terms.

Citation

If you use this model in your research or projects, please cite:

@misc{wan25-i2v-fp16,
  title={WAN 2.5: Image-to-Video Diffusion Model},
  author={WAN Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan/WAN-2.5-I2V}},
  note={FP16 variant}
}

Resources and Links

Related Models

  • WAN 2.5 Text-to-Video: Text-conditioned video generation variant
  • WAN 2.5 FP8: More memory-efficient variant (lower precision)
  • WAN 2.5 Full: Full precision variant (higher quality, more VRAM)
  • FLUX.1: Alternative text-to-image models in this repository

Version History

v1.4 (2025-10-28)

  • Verified YAML frontmatter compliance with HuggingFace requirements
  • Confirmed repository structure documentation accuracy
  • Validated metadata fields (license, library_name, pipeline_tag, tags)
  • Repository remains prepared for model file downloads

v1.3 (2025-10-14)

  • CRITICAL FIX: Corrected pipeline_tag from text-to-video to image-to-video
  • Updated all documentation to reflect Image-to-Video (I2V) functionality
  • Revised usage examples for image-conditioned generation
  • Added image_guidance_scale parameter documentation
  • Updated tags to include image-to-video
  • Added source image best practices section
  • Corrected model file naming conventions for I2V variant

v1.2 (2025-10-14)

  • Simplified YAML frontmatter to essential fields only per requirements
  • Removed base_model and base_model_relation (base model, not derived)
  • Streamlined tags for better discoverability
  • Verified directory structure (still awaiting model download)

v1.1 (2025-10-14)

  • Updated YAML frontmatter to be first in file
  • Corrected repository contents to reflect actual directory state
  • Added download instructions for model files
  • Clarified that model files are pending download
  • Moved version comment after YAML frontmatter per HuggingFace standards

v1.0 (2025-10-13)

  • Created repository structure
  • Documented expected model files and usage
  • Provided comprehensive usage examples
  • Included hardware requirements and optimization tips

Contact and Contributions

For questions, issues, or contributions related to this repository organization:

  • Local repository maintained for personal use
  • See official WAN model repository for model-specific issues
  • Refer to Hugging Face documentation for diffusers library support

Repository Maintained By: Local User
Last Updated: 2025-10-28
README Version: v1.4
