---
license: other
library_name: diffusers
pipeline_tag: image-to-video
tags:
  - wan
  - image-to-video
  - video-generation
---

<!-- README Version: v1.4 -->

# WAN 2.5 FP16 - Image-to-Video Generation Model

**Version**: v1.4
**Precision**: FP16 (16-bit floating point)
**Model Family**: WAN (Video Generation)
**Task**: Image-to-Video Generation

## Model Description

WAN 2.5 Image-to-Video (I2V) is a state-of-the-art diffusion model capable of generating high-quality video sequences from static images. This FP16 version provides a balance between model quality and computational efficiency, making it suitable for systems with moderate GPU resources.

### Key Capabilities

- **Image-to-Video Generation**: Animate static images into coherent video sequences
- **Temporal Coherence**: Produces smooth, temporally consistent video frames
- **Motion Control**: Advanced control over motion dynamics and camera movements
- **Lighting Preservation**: Maintains lighting consistency from source image
- **Quality Enhancement**: Support for LoRA adapters for improved output quality
- **Efficient Inference**: FP16 precision reduces memory footprint while maintaining quality

### Model Architecture

- **Diffusion Framework**: Latent diffusion-based video generation
- **Conditioning**: Image-conditioned video synthesis
- **Precision**: FP16 (half-precision floating point)
- **Format**: SafeTensors (secure, efficient format)
- **VAE**: Variational Autoencoder for latent space encoding/decoding

## Repository Contents

**Status**: Repository structure prepared for model files (currently empty).

### Current Directory Structure

```
wan25-fp16-i2v/
├── diffusion_models/
│   └── wan/                    # Empty - awaiting model download
├── README.md                   # This file (15 KB)
└── (model files to be added)
```

### Expected Model Files (After Download)

The repository is organized to store WAN 2.5 FP16 I2V model files once downloaded from Hugging Face:

**Core Model Files** (to be placed in `diffusion_models/wan/`):
- `wan_2.5_i2v_fp16.safetensors` - Main UNet diffusion model for video generation (~8-12 GB)
- `wan_vae_fp16.safetensors` - VAE for encoding/decoding video frames (~1-2 GB)
- `image_encoder.safetensors` - CLIP/VAE image encoder for conditioning (~1-2 GB)
- `config.json` - Model architecture configuration and hyperparameters (~5-10 KB)

**Optional LoRA Adapters** (to be placed in `loras/` directory if downloaded):
- `motion_control_lora.safetensors` - Fine-grained motion dynamics control (~100-500 MB)
- `camera_control_lora.safetensors` - Camera movement and perspective control (~100-500 MB)
- `quality_enhancement_lora.safetensors` - Output quality improvements (~100-500 MB)

**Total Repository Size**:
- Current: ~15 KB (documentation only)
- After Model Download: 10-15 GB (core model) + 0.3-1.5 GB (optional LoRAs)

### Download Instructions

To populate this repository with model files:

```bash
# Install Hugging Face CLI
pip install huggingface-hub

# Download WAN 2.5 FP16 I2V model (requires HF authentication)
huggingface-cli login
huggingface-cli download Wan/WAN-2.5-I2V --revision fp16 --local-dir "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan"
```

## Hardware Requirements

### Minimum Requirements (FP16)
- **GPU**: NVIDIA RTX 3090 (24 GB VRAM) or AMD equivalent
- **System RAM**: 32 GB
- **Disk Space**: 20 GB free space
- **CUDA**: 11.8 or higher (for NVIDIA GPUs)

### Recommended Requirements
- **GPU**: NVIDIA RTX 4090 (24 GB VRAM) or A5000/A6000
- **System RAM**: 64 GB
- **Disk Space**: 30 GB free space (for model + output cache)
- **CUDA**: 12.1 or higher

### Performance Expectations
- **Short Videos (2-4 seconds)**: ~30-60 seconds generation time
- **Medium Videos (5-10 seconds)**: ~1-3 minutes generation time
- **Long Videos (10-15 seconds)**: ~3-5 minutes generation time

*Generation times vary based on resolution, frame rate, and sampling steps*

## Usage Examples

### Installation

```bash
# Install dependencies
pip install diffusers transformers accelerate safetensors torch torchvision pillow
```

### Basic Image-to-Video Generation

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the model
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipeline.to("cuda")

# Enable memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Load source image
image = Image.open("input_image.jpg")

# Generate video from image
prompt = "Add gentle camera pan and natural motion"
video = pipeline(
    image=image,                    # Source image
    prompt=prompt,                  # Optional motion guidance
    num_frames=64,                  # Number of frames to generate
    height=512,                     # Video height
    width=512,                      # Video width
    num_inference_steps=50,         # Sampling steps (higher = better quality)
    guidance_scale=7.5,             # Prompt adherence (higher = closer to prompt)
    image_guidance_scale=1.0        # Image fidelity (higher = closer to source)
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)
```

### Advanced Generation with Motion Control

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load model with LoRA support
pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Load LoRA adapters for enhanced control
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras", adapter_name="motion_control")
pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras", adapter_name="camera_control")

# Enable adapters with specific weights
pipeline.set_adapters(["motion_control", "camera_control"], adapter_weights=[0.8, 0.7])

# Load source image
image = Image.open("landscape.jpg")

# Generate with enhanced control
prompt = "Smooth dolly forward, subtle parallax, cinematic motion"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=96,                  # More frames for longer video
    height=768,                     # Higher resolution
    width=768,
    num_inference_steps=75,         # More steps for quality
    guidance_scale=8.0,
    image_guidance_scale=1.2        # Strong image fidelity
).frames

export_to_video(video, "enhanced_output.mp4", fps=12)
```

### Memory-Efficient Generation

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")

# Enable all memory optimizations
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()
pipeline.enable_sequential_cpu_offload()  # Offload to CPU when not in use

# Load and resize image for efficiency
image = Image.open("photo.jpg")
image = image.resize((512, 512))

# Generate with reduced memory footprint
prompt = "Subtle natural motion and breathing life into the scene"
video = pipeline(
    image=image,
    prompt=prompt,
    num_frames=48,                  # Fewer frames for memory efficiency
    height=512,
    width=512,
    num_inference_steps=30,         # Fewer steps for faster generation
    guidance_scale=7.0
).frames

export_to_video(video, "memory_efficient_output.mp4", fps=8)
```

### Batch Processing Multiple Images

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image
import os

pipeline = DiffusionPipeline.from_pretrained(
    "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan",
    torch_dtype=torch.float16
)
pipeline.to("cuda")
pipeline.enable_attention_slicing()
pipeline.enable_vae_slicing()

# Process multiple images
input_dir = "E:/input_images"
output_dir = "E:/output_videos"
os.makedirs(output_dir, exist_ok=True)

for img_file in os.listdir(input_dir):
    if img_file.endswith(('.jpg', '.png', '.jpeg')):
        # Load image
        image = Image.open(os.path.join(input_dir, img_file))

        # Generate video
        video = pipeline(
            image=image,
            prompt="Cinematic motion, natural dynamics",
            num_frames=64,
            height=512,
            width=512,
            num_inference_steps=40
        ).frames

        # Save with matching name
        output_path = os.path.join(output_dir, f"{os.path.splitext(img_file)[0]}.mp4")
        export_to_video(video, output_path, fps=8)
        print(f"Generated: {output_path}")
```

## Model Specifications

### Technical Details

| Specification | Value |
|--------------|-------|
| **Model Type** | Latent Diffusion (Image-to-Video) |
| **Precision** | FP16 (16-bit) |
| **Format** | SafeTensors |
| **Max Frames** | 96-128 frames |
| **Resolution** | 512x512 to 1024x1024 |
| **Image Encoder** | CLIP/VAE-based |
| **VAE Channels** | 4 (latent) |
| **Sampling** | DDPM, DDIM, DPM-Solver++ |

### Supported Features

- ✅ Image-to-video generation
- ✅ Motion dynamics control
- ✅ Camera movement control
- ✅ Prompt-guided motion
- ✅ Image fidelity preservation
- ✅ LoRA adapter support
- ✅ Memory optimization techniques
- ✅ Batch processing
- ✅ Custom sampling schedulers
- ✅ Frame interpolation support

### Limitations

- ⚠️ Video length limited by VRAM (typically 2-15 seconds)
- ⚠️ Requires significant GPU memory (24 GB minimum recommended)
- ⚠️ Generation time increases with frame count and resolution
- ⚠️ Complex motions may require higher sampling steps for coherence
- ⚠️ Source image quality directly affects output quality
- ⚠️ Very high contrast or unusual images may produce artifacts

## Performance Tips and Optimization

### Memory Optimization

1. **Enable Attention Slicing**: Reduces VRAM usage at slight speed cost
   ```python
   pipeline.enable_attention_slicing()
   ```

2. **Enable VAE Slicing**: Processes VAE in smaller chunks
   ```python
   pipeline.enable_vae_slicing()
   ```

3. **CPU Offloading**: Move model components to CPU when not in use
   ```python
   pipeline.enable_sequential_cpu_offload()
   ```

4. **Reduce Resolution**: Start with 512x512 for testing, upscale later

5. **Resize Source Images**: Preprocess images to target resolution
   ```python
   image = image.resize((512, 512), Image.LANCZOS)
   ```

### Quality Optimization

1. **Increase Inference Steps**: 50-100 steps for higher quality (slower)
2. **Adjust Guidance Scales**:
   - `guidance_scale`: 7.0-9.0 for prompt adherence
   - `image_guidance_scale`: 1.0-1.5 for image fidelity
3. **Use LoRA Adapters**: Enhance motion, camera, and quality aspects
4. **Frame Interpolation**: Generate fewer frames, interpolate with RIFE/FILM
5. **High-Quality Source Images**: Use clean, well-lit source images

### Speed Optimization

1. **Reduce Inference Steps**: 20-30 steps for faster generation (lower quality)
2. **Lower Resolution**: 512x512 generates 4x faster than 1024x1024
3. **Fewer Frames**: Generate 48-64 frames instead of 96-128
4. **Use DPM-Solver++**: Faster sampling scheduler
   ```python
   from diffusers import DPMSolverMultistepScheduler
   pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
   ```

### Prompt Engineering Tips

- **Describe Motion**: "gentle pan", "slow zoom", "subtle motion"
- **Camera Movements**: "dolly in", "crane up", "orbit around"
- **Motion Quality**: "smooth", "cinematic", "natural dynamics"
- **Avoid Contradictions**: Keep motion descriptions coherent
- **Optional Prompts**: Prompts guide motion; can be empty for automatic motion
- **Scene Context**: Reference elements in the source image

### Source Image Best Practices

- **Resolution**: Use images at or near target video resolution
- **Quality**: High-quality, well-exposed images work best
- **Composition**: Well-composed images produce better results
- **Lighting**: Consistent lighting makes animation more coherent
- **Subject Matter**: Clear subjects with defined edges animate better
- **Avoid**: Very blurry, low-resolution, or extremely dark images

## License

This model is released under a custom WAN license. Please review the license terms before use.

### Usage Restrictions

- ✅ Research and non-commercial use permitted
- ✅ Educational and academic use permitted
- ⚠️ Commercial use may require separate licensing
- ❌ Do not use for generating harmful, misleading, or illegal content
- ❌ Do not use for deepfakes or impersonation without consent
- ❌ Respect copyright and intellectual property rights of source images

Please refer to the official WAN model documentation for complete license terms.

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@misc{wan25-i2v-fp16,
  title={WAN 2.5: Image-to-Video Diffusion Model},
  author={WAN Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan/WAN-2.5-I2V}},
  note={FP16 variant}
}
```

## Resources and Links

### Official Resources
- **Hugging Face Model Card**: https://huggingface.co/Wan/WAN-2.5-I2V
- **WAN Official Documentation**: [Link to official docs when available]
- **Model Paper**: [ArXiv link when available]

### Community and Support
- **Hugging Face Forums**: https://discuss.huggingface.co/
- **GitHub Issues**: [Repository link when available]
- **Discord Community**: [Discord invite when available]

### Related Models
- **WAN 2.5 Text-to-Video**: Text-conditioned video generation variant
- **WAN 2.5 FP8**: More memory-efficient variant (lower precision)
- **WAN 2.5 Full**: Full precision variant (higher quality, more VRAM)
- **FLUX.1**: Alternative text-to-image models in this repository

### Tutorials and Examples
- **Diffusers Documentation**: https://huggingface.co/docs/diffusers
- **Image-to-Video Guide**: https://huggingface.co/docs/diffusers/using-diffusers/image-to-video
- **LoRA Training Guide**: https://huggingface.co/docs/diffusers/training/lora

## Version History

### v1.4 (2025-10-28)
- Verified YAML frontmatter compliance with HuggingFace requirements
- Confirmed repository structure documentation accuracy
- Validated metadata fields (license, library_name, pipeline_tag, tags)
- Repository remains prepared for model file downloads

### v1.3 (2025-10-14)
- **CRITICAL FIX**: Corrected pipeline_tag from `text-to-video` to `image-to-video`
- Updated all documentation to reflect Image-to-Video (I2V) functionality
- Revised usage examples for image-conditioned generation
- Added `image_guidance_scale` parameter documentation
- Updated tags to include `image-to-video`
- Added source image best practices section
- Corrected model file naming conventions for I2V variant

### v1.2 (2025-10-14)
- Simplified YAML frontmatter to essential fields only per requirements
- Removed base_model and base_model_relation (base model, not derived)
- Streamlined tags for better discoverability
- Verified directory structure (still awaiting model download)

### v1.1 (2025-10-14)
- Updated YAML frontmatter to be first in file
- Corrected repository contents to reflect actual directory state
- Added download instructions for model files
- Clarified that model files are pending download
- Moved version comment after YAML frontmatter per HuggingFace standards

### v1.0 (2025-10-13)
- Created repository structure
- Documented expected model files and usage
- Provided comprehensive usage examples
- Included hardware requirements and optimization tips

## Contact and Contributions

For questions, issues, or contributions related to this repository organization:
- Local repository maintained for personal use
- See official WAN model repository for model-specific issues
- Refer to Hugging Face documentation for diffusers library support

---

**Repository Maintained By**: Local User
**Last Updated**: 2025-10-28
**README Version**: v1.4