--- license: other library_name: diffusers pipeline_tag: image-to-video tags: - wan - image-to-video - video-generation --- # WAN 2.5 FP16 - Image-to-Video Generation Model **Version**: v1.4 **Precision**: FP16 (16-bit floating point) **Model Family**: WAN (Video Generation) **Task**: Image-to-Video Generation ## Model Description WAN 2.5 Image-to-Video (I2V) is a state-of-the-art diffusion model capable of generating high-quality video sequences from static images. This FP16 version provides a balance between model quality and computational efficiency, making it suitable for systems with moderate GPU resources. ### Key Capabilities - **Image-to-Video Generation**: Animate static images into coherent video sequences - **Temporal Coherence**: Produces smooth, temporally consistent video frames - **Motion Control**: Advanced control over motion dynamics and camera movements - **Lighting Preservation**: Maintains lighting consistency from source image - **Quality Enhancement**: Support for LoRA adapters for improved output quality - **Efficient Inference**: FP16 precision reduces memory footprint while maintaining quality ### Model Architecture - **Diffusion Framework**: Latent diffusion-based video generation - **Conditioning**: Image-conditioned video synthesis - **Precision**: FP16 (half-precision floating point) - **Format**: SafeTensors (secure, efficient format) - **VAE**: Variational Autoencoder for latent space encoding/decoding ## Repository Contents **Status**: Repository structure prepared for model files (currently empty). ### Current Directory Structure ``` wan25-fp16-i2v/ ├── diffusion_models/ │ └── wan/ # Empty - awaiting model download ├── README.md # This file (15 KB) └── (model files to be added) ``` ### Expected Model Files (After Download) The repository is organized to store WAN 2.5 FP16 I2V model files once downloaded from Hugging Face: **Core Model Files** (to be placed in `diffusion_models/wan/`): - `wan_2.5_i2v_fp16.safetensors` - Main UNet diffusion model for video generation (~8-12 GB) - `wan_vae_fp16.safetensors` - VAE for encoding/decoding video frames (~1-2 GB) - `image_encoder.safetensors` - CLIP/VAE image encoder for conditioning (~1-2 GB) - `config.json` - Model architecture configuration and hyperparameters (~5-10 KB) **Optional LoRA Adapters** (to be placed in `loras/` directory if downloaded): - `motion_control_lora.safetensors` - Fine-grained motion dynamics control (~100-500 MB) - `camera_control_lora.safetensors` - Camera movement and perspective control (~100-500 MB) - `quality_enhancement_lora.safetensors` - Output quality improvements (~100-500 MB) **Total Repository Size**: - Current: ~15 KB (documentation only) - After Model Download: 10-15 GB (core model) + 0.3-1.5 GB (optional LoRAs) ### Download Instructions To populate this repository with model files: ```bash # Install Hugging Face CLI pip install huggingface-hub # Download WAN 2.5 FP16 I2V model (requires HF authentication) huggingface-cli login huggingface-cli download Wan/WAN-2.5-I2V --revision fp16 --local-dir "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan" ``` ## Hardware Requirements ### Minimum Requirements (FP16) - **GPU**: NVIDIA RTX 3090 (24 GB VRAM) or AMD equivalent - **System RAM**: 32 GB - **Disk Space**: 20 GB free space - **CUDA**: 11.8 or higher (for NVIDIA GPUs) ### Recommended Requirements - **GPU**: NVIDIA RTX 4090 (24 GB VRAM) or A5000/A6000 - **System RAM**: 64 GB - **Disk Space**: 30 GB free space (for model + output cache) - **CUDA**: 12.1 or higher ### Performance Expectations - **Short Videos (2-4 seconds)**: ~30-60 seconds generation time - **Medium Videos (5-10 seconds)**: ~1-3 minutes generation time - **Long Videos (10-15 seconds)**: ~3-5 minutes generation time *Generation times vary based on resolution, frame rate, and sampling steps* ## Usage Examples ### Installation ```bash # Install dependencies pip install diffusers transformers accelerate safetensors torch torchvision pillow ``` ### Basic Image-to-Video Generation ```python import torch from diffusers import DiffusionPipeline from PIL import Image # Load the model pipeline = DiffusionPipeline.from_pretrained( "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan", torch_dtype=torch.float16, variant="fp16" ) pipeline.to("cuda") # Enable memory optimizations pipeline.enable_attention_slicing() pipeline.enable_vae_slicing() # Load source image image = Image.open("input_image.jpg") # Generate video from image prompt = "Add gentle camera pan and natural motion" video = pipeline( image=image, # Source image prompt=prompt, # Optional motion guidance num_frames=64, # Number of frames to generate height=512, # Video height width=512, # Video width num_inference_steps=50, # Sampling steps (higher = better quality) guidance_scale=7.5, # Prompt adherence (higher = closer to prompt) image_guidance_scale=1.0 # Image fidelity (higher = closer to source) ).frames # Save video from diffusers.utils import export_to_video export_to_video(video, "output.mp4", fps=8) ``` ### Advanced Generation with Motion Control ```python import torch from diffusers import DiffusionPipeline from PIL import Image # Load model with LoRA support pipeline = DiffusionPipeline.from_pretrained( "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan", torch_dtype=torch.float16 ) pipeline.to("cuda") # Load LoRA adapters for enhanced control pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras", adapter_name="motion_control") pipeline.load_lora_weights("E:/huggingface/wan25-fp16-i2v/loras", adapter_name="camera_control") # Enable adapters with specific weights pipeline.set_adapters(["motion_control", "camera_control"], adapter_weights=[0.8, 0.7]) # Load source image image = Image.open("landscape.jpg") # Generate with enhanced control prompt = "Smooth dolly forward, subtle parallax, cinematic motion" video = pipeline( image=image, prompt=prompt, num_frames=96, # More frames for longer video height=768, # Higher resolution width=768, num_inference_steps=75, # More steps for quality guidance_scale=8.0, image_guidance_scale=1.2 # Strong image fidelity ).frames export_to_video(video, "enhanced_output.mp4", fps=12) ``` ### Memory-Efficient Generation ```python import torch from diffusers import DiffusionPipeline from PIL import Image pipeline = DiffusionPipeline.from_pretrained( "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan", torch_dtype=torch.float16 ) pipeline.to("cuda") # Enable all memory optimizations pipeline.enable_attention_slicing() pipeline.enable_vae_slicing() pipeline.enable_sequential_cpu_offload() # Offload to CPU when not in use # Load and resize image for efficiency image = Image.open("photo.jpg") image = image.resize((512, 512)) # Generate with reduced memory footprint prompt = "Subtle natural motion and breathing life into the scene" video = pipeline( image=image, prompt=prompt, num_frames=48, # Fewer frames for memory efficiency height=512, width=512, num_inference_steps=30, # Fewer steps for faster generation guidance_scale=7.0 ).frames export_to_video(video, "memory_efficient_output.mp4", fps=8) ``` ### Batch Processing Multiple Images ```python import torch from diffusers import DiffusionPipeline from PIL import Image import os pipeline = DiffusionPipeline.from_pretrained( "E:/huggingface/wan25-fp16-i2v/diffusion_models/wan", torch_dtype=torch.float16 ) pipeline.to("cuda") pipeline.enable_attention_slicing() pipeline.enable_vae_slicing() # Process multiple images input_dir = "E:/input_images" output_dir = "E:/output_videos" os.makedirs(output_dir, exist_ok=True) for img_file in os.listdir(input_dir): if img_file.endswith(('.jpg', '.png', '.jpeg')): # Load image image = Image.open(os.path.join(input_dir, img_file)) # Generate video video = pipeline( image=image, prompt="Cinematic motion, natural dynamics", num_frames=64, height=512, width=512, num_inference_steps=40 ).frames # Save with matching name output_path = os.path.join(output_dir, f"{os.path.splitext(img_file)[0]}.mp4") export_to_video(video, output_path, fps=8) print(f"Generated: {output_path}") ``` ## Model Specifications ### Technical Details | Specification | Value | |--------------|-------| | **Model Type** | Latent Diffusion (Image-to-Video) | | **Precision** | FP16 (16-bit) | | **Format** | SafeTensors | | **Max Frames** | 96-128 frames | | **Resolution** | 512x512 to 1024x1024 | | **Image Encoder** | CLIP/VAE-based | | **VAE Channels** | 4 (latent) | | **Sampling** | DDPM, DDIM, DPM-Solver++ | ### Supported Features - ✅ Image-to-video generation - ✅ Motion dynamics control - ✅ Camera movement control - ✅ Prompt-guided motion - ✅ Image fidelity preservation - ✅ LoRA adapter support - ✅ Memory optimization techniques - ✅ Batch processing - ✅ Custom sampling schedulers - ✅ Frame interpolation support ### Limitations - ⚠️ Video length limited by VRAM (typically 2-15 seconds) - ⚠️ Requires significant GPU memory (24 GB minimum recommended) - ⚠️ Generation time increases with frame count and resolution - ⚠️ Complex motions may require higher sampling steps for coherence - ⚠️ Source image quality directly affects output quality - ⚠️ Very high contrast or unusual images may produce artifacts ## Performance Tips and Optimization ### Memory Optimization 1. **Enable Attention Slicing**: Reduces VRAM usage at slight speed cost ```python pipeline.enable_attention_slicing() ``` 2. **Enable VAE Slicing**: Processes VAE in smaller chunks ```python pipeline.enable_vae_slicing() ``` 3. **CPU Offloading**: Move model components to CPU when not in use ```python pipeline.enable_sequential_cpu_offload() ``` 4. **Reduce Resolution**: Start with 512x512 for testing, upscale later 5. **Resize Source Images**: Preprocess images to target resolution ```python image = image.resize((512, 512), Image.LANCZOS) ``` ### Quality Optimization 1. **Increase Inference Steps**: 50-100 steps for higher quality (slower) 2. **Adjust Guidance Scales**: - `guidance_scale`: 7.0-9.0 for prompt adherence - `image_guidance_scale`: 1.0-1.5 for image fidelity 3. **Use LoRA Adapters**: Enhance motion, camera, and quality aspects 4. **Frame Interpolation**: Generate fewer frames, interpolate with RIFE/FILM 5. **High-Quality Source Images**: Use clean, well-lit source images ### Speed Optimization 1. **Reduce Inference Steps**: 20-30 steps for faster generation (lower quality) 2. **Lower Resolution**: 512x512 generates 4x faster than 1024x1024 3. **Fewer Frames**: Generate 48-64 frames instead of 96-128 4. **Use DPM-Solver++**: Faster sampling scheduler ```python from diffusers import DPMSolverMultistepScheduler pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) ``` ### Prompt Engineering Tips - **Describe Motion**: "gentle pan", "slow zoom", "subtle motion" - **Camera Movements**: "dolly in", "crane up", "orbit around" - **Motion Quality**: "smooth", "cinematic", "natural dynamics" - **Avoid Contradictions**: Keep motion descriptions coherent - **Optional Prompts**: Prompts guide motion; can be empty for automatic motion - **Scene Context**: Reference elements in the source image ### Source Image Best Practices - **Resolution**: Use images at or near target video resolution - **Quality**: High-quality, well-exposed images work best - **Composition**: Well-composed images produce better results - **Lighting**: Consistent lighting makes animation more coherent - **Subject Matter**: Clear subjects with defined edges animate better - **Avoid**: Very blurry, low-resolution, or extremely dark images ## License This model is released under a custom WAN license. Please review the license terms before use. ### Usage Restrictions - ✅ Research and non-commercial use permitted - ✅ Educational and academic use permitted - ⚠️ Commercial use may require separate licensing - ❌ Do not use for generating harmful, misleading, or illegal content - ❌ Do not use for deepfakes or impersonation without consent - ❌ Respect copyright and intellectual property rights of source images Please refer to the official WAN model documentation for complete license terms. ## Citation If you use this model in your research or projects, please cite: ```bibtex @misc{wan25-i2v-fp16, title={WAN 2.5: Image-to-Video Diffusion Model}, author={WAN Team}, year={2025}, publisher={Hugging Face}, howpublished={\url{https://huggingface.co/Wan/WAN-2.5-I2V}}, note={FP16 variant} } ``` ## Resources and Links ### Official Resources - **Hugging Face Model Card**: https://huggingface.co/Wan/WAN-2.5-I2V - **WAN Official Documentation**: [Link to official docs when available] - **Model Paper**: [ArXiv link when available] ### Community and Support - **Hugging Face Forums**: https://discuss.huggingface.co/ - **GitHub Issues**: [Repository link when available] - **Discord Community**: [Discord invite when available] ### Related Models - **WAN 2.5 Text-to-Video**: Text-conditioned video generation variant - **WAN 2.5 FP8**: More memory-efficient variant (lower precision) - **WAN 2.5 Full**: Full precision variant (higher quality, more VRAM) - **FLUX.1**: Alternative text-to-image models in this repository ### Tutorials and Examples - **Diffusers Documentation**: https://huggingface.co/docs/diffusers - **Image-to-Video Guide**: https://huggingface.co/docs/diffusers/using-diffusers/image-to-video - **LoRA Training Guide**: https://huggingface.co/docs/diffusers/training/lora ## Version History ### v1.4 (2025-10-28) - Verified YAML frontmatter compliance with HuggingFace requirements - Confirmed repository structure documentation accuracy - Validated metadata fields (license, library_name, pipeline_tag, tags) - Repository remains prepared for model file downloads ### v1.3 (2025-10-14) - **CRITICAL FIX**: Corrected pipeline_tag from `text-to-video` to `image-to-video` - Updated all documentation to reflect Image-to-Video (I2V) functionality - Revised usage examples for image-conditioned generation - Added `image_guidance_scale` parameter documentation - Updated tags to include `image-to-video` - Added source image best practices section - Corrected model file naming conventions for I2V variant ### v1.2 (2025-10-14) - Simplified YAML frontmatter to essential fields only per requirements - Removed base_model and base_model_relation (base model, not derived) - Streamlined tags for better discoverability - Verified directory structure (still awaiting model download) ### v1.1 (2025-10-14) - Updated YAML frontmatter to be first in file - Corrected repository contents to reflect actual directory state - Added download instructions for model files - Clarified that model files are pending download - Moved version comment after YAML frontmatter per HuggingFace standards ### v1.0 (2025-10-13) - Created repository structure - Documented expected model files and usage - Provided comprehensive usage examples - Included hardware requirements and optimization tips ## Contact and Contributions For questions, issues, or contributions related to this repository organization: - Local repository maintained for personal use - See official WAN model repository for model-specific issues - Refer to Hugging Face documentation for diffusers library support --- **Repository Maintained By**: Local User **Last Updated**: 2025-10-28 **README Version**: v1.4