---
license: apache-2.0
tags:
- Image-to-Video
- image-to-video
- image
- video
- wan2.2
---

Generated frames are limited to 21 (1.3 sec at 1280*704) to fit within 8 GB of VRAM.
For a full 5-second video, the maximum resolution is 720*405 (on 8 GB VRAM).
* Tested on a HELIOS PREDATOR 300 laptop (3070 Ti, 8 GB) at 1280*704: 72.22 s/it for 21 frames, 60.74 s/it for 17 frames, 48.70 s/it for 13 frames.

**Optimized I2V-A14B**: run long video generation in a loop with **loop.bat**.

https://github.com/user-attachments/assets/154df173-88d3-4ad1-b543-f7410380b13a

## How it works

- Edit the prompt in **loop.bat** and run it. The command runs in a loop, and each iteration performs one step: create a latent from the image -> `y_latents.pt`, run inference -> `final_latents.pt`, decode the video from `final_latents.pt` and save the last frame -> `last_frame_latents.pt`, create a latent from the last frame `last_frame_latents.pt` -> `y_latents.pt`, run inference, and so on (a Python sketch of this loop is given at the end of this page).
- **To start a new generation loop** with a new image / prompt / frame count / size, delete **y_latents.pt**, **final_latents.pt**, and **last_frame_latents.pt**.

## Results on a 3070 Ti laptop GPU with 8 GB VRAM

**Size 640*352**
- 81 frames: 58.23 s/it (FP16), 51.32 s/it (*FP8)
- 33 frames: 23.75 s/it, VAE decode 4.5 sec

**Size 704*396, sampling_steps 25+**
- 49 frames: 24.72 s/it (FP16)
- 81 frames: 77.50 s/it (FP16)

**Size 720*405, sampling_steps 20+**
- 17 frames: 21.23 s/it (FP16)
- 77 frames: 82.11 s/it (FP16)
- 81 frames (best): 70.74 s/it (*FP8), VAE decode 12.2 sec

**Size 832*464 / 848*448, sampling_steps 20+**
- 17 frames: 23.68 s/it, VAE decode 3.54 sec
- 53 frames: 74.34 s/it
- 65 frames: 79.73 s/it

**Size 960*540, sampling_steps 16+**
- 17 frames: 34.30 s/it (FP16)
- 41 frames: 75.02 s/it (FP16)
- 45 frames: 72.35 s/it (*FP8), VAE decode 11.7 sec

With 8 GB VRAM and sizes above 960*540, the VAE falls back to slow shared video memory.

**Size 1120*630**
- 13 frames: 29.24 s/it (*FP8), VAE decode 18 sec
- 33 frames (max): 85.10 s/it (FP16), VAE decode 57.93 sec (10.1 GB)
- 33 frames: 76.49 s/it (*FP8)
- 37 frames: 85.16 s/it (*FP8), VAE decode 75.73 sec

**Size 1280*720, sampling_steps 16+**
- 13 frames: 48.70 s/it (FP16), VAE decode 99 sec
- 13 frames: 39.61 s/it (*FP8)
- 17 frames: 60.74 s/it (FP16)
- 17 frames: 54.02 s/it (*FP8)
- 21 frames (max): 72.22 s/it (FP16)
- 21 frames: 66.18 s/it (*FP8), VAE decode 152 sec

**Size 1600*896 / 1568*896, sampling_steps 15+**
- 13 frames (max): 85.47 s/it (FP16)
- 13 frames (max): 63.88 s/it (*FP8), VAE decode 284 sec

## Compared to ComfyUI (1120*630, 33 frames, 16 steps)

|            | ComfyUI (fp8) | This repo (FP16), optimized VAE | This repo (*FP8), optimized VAE |
|------------|---------------|---------------------------------|---------------------------------|
| Sampling   | 1470 sec      | 85.10 s/it * 16 = 1362 sec      | 76.49 s/it * 16 = 1224 sec      |
| VAE decode | +117 sec      | +58 sec                         | +58 sec                         |
| Total      | 1587 sec      | 1420 sec (1.12x faster)         | 1282 sec (1.24x faster)         |

*FP8: the 3070 Ti does not support FP8 computation, so weights loaded in FP8 are converted to FP16 "on the fly" for computation. Visually it is hard to notice a quality difference between FP8 and FP16.

To try FP16: download the original model, run convert_files / optimize_files, put the resulting safetensors in `./Wan2.2-I2V-A14B/(low_noise_model/high_noise_model)`, and switch `self.load_as_fp8 = True` to `False` in `image2videolocal.py`.

More details: [https://github.com/nalexand/Wan2.2](https://github.com/nalexand/Wan2.2)
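
## Sketch of the generation loop

As an illustration of the loop described in "How it works" above, here is a minimal, hypothetical Python sketch of the file-driven state machine that **loop.bat** cycles through. The step functions (`encode_image_to_latent`, `run_inference`, `decode_video`) and the `start_image` argument are placeholder names chosen for this example, not the project's actual API; in the real repository each step corresponds to a run of `image2videolocal.py`.

```python
# Hypothetical sketch (not the actual project code) of the file-driven state machine
# that loop.bat cycles through. The placeholder steps only touch marker files so the
# sketch can be run and observed; the real work is done by image2videolocal.py.
import os
import pathlib

Y_LAT = "y_latents.pt"               # conditioning latent for the current segment
FINAL_LAT = "final_latents.pt"       # denoised latents produced by inference
LAST_LAT = "last_frame_latents.pt"   # latent of the last decoded frame

def encode_image_to_latent(src: str, out: str) -> None:
    """Placeholder for: encode src (input image or last-frame latent) -> y_latents.pt."""
    print(f"[encode] {src} -> {out}")
    pathlib.Path(out).touch()

def run_inference(cond: str, out: str) -> None:
    """Placeholder for: run the I2V-A14B sampler conditioned on y_latents.pt -> final_latents.pt."""
    print(f"[sample] {cond} -> {out}")
    pathlib.Path(out).touch()

def decode_video(latents: str, last_frame_out: str) -> None:
    """Placeholder for: VAE-decode final_latents.pt to a video chunk and keep the last frame latent."""
    print(f"[decode] {latents} -> video chunk + {last_frame_out}")
    pathlib.Path(last_frame_out).touch()

def one_step(start_image: str = "input.jpg") -> None:
    if not os.path.exists(Y_LAT):
        # No conditioning latent yet: encode the input image, or the last frame
        # of the previous segment if one exists, into y_latents.pt.
        src = LAST_LAT if os.path.exists(LAST_LAT) else start_image
        encode_image_to_latent(src, out=Y_LAT)
    elif not os.path.exists(FINAL_LAT):
        # Conditioning latent exists but no result yet: run inference.
        run_inference(cond=Y_LAT, out=FINAL_LAT)
    else:
        # Result exists: decode it and re-arm the loop for the next segment.
        decode_video(FINAL_LAT, last_frame_out=LAST_LAT)
        os.remove(FINAL_LAT)
        os.remove(Y_LAT)  # next call re-encodes y_latents.pt from last_frame_latents.pt

if __name__ == "__main__":
    one_step()  # loop.bat re-runs one step per iteration
```

Because the state lives entirely in the three `.pt` files, deleting `y_latents.pt`, `final_latents.pt`, and `last_frame_latents.pt` resets the loop, which is exactly the "start a new generation loop" instruction above.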