Run Wan2.2 I2V-A14B locally on 8 GB VRAM

  1. download model code: https://github.com/nalexand/Wan2.2
  2. install all dependencies
  3. download model weights: huggingface-cli download nalexand/Wan2.2-I2V-A14B-FP8 --local-dir ./Wan2.2-I2V-A14B
  4. save start frame to: ./last_frame.png
  5. run: python generate_local.py --task i2v-A14B --size "1280*720" --image=./last_frame.png --ckpt_dir ./Wan2.2-I2V-A14B --prompt "In close-up, a cheetah runs at full speed in a narrow canyon, its golden fur gleaming in the sun, and its black tear marks clearly visible. Shot from a low angle, the cheetah's body is close to the ground, its muscles flowing, and its limbs alternately and powerfully step over stones and soil, stirring up dust. The cheetah's eyes are sharp, staring at the target in front of it, showing unparalleled speed and strength. The camera follows the cheetah's running trajectory, capturing every moment of leaping and turning, showing its amazing agility. The whole scene unfolds in a tense chase rhythm, full of wild charm and competition for survival."

Generated frames are limited to 21 (about 1.3 seconds at 1280*704) to fit within 8 GB of VRAM.

For a full 5-second video, the maximum resolution is 720*405 (on 8 GB VRAM).

* Tested on a Predator Helios 300 laptop (RTX 3070 Ti, 8 GB) at 1280*704: 72.22 s/it for 21 frames, 60.74 s/it for 17 frames, 48.70 s/it for 13 frames.

  • Optimized I2V-A14B: run a long-video generation loop with loop.bat

https://github.com/user-attachments/assets/154df173-88d3-4ad1-b543-f7410380b13a

How it works

  • Edit the prompt in loop.bat and run it. The command runs in a loop; each iteration performs one step: create a latent from the image -> y_latents.pt, run inference -> final_latents.pt, decode the video from final_latents.pt and save the last frame -> last_frame_latents.pt, create a latent from the last frame (last_frame_latents.pt -> y_latents.pt), run inference, and so on.
  • To start a new generation loop with a new image / prompt / frame count / size, delete: y_latents.pt, final_latents.pt, last_frame_latents.pt
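The bookkeeping above can be sketched as a small state machine keyed on which intermediate files exist. This is only an illustration of the loop logic; `next_step` and its return labels are hypothetical names, not the repo's actual API (the real decision happens inside generate_local.py):

```python
import os

# State files used by the loop (names from loop.bat / generate_local.py):
Y_LAT, FINAL_LAT, LAST_LAT = "y_latents.pt", "final_latents.pt", "last_frame_latents.pt"

def next_step():
    """Decide which single step the current iteration should perform,
    based on which intermediate files currently exist."""
    if not os.path.exists(Y_LAT) and not os.path.exists(LAST_LAT):
        return "encode_start_image"   # first iteration: last_frame.png -> y_latents.pt
    if os.path.exists(Y_LAT) and not os.path.exists(FINAL_LAT):
        return "run_inference"        # y_latents.pt -> final_latents.pt
    if os.path.exists(FINAL_LAT):
        return "decode_video"         # final_latents.pt -> video + last_frame_latents.pt
    return "encode_last_frame"        # last_frame_latents.pt -> y_latents.pt
```

Deleting all three .pt files resets the state machine to "encode_start_image", which is exactly why a new image/prompt/size requires deleting them.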

Results on a 3070 Ti laptop GPU with 8 GB VRAM:

    # size 640*352
    # 81 frames             58.23 s/it (FP16), 51.32 s/it (*FP8)
    # 33 frames             23.75 s/it               vae decode 4.5 sec

    # size 704*396, sampling_steps 25+
    # 49 frames             24.72 s/it (FP16)
    # 81 frames             77.50 s/it (FP16)

    # size 720*405, sampling_steps 20+
    # 17 frames             21.23 s/it (FP16)
    # 77 frames             82.11 s/it (FP16)
    # 81 frames (best)      70.74 s/it (*FP8)        vae decode 12.2 sec

    # size 832*464 / 848*448, sampling_steps 20+
    # 17 frames             23.68 s/it               vae decode 3.54 sec
    # 53 frames             74.34 s/it
    # 65 frames             79.73 s/it

    # size 960*540, sampling_steps 16+
    # 17 frames             34.30 s/it (FP16)
    # 41 frames             75.02 s/it (FP16)
    # 45 frames             72.35 s/it (*FP8)        vae decode 11.7 sec

    ######################################################
    # with 8 GB VRAM and sizes > 960*540, the VAE uses slow shared video memory

    # size 1120*630
    # 13 frames             29.24 s/it (*FP8)        vae decode 18 sec
    # 33 frames (max)       85.10 s/it (FP16)        vae decode 57.93 sec (10.1 GB)
    # 33 frames             76.49 s/it (*FP8)
    # 37 frames             85.16 s/it (*FP8)        vae decode 75.73 sec

    # size 1280*720, sampling_steps 16+
    # 13 frames             48.70 s/it (FP16)        vae decode 99 sec
    # 13 frames             39.61 s/it (*FP8)
    # 17 frames             60.74 s/it (FP16)
    # 17 frames             54.02 s/it (*FP8)
    # 21 frames (max)       72.22 s/it (FP16)
    # 21 frames             66.18 s/it (*FP8)        vae decode 152 sec

    # size 1600*896 / 1568*896, sampling_steps 15+
    # 13 frames (max)       85.47 s/it (FP16)
    # 13 frames (max)       63.88 s/it (*FP8)        vae decode 284 sec

Compared to ComfyUI

    1120*630, 33 frames,   ComfyUI (FP8)   This (FP16), optimized VAE    This (*FP8), optimized VAE
    16 steps
    sampling               1470 sec        85.10 s/it * 16 = 1362 sec    76.49 s/it * 16 = 1224 sec
    vae decode             +117 sec        +58 sec                       +58 sec
    total                  1587 sec        1420 sec (1.12x faster)       1282 sec (1.24x faster!)
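The totals in the comparison above follow directly from the per-iteration times; a quick sanity check (numbers taken from the table, 16 sampling steps):

```python
def total_seconds(s_per_it, steps, vae_decode):
    """Total generation time: sampling iterations plus VAE decode."""
    return s_per_it * steps + vae_decode

comfy = 1470 + 117                     # ComfyUI: 1587 sec
fp16  = total_seconds(85.10, 16, 58)   # ~1420 sec
fp8   = total_seconds(76.49, 16, 58)   # ~1282 sec

print(round(comfy / fp16, 2))  # -> 1.12
print(round(comfy / fp8, 2))   # -> 1.24
```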
                                                          

*FP8: the 3070 Ti doesn't support FP8 computation, so the weights, which are stored in FP8, are converted to FP16 "on the fly" for computation.
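For illustration, here is what "converting on the fly" means at the bit level for the standard FP8 E4M3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits). The repo itself almost certainly just casts tensors with PyTorch; this standalone decoder is only a sketch of the format:

```python
def decode_fp8_e4m3(byte):
    """Decode one FP8 E4M3 (e4m3fn) byte to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F   # 4 exponent bits, bias 7
    man = byte & 0x07          # 3 mantissa bits
    if exp == 0:
        # subnormal: no implicit leading 1, fixed exponent 2^-6
        return sign * (man / 8.0) * 2.0 ** -6
    if exp == 0x0F and man == 0x07:
        # e4m3fn has no infinities; 0x7F / 0xFF encode NaN
        return float("nan")
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```

With only 3 mantissa bits, each weight keeps roughly 2 significant decimal digits, which is why the quality loss versus FP16 is barely visible.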

Visually, it is hard to notice any difference in quality between FP8 and FP16.

To try FP16: download the original model, run convert_files / optimize_files, put the safetensors in ./Wan2.2-I2V-A14B/(low_noise_model/high_noise_model), and change "self.load_as_fp8 = True" to False in image2videolocal.py. More details: https://github.com/nalexand/Wan2.2
