🔥 News
- Oct 24, 2025: We release the first unified semantic video generation model, Video-As-Prompt (VAP)!
- Oct 24, 2025: We release VAP-Data, the largest semantic-controlled video generation dataset, with more than 100K samples!
- Oct 24, 2025: We present the technical report of Video-As-Prompt. Please check out the details and join the discussion!
Video-As-Prompt
Core idea: Given a reference video with the desired semantics as a video prompt, Video-As-Prompt animates a reference image with the same semantics as the reference video.
E.g., Different Reference Videos + Same Reference Image → New Videos with Different Semantics
See our project page for more interesting results!
Model Zoo
To demonstrate cross-architecture generality, Video-As-Prompt provides two variants, each with distinct trade-offs:
CogVideoX-I2V-5B
- Strengths: Fewer backbone parameters let us train for more steps under limited resources, yielding strong stability on most semantic conditions.
- Limitations: Due to the backbone's limited capacity, it is weaker on human-centric generation and on concepts underrepresented in pretraining (e.g., Labubu, Squid Game, Minecraft).

Wan2.1-I2V-14B
- Strengths: Strong performance on human actions and novel concepts, thanks to a more capable base model.
- Limitations: The larger model size reduces the feasible number of training steps given our resources, lowering stability on some semantic conditions.
Contributions and further optimization from the community are welcome.
| Model | Date | Size | Hugging Face |
|---|---|---|---|
| Video-As-Prompt (CogVideoX-I2V-5B) | 2025-10-15 | 5B (Pretrained DiT) + 5B (VAP) | Download |
| Video-As-Prompt (Wan2.1-I2V-14B) | 2025-10-15 | 14B (Pretrained DiT) + 5B (VAP) | Download |
Please download the pre-trained video DiTs and our corresponding Video-As-Prompt models, and organize them as follows:
ckpts/
├── Video-As-Prompt-CogVideoX-5B/
│   ├── scheduler
│   ├── vae
│   ├── transformer
│   └── ...
└── Video-As-Prompt-Wan2.1-14B/
    ├── scheduler
    ├── vae
    ├── transformer
    └── ...
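If you prefer a scripted download, the following is a minimal sketch using huggingface_hub's snapshot_download. The repo IDs are taken from the model table above and the local paths match the layout shown; adjust them if your setup differs.

```python
from huggingface_hub import snapshot_download

# Download both Video-As-Prompt checkpoints into the ckpts/ layout above.
# Repo IDs are assumed from the model table; adjust local_dir as needed.
for repo_id, local_dir in [
    ("ByteDance/Video-As-Prompt-CogVideoX-5B", "ckpts/Video-As-Prompt-CogVideoX-5B"),
    ("ByteDance/Video-As-Prompt-Wan2.1-14B", "ckpts/Video-As-Prompt-Wan2.1-14B"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```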
🤗 Get Started with Video-As-Prompt
Video-As-Prompt supports macOS, Windows, and Linux. Follow the steps below to get started.
Install Requirements
We test our model with Python 3.10 and PyTorch 2.7.1+cu124.
conda create -n video_as_prompt python=3.10 -y
conda activate video_as_prompt
pip install -r requirements.txt
pip install -e ./diffusers
conda install -c conda-forge ffmpeg -y
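After installation, a quick sanity check (a minimal sketch; the printed versions will vary with your environment) confirms that PyTorch and CUDA are visible:

```python
import torch

# We tested with Python 3.10 and PyTorch 2.7.1+cu124; other versions may also work.
print(torch.__version__)          # PyTorch version
print(torch.cuda.is_available())  # True if a CUDA GPU is visible
```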
Data
We have released the VAP-Data dataset used in our paper; see VAP-Data. Please download it and place it under the data folder. The structure should look like:
data/
└── VAP-Data/
    ├── vfx_videos/
    ├── vfx_videos_hq/
    ├── vfx_videos_hq_camera/
    ├── benchmark/benchmark.csv
    └── vap_data.csv
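As a minimal sketch, the dataset can also be fetched with huggingface_hub (the dataset repo ID ByteDance/VAP-Data is an assumption; substitute the actual VAP-Data repo if it differs):

```python
from huggingface_hub import snapshot_download

# Assumed dataset repo ID; replace it with the actual VAP-Data repo if different.
snapshot_download(
    repo_id="ByteDance/VAP-Data",
    repo_type="dataset",
    local_dir="data/VAP-Data",
)
```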
Code Usage
Our code is mainly built on diffusers and finetrainers, chosen for their modular design.
Minimal Demo
Below is a minimal demo of our CogVideoX-I2V-5B variant. The full code can be found in infer/cog_vap.py. The Wan2.1-I2V-14B variant is similar; see infer/wan_vap.py.
import torch
from diffusers import (
    AutoencoderKLCogVideoX,
    CogVideoXImageToVideoMOTPipeline,
    CogVideoXTransformer3DMOTModel,
)
from diffusers.utils import export_to_video, load_video
from PIL import Image

# Load the VAE and the Video-As-Prompt transformer, then assemble the pipeline.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "ByteDance/Video-As-Prompt-CogVideoX-5B", subfolder="vae", torch_dtype=torch.bfloat16
)
transformer = CogVideoXTransformer3DMOTModel.from_pretrained(
    "ByteDance/Video-As-Prompt-CogVideoX-5B", torch_dtype=torch.bfloat16
)
pipe = CogVideoXImageToVideoMOTPipeline.from_pretrained(
    "ByteDance/Video-As-Prompt-CogVideoX-5B", vae=vae, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

# Reference video (the semantic video prompt) and the reference image to animate.
ref_video = load_video("assets/videos/demo/object-725.mp4")
image = Image.open("assets/images/demo/animal-2.jpg").convert("RGB")

# Evenly sample 49 frames from the reference video.
idx = torch.linspace(0, len(ref_video) - 1, 49).long().tolist()
ref_frames = [ref_video[i] for i in idx]

# prompt describes the target video; prompt_mot_ref captions the reference video.
output_frames = pipe(
    image=image,
    ref_videos=[ref_frames],
    prompt="A chestnut-colored horse stands on a grassy hill against a backdrop of distant, snow-dusted mountains. The horse begins to inflate, its defined, muscular body swelling and rounding into a smooth, balloon-like form while retaining its rich, brown hide color. Without changing its orientation, the now-buoyant horse lifts silently from the ground. It begins a steady vertical ascent, rising straight up and eventually floating out of the top of the frame. The camera remains completely static throughout the entire sequence, holding a fixed shot on the landscape as the horse transforms and departs, ensuring the verdant hill and mountain range in the background stay perfectly still.",
    prompt_mot_ref=[
        "A hand holds up a single beige sneaker decorated with gold calligraphy and floral illustrations, with small green plants tucked inside. The sneaker immediately begins to inflate like a balloon, its shape distorting as the decorative details stretch and warp across the expanding surface. It rapidly transforms into a perfectly smooth, matte beige sphere, inheriting the primary color from the original shoe. Once the transformation is complete, the new balloon-like object quickly ascends, moving straight up and exiting the top of the frame. The camera remains completely static and the plain white background is unchanged throughout the entire sequence."
    ],
    height=480,
    width=720,
    num_frames=49,
    frames_selection="evenly",
    use_dynamic_cfg=True,
).frames[0]
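To save the result, you can use the export_to_video helper imported above (the output path and fps here are placeholder values):

```python
# Write the generated frames to disk; the file name and fps are assumptions.
export_to_video(output_frames, "output.mp4", fps=8)
```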
Benchmark Inference
You can also refer to the following scripts for benchmark inference. You can then use VBench to evaluate the results.
python infer/cog_vap_bench.py
python infer/wan_vap_bench.py
Feel free to modify the scripts to explore more results on our VAP-Data dataset, or even on in-the-wild reference videos and images.
Training
Pick a recipe, then run the corresponding script. Each script sets sensible defaults; override as needed.
Recipes: CogVideoX-I2V-5B
| Goal | Nodes | Objective | References / sample | Script |
|---|---|---|---|---|
| Standard SFT | 1 | SFT | 1 | examples/training/sft/cogvideox/vap_mot/train_single_node.sh |
| Standard SFT | ≥2 | SFT | 1 | examples/training/sft/cogvideox/vap_mot/train_multi_node.sh |
| Preference optimization | 1 | DPO | 1 | examples/training/sft/cogvideox/vap_mot/train_single_node_dpo.sh |
| Preference optimization | ≥2 | DPO | 1 | examples/training/sft/cogvideox/vap_mot/train_multi_node_dpo.sh |
| Multi-reference SFT | 1 | SFT | ≤3 | examples/training/sft/cogvideox/vap_mot/train_single_node_3ref.sh |
DPO and multi-reference SFT are exploratory; we provide the code to support further community research.
Recipes: Wan2.1-I2V-14B (SFT only)
| Goal | Nodes | Objective | References / sample | Script |
|---|---|---|---|---|
| Standard SFT | 1 | SFT | 1 | examples/training/sft/wan/vap_mot/train_single_node.sh |
| Standard SFT | ≥2 | SFT | 1 | examples/training/sft/wan/vap_mot/train_multi_node.sh |
Quick start (CogVideoX-5B, single-node SFT)
bash examples/training/sft/cogvideox/vap_mot/train_single_node.sh
Quick start (Wan2.1-14B, single-node SFT)
bash examples/training/sft/wan/vap_mot/train_single_node.sh
Multi-node launch (example)
# 6 nodes
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 0
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 1
...
bash examples/training/sft/cogvideox/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 5
# or for Wan:
# examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 0
# examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 1
...
# examples/training/sft/wan/vap_mot/train_multi_node.sh xxx:xxx:xxx:xxx:xxx(MASTER_ADDR) 5
Notes
- CogVideoX supports SFT, DPO, and a ≤3-reference SFT variant; Wan currently supports standard SFT only.
- All scripts read shared config (datasets, output dir, batch size, etc.); edit the script to override.
- Please edit train_multi_node*.sh based on your environment if you want to change the distributed settings (e.g., GPU count, node count, master address/port).
Acknowledgements
We would like to thank the contributors to the Finetrainers, Diffusers, CogVideoX, and Wan repositories for their open research and exploration.