Update README.md

README.md (CHANGED)
@@ -1,7 +1,7 @@
 ---
 language:
 - en
-license:
+license: other
 pipeline_tag: text-to-video
 tags:
 - video-generation

@@ -124,4 +124,169 @@ CogVideoX is an open-source video generation model similar to [QingYing](https:/
</tr>
</table>

**Data Explanation**

Testing with the `diffusers` library enabled all optimizations included in the library. This scheme has not been tested on devices other than the NVIDIA A100 / H100, but it should generally work on all devices with the NVIDIA Ampere architecture or newer. Disabling the optimizations can triple VRAM usage while increasing speed by 3-4 times. You can selectively disable certain optimizations, including:

```
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

In multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
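
If you have enough GPU memory, or multiple GPUs, a minimal sketch of loading the pipeline without sequential CPU offload could look like the following; the `device_map="balanced"` option is an assumption about a recent `diffusers` release and is not part of the tested configuration above:

```python
# Minimal sketch: multi-GPU / large-VRAM inference WITHOUT enable_sequential_cpu_offload().
# Assumption: your diffusers version supports device_map="balanced" for pipelines.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
    device_map="balanced",  # shard the pipeline components across the visible GPUs
)
# Do NOT call pipe.enable_sequential_cpu_offload() here.
pipe.vae.enable_tiling()   # the VAE optimizations can still be enabled independently
```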

Using an INT8 model lowers the VRAM requirement enough for GPUs with less memory while keeping video quality degradation minimal, but at the cost of a significant reduction in inference speed.

[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, Transformer, and VAE modules, reducing CogVideoX's memory requirements and making it feasible to run the model on GPUs with smaller VRAM. TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8` precision requires an NVIDIA H100 or newer GPU and a source installation of `torch`, `torchao`, `diffusers`, and `accelerate`. `CUDA 12.4` is recommended.
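
As a rough illustration only, FP8 weight-only quantization of the transformer might look like this; it assumes a `torchao` build that provides `float8_weight_only`, and on pre-Hopper GPUs `int8_weight_only` would be used instead, as in the quantized example further below:

```python
# Hypothetical sketch: FP8 weight-only quantization of the transformer on an H100-class GPU.
# Assumption: your torchao build ships float8_weight_only; otherwise fall back to int8_weight_only.
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, float8_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX1.5-5B", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, float8_weight_only())  # quantize the linear-layer weights to FP8 in place
```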

Inference speed testing also used the above VRAM optimizations; without them, speed increases by about 10%. Only the `diffusers` versions of the models support quantization.

The models support English input only; prompts in other languages should be translated into English during prompt refinement with a large language model.

**Note**

Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Check our GitHub for more details.

## Getting Started Quickly 🤗

This model supports deployment using the Hugging Face `diffusers` library. You can follow the steps below to get started.

**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) to check out prompt optimization and conversion to get a better experience.**

1. Install the required dependencies

```shell
# diffusers (from source)
# transformers>=4.46.2
# accelerate>=1.1.1
# imageio-ffmpeg>=0.5.1
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
```

2. Run the code

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

## Quantized Inference

[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This allows the model to run on a free T4 Colab or GPUs with lower VRAM! Also, note that TorchAO quantization is fully compatible with `torch.compile`, which can significantly accelerate inference.

```python
# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
# Source and nightly installation is only required until the next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

quantization = int8_weight_only

# Load each component in bfloat16 and quantize it in place.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="text_encoder",
                                              torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="transformer",
                                                          torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the text-to-video pipeline from the quantized components and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
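
Because TorchAO quantization composes with `torch.compile`, the example above could be extended by compiling the quantized transformer before generation. The flags below are illustrative assumptions rather than settings validated for this model, and the snippet continues from the `pipe` and `prompt` defined above:

```python
# Optional follow-up sketch: compile the quantized transformer for faster repeated inference.
# Assumption: a recent PyTorch release; mode/fullgraph are illustrative and may need tuning
# (drop fullgraph=True if graph breaks occur). The first call is slow while compilation warms up.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe(prompt=prompt, num_inference_steps=50, num_frames=81, guidance_scale=6).frames[0]
export_to_video(video, "output_compiled.mp4", fps=8)
```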

Additionally, these models can be serialized and stored using PytorchAO in quantized data types to save disk space. You can find examples and benchmarks at the following links:

- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
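
As a rough sketch of that idea (the output path and serialization flags below are assumptions; see the linked gists for the tested recipes), a TorchAO-quantized transformer could be saved once and reloaded without re-quantizing:

```python
# Hypothetical sketch: save a TorchAO-quantized transformer and reload it later.
# Assumption: quantized tensors are not supported by safetensors, so pickle-based
# serialization is used instead; the directory name is illustrative.
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX1.5-5B", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Save the already-quantized weights so they do not have to be re-quantized on every load.
transformer.save_pretrained("cogvideox1.5-5b-transformer-int8", safe_serialization=False)

# Later: reload the quantized transformer and pass it to CogVideoXPipeline.from_pretrained(...).
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "cogvideox1.5-5b-transformer-int8", torch_dtype=torch.bfloat16, use_safetensors=False
)
```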

## Further Exploration

Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:

1. More detailed technical explanations and code.
2. Optimized prompt examples and conversions.
3. Detailed code for model inference and fine-tuning.
4. Project update logs and more interactive opportunities.
5. The CogVideoX toolchain to help you better use the model.
6. INT8 model inference code.

## Model License

This model is released under the [CogVideoX LICENSE](LICENSE).

## Citation

```
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
```