# Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen† († corresponding author)
## Key Features
- Uni-directional Temporal Attention with Warmup Mechanism (see the sketch after this list)
- Multi-timestep KV-Cache for Temporal Attention during Inference
- Depth Prior for Better Structural Consistency
- Compatible with DreamBooth and LoRA for Various Styles
- TensorRT Supported
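The first two features can be illustrated with a short sketch. The module below is a minimal, hypothetical illustration, not the released implementation: each new frame's query attends only to the cached keys/values of warmup and past frames, so no future frames are needed. Live2Diff keeps one such cache per denoising timestep (the multi-timestep KV-cache); this sketch shows a single timestep for brevity, and names such as `StreamingTemporalAttention`, `warmup_frames`, and `cache_size` are illustrative.

```python
import torch
import torch.nn.functional as F


class StreamingTemporalAttention(torch.nn.Module):
    """Illustrative only: causal temporal attention over a rolling KV-cache."""

    def __init__(self, dim: int, num_heads: int = 8,
                 warmup_frames: int = 8, cache_size: int = 16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.warmup_frames = warmup_frames
        self.cache_size = cache_size
        self.to_qkv = torch.nn.Linear(dim, dim * 3, bias=False)
        self.to_out = torch.nn.Linear(dim, dim)
        self.k_cache = []  # keys of warmup + recent frames, one entry per frame
        self.v_cache = []

    def forward(self, x):
        # x: (batch * spatial, 1, dim) -- features of the newest frame only.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Cache the new frame's K/V; when full, evict the oldest non-warmup
        # frame so the warmup frames remain available as global context.
        self.k_cache.append(k)
        self.v_cache.append(v)
        if len(self.k_cache) > self.cache_size:
            self.k_cache.pop(self.warmup_frames)
            self.v_cache.pop(self.warmup_frames)
        keys = torch.cat(self.k_cache, dim=1)    # (B*S, T_cached, dim)
        values = torch.cat(self.v_cache, dim=1)

        def heads(t):  # (B*S, T, dim) -> (B*S, num_heads, T, head_dim)
            b, n, _ = t.shape
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Uni-directional by construction: the query sees only past frames,
        # so the model never waits for future input -- enabling live streams.
        out = F.scaled_dot_product_attention(heads(q), heads(keys), heads(values))
        out = out.transpose(1, 2).reshape(x.shape[0], 1, -1)
        return self.to_out(out)
```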
The speed evaluation is conducted on Ubuntu 20.04.6 LTS with PyTorch 2.2.2, an RTX 4090 GPU, and an Intel(R) Xeon(R) Platinum 8352V CPU. The number of denoising steps is set to 2.
| Resolution | TensorRT | FPS |
|---|---|---|
| 512 x 512 | On | 16.43 |
| 512 x 512 | Off | 6.91 |
| 768 x 512 | On | 12.15 |
| 768 x 512 | Off | 6.29 |
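For reference, throughput numbers like those above can be measured with a simple timing loop. The helper below is a generic benchmarking sketch, not part of the Live2Diff codebase; `pipe` stands for any callable that maps one input frame to one stylized output frame:

```python
import time

import torch


def measure_fps(pipe, frames, warmup_iters: int = 10) -> float:
    """Average frames-per-second of `pipe` over a stream of frames.

    `pipe` is assumed to be a callable mapping one input frame to one
    output frame (e.g. a wrapped diffusion pipeline); this is a generic
    sketch, not a Live2Diff API.
    """
    for frame in frames[:warmup_iters]:   # warm up kernels / TensorRT engines
        pipe(frame)
    torch.cuda.synchronize()              # flush pending GPU work
    start = time.perf_counter()
    for frame in frames[warmup_iters:]:
        pipe(frame)
    torch.cuda.synchronize()              # stop the clock only after the GPU finishes
    return (len(frames) - warmup_iters) / (time.perf_counter() - start)
```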
## Real-Time Video2Video Demo
Two demo settings are shown:

- Human Face (Web Camera Input)
- Anime Character (Screen Video Input)
## Acknowledgements
The video and image demos in this GitHub repository were generated using LCM-LoRA. The stream batch technique from StreamDiffusion is used for model acceleration. The design of the video diffusion model is adapted from AnimateDiff. We use a third-party MiDaS implementation that supports ONNX export. Our online demo is modified from Real-Time-Latent-Consistency-Model.
## BibTeX
If you find this work helpful, please consider citing it:
```bibtex
@article{xing2024live2diff,
  title={Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models},
  author={Zhening Xing and Gereon Fox and Yanhong Zeng and Xingang Pan and Mohamed Elgharib and Christian Theobalt and Kai Chen},
  journal={arXiv preprint arXiv:2407.08701},
  year={2024}
}
```