Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
For more results, visit our Project Page.
Updates
- [2025.07.11] The PyTorch model is now available.
- [2025.07.07] Ditto was accepted by ACM MM 2025.
- [2025.01.21] We updated the Colab demo; you are welcome to try it.
- [2025.01.10] We released our inference code and models.
- [2024.11.29] Our paper is publicly available on arXiv.
Installation
Tested Environment
- System: CentOS 7.2
- GPU: A100
- Python: 3.10
- TensorRT: 8.6.1
Clone the code from GitHub:
git clone https://github.com/antgroup/ditto-talkinghead
cd ditto-talkinghead
Conda
Create the conda environment:
conda env create -f environment.yaml
conda activate ditto
Pip
If you have problems creating the conda environment, you can also refer to our Colab.
Once PyTorch, CUDA, and cuDNN are correctly installed, you only need to install a few more packages with pip:
pip install \
    tensorrt==8.6.1 \
    librosa \
    tqdm \
    filetype \
    imageio \
    opencv_python_headless \
    scikit-image \
    cython \
    cuda-python \
    imageio-ffmpeg \
    colored \
    polygraphy \
    numpy==2.0.1
If you don't use conda, you may also need to install FFmpeg by following the instructions on its official website.
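After a pip-based install, it can help to confirm that every package resolves before running inference. The following is a minimal sketch, not part of the repository: the mapping from pip names to import names (for example `opencv_python_headless` to `cv2`) is our assumption, and the helper names `find_missing` and `ffmpeg_on_path` are hypothetical.

```python
import importlib.util
import shutil

# pip package name -> import name. The import names are assumptions based on
# how these packages are commonly imported; adjust if your versions differ.
REQUIRED_MODULES = {
    "tensorrt": "tensorrt",
    "librosa": "librosa",
    "tqdm": "tqdm",
    "filetype": "filetype",
    "imageio": "imageio",
    "opencv_python_headless": "cv2",
    "scikit-image": "skimage",
    "cython": "Cython",
    "cuda-python": "cuda",
    "imageio-ffmpeg": "imageio_ffmpeg",
    "colored": "colored",
    "polygraphy": "polygraphy",
    "numpy": "numpy",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip names whose import name cannot be located."""
    return [
        pip_name
        for pip_name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


def ffmpeg_on_path():
    """Report whether an ffmpeg executable is visible on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    missing = find_missing()
    print("Missing packages:", ", ".join(missing) if missing else "none")
    print("ffmpeg on PATH:", ffmpeg_on_path())
```

This only checks that the modules can be found; it does not verify versions such as `tensorrt==8.6.1` or `numpy==2.0.1`.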
Download Checkpoints
Download the checkpoints from HuggingFace and put them in the checkpoints directory:
git lfs install
git clone https://huggingface.co/digital-avatar/ditto-talkinghead checkpoints
The checkpoints directory should look like this:
./checkpoints/
├── ditto_cfg
│   ├── v0.4_hubert_cfg_trt.pkl
│   └── v0.4_hubert_cfg_trt_online.pkl
├── ditto_onnx
│   ├── appearance_extractor.onnx
│   ├── blaze_face.onnx
│   ├── decoder.onnx
│   ├── face_mesh.onnx
│   ├── hubert.onnx
│   ├── insightface_det.onnx
│   ├── landmark106.onnx
│   ├── landmark203.onnx
│   ├── libgrid_sample_3d_plugin.so
│   ├── lmdm_v0.4_hubert.onnx
│   ├── motion_extractor.onnx
│   ├── stitch_network.onnx
│   └── warp_network.onnx
└── ditto_trt_Ampere_Plus
    ├── appearance_extractor_fp16.engine
    ├── blaze_face_fp16.engine
    ├── decoder_fp16.engine
    ├── face_mesh_fp16.engine
    ├── hubert_fp32.engine
    ├── insightface_det_fp16.engine
    ├── landmark106_fp16.engine
    ├── landmark203_fp16.engine
    ├── lmdm_v0.4_hubert_fp32.engine
    ├── motion_extractor_fp32.engine
    ├── stitch_network_fp16.engine
    └── warp_network_fp16.engine
- ditto_cfg/v0.4_hubert_cfg_trt_online.pkl is the online config.
- ditto_cfg/v0.4_hubert_cfg_trt.pkl is the offline config.
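A partial `git lfs` download can leave the checkpoints directory incomplete, so a quick layout check may save a failed run. The sketch below rebuilds the expected file list from the directory listing above; the helper names `expected_files` and `missing_files` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Model names taken from the directory listing above.
ONNX_MODELS = [
    "appearance_extractor", "blaze_face", "decoder", "face_mesh",
    "hubert", "insightface_det", "landmark106", "landmark203",
    "lmdm_v0.4_hubert", "motion_extractor", "stitch_network", "warp_network",
]
# Engines that the listing shows with an fp32 suffix; the rest are fp16.
FP32_ENGINES = {"hubert", "lmdm_v0.4_hubert", "motion_extractor"}


def expected_files():
    """Build the relative paths expected under ./checkpoints/."""
    files = [
        "ditto_cfg/v0.4_hubert_cfg_trt.pkl",
        "ditto_cfg/v0.4_hubert_cfg_trt_online.pkl",
        "ditto_onnx/libgrid_sample_3d_plugin.so",
    ]
    for name in ONNX_MODELS:
        files.append(f"ditto_onnx/{name}.onnx")
        precision = "fp32" if name in FP32_ENGINES else "fp16"
        files.append(f"ditto_trt_Ampere_Plus/{name}_{precision}.engine")
    return files


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in expected_files() if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files("./checkpoints"):
        print("missing:", rel)
```

The check is existence-only. It does not detect files that are still Git LFS pointer stubs, so a size check could be added if that is a concern.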
Inference
Run inference.py:
python inference.py \
    --data_root "<path-to-trt-model>" \
    --cfg_pkl "<path-to-cfg-pkl>" \
    --audio_path "<path-to-input-audio>" \
    --source_path "<path-to-input-image>" \
    --output_path "<path-to-output-mp4>" 
For example:
python inference.py \
    --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4" 
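To render several audio clips against the same portrait, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: it uses only the five flags shown above, and the function names `build_command` and `run_batch`, the `dry_run` parameter, and the `*.wav` glob are our own illustrative choices.

```python
import subprocess
import sys
from pathlib import Path


def build_command(data_root, cfg_pkl, audio_path, source_path, output_path):
    """Assemble an inference.py invocation from the documented flags."""
    return [
        sys.executable, "inference.py",
        "--data_root", str(data_root),
        "--cfg_pkl", str(cfg_pkl),
        "--audio_path", str(audio_path),
        "--source_path", str(source_path),
        "--output_path", str(output_path),
    ]


def run_batch(
    audio_dir,
    source_path,
    output_dir,
    data_root="./checkpoints/ditto_trt_Ampere_Plus",
    cfg_pkl="./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl",
    dry_run=False,
):
    """Run inference.py once per .wav file; return the commands that were built."""
    output_dir = Path(output_dir)
    commands = []
    for audio in sorted(Path(audio_dir).glob("*.wav")):
        output_path = output_dir / f"{audio.stem}.mp4"
        cmd = build_command(data_root, cfg_pkl, audio, source_path, output_path)
        commands.append(cmd)
        if not dry_run:
            output_dir.mkdir(parents=True, exist_ok=True)
            # check=True stops the batch at the first failing clip.
            subprocess.run(cmd, check=True)
    return commands
```

For example, `run_batch("./example", "./example/image.png", "./tmp", dry_run=True)` returns the commands without executing them. Each invocation starts a new process and so reloads the models; for large batches, calling the pipeline from within one Python process would likely be faster, but that depends on repository internals not covered here.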
Note:
We provide TensorRT engines built with hardware-compatibility-level=Ampere_Plus (checkpoints/ditto_trt_Ampere_Plus/). If your GPU does not support this level, run the cvt_onnx_to_trt.py script to convert the general ONNX models (checkpoints/ditto_onnx/) into TensorRT engines for your hardware:
python scripts/cvt_onnx_to_trt.py --onnx_dir "./checkpoints/ditto_onnx" --trt_dir "./checkpoints/ditto_trt_custom"
Then run inference.py with --data_root=./checkpoints/ditto_trt_custom.
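If you are unsure whether your GPU can use the prebuilt engines, its compute capability gives a rough answer. The sketch below assumes that Ampere_Plus corresponds to compute capability 8.0 and above (Ampere and newer architectures); that threshold and the helper names are our assumptions, not something the repository states. PyTorch is used only for detection and is optional.

```python
def supports_ampere_plus(compute_capability):
    """Return True if a (major, minor) compute capability is Ampere or newer.

    Assumption: Ampere_Plus means compute capability 8.0 and above.
    """
    major, _minor = compute_capability
    return major >= 8


def detect_compute_capability(device_index=0):
    """Return (major, minor) for a CUDA device, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = detect_compute_capability()
    if capability is None:
        print("Could not query the GPU; check your PyTorch and CUDA install.")
    elif supports_ampere_plus(capability):
        print(f"Compute capability {capability}: prebuilt engines should apply.")
    else:
        print(f"Compute capability {capability}: convert the ONNX models instead.")
```

Treat the result as a hint only. If loading a prebuilt engine fails for any other reason, converting from ONNX as described above is the fallback.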
PyTorch Model
Based on community interest and to better support further development, we are now open-sourcing the PyTorch version of the model.
We have added the PyTorch model and its configuration file to the HuggingFace repository. Please refer to Download Checkpoints to prepare the model files.
The checkpoints directory should look like this:
./checkpoints/
├── ditto_cfg
│   ├── ...
│   └── v0.4_hubert_cfg_pytorch.pkl
├── ...
└── ditto_pytorch
    ├── aux_models
    │   ├── 2d106det.onnx
    │   ├── det_10g.onnx
    │   ├── face_landmarker.task
    │   ├── hubert_streaming_fix_kv.onnx
    │   └── landmark203.onnx
    └── models
        ├── appearance_extractor.pth
        ├── decoder.pth
        ├── lmdm_v0.4_hubert.pth
        ├── motion_extractor.pth
        ├── stitch_network.pth
        └── warp_network.pth
To run inference, execute the following command:
python inference.py \
    --data_root "./checkpoints/ditto_pytorch" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4" 
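The TensorRT and PyTorch runs differ only in the `--data_root` and `--cfg_pkl` pair, so a script that supports both can look the pair up by name. A minimal sketch follows; the backend labels `"trt"` and `"pytorch"` and the function `resolve_backend` are hypothetical, while the paths are the ones used in the example commands in this README.

```python
from pathlib import Path

# backend label -> (model directory, config file), as used in the examples above.
BACKENDS = {
    "trt": ("ditto_trt_Ampere_Plus", "v0.4_hubert_cfg_trt.pkl"),
    "pytorch": ("ditto_pytorch", "v0.4_hubert_cfg_pytorch.pkl"),
}


def resolve_backend(name, checkpoints="./checkpoints"):
    """Return (data_root, cfg_pkl) paths for a backend label."""
    try:
        model_dir, cfg_name = BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown backend {name!r}; expected one of: {known}") from None
    root = Path(checkpoints)
    return root / model_dir, root / "ditto_cfg" / cfg_name


if __name__ == "__main__":
    data_root, cfg_pkl = resolve_backend("pytorch")
    print("--data_root", data_root)
    print("--cfg_pkl", cfg_pkl)
```

Keeping the pair together avoids mixing a TensorRT model directory with the PyTorch config, or the reverse.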
Acknowledgement
Our implementation is based on S2G-MDDiffusion and LivePortrait. We thank the authors for their remarkable contributions and released code. If we have missed any open-source projects or related articles, we will gladly add them to this acknowledgement right away.
License
This repository is released under the Apache-2.0 license as found in the LICENSE file.
Citation
If you find this codebase useful for your research, please cite it using the following entry.
@article{li2024ditto,
    title={Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis},
    author={Li, Tianqi and Zheng, Ruobing and Yang, Minghui and Chen, Jingdong and Yang, Ming},
    journal={arXiv preprint arXiv:2411.19509},
    year={2024}
}