SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis (MICCAI 2025 - ORAL)
Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
Key Features
- First diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control.
- Outperforms previous methods both qualitatively and quantitatively, while enabling precise synthesis with accurate control over the size and movement of tools and anatomy, the entrance of new tools, and the overall scene layout.
- We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve performance on a downstream phase detection task.
- To showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.
This framework provides training scripts for the video diffusion model, supporting both unconditional and conditional training using signals such as the initial frame, scene graph, and text. Feel free to use our work for comparisons and to cite it!
Setup
git clone https://github.com/MECLabTUDA/SG2VID.git
cd SG2VID
conda env create -f environment.yaml
conda activate sg2vid
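After activating the environment, a quick sanity check can save time later (a minimal sketch; it only assumes the environment ships PyTorch and Diffusers, which the training and conversion steps below rely on):

```bash
# Verify that the core dependencies resolve inside the sg2vid environment.
python -c "import torch; print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import diffusers; print('diffusers', diffusers.__version__)"
```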
Model Checkpoints and Dataset
Download the checkpoints of all the necessary models from the provided sources and place them in [checkpoints](./checkpoints). We also provide the processed CATARACTS, Cataract-1K, and Cholec80 datasets, containing images, segmentation masks, and their scene graphs. Update the dataset paths in [configs](./configs).
- Checkpoints: VAEs, Graph Encoders, Video Diffusion Models
- Processed Dataset: Frames, Segmentation Masks, Scene Graphs
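To locate the dataset path entries mentioned above, a recursive search over the configs is the quickest route (a sketch only; the key pattern below is a guess, so check the matches against the actual config files):

```bash
# List candidate dataset-path entries across all configs; edit the matches
# to point at wherever you placed the downloaded data.
grep -rniE "root|path|dir" configs/
```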
Sampling Videos with SG2VID
Conditioned on the initial frame and graph
python sample.py --inference_config ./configs/inference/inference_img_graph_<dataset_name>.yaml
Conditioned on the graph only
python sample.py --inference_config ./configs/inference/inference_ximg_graph_<dataset_name>.yaml
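To sample both conditioning modes across all datasets in one go, a simple loop over the config naming pattern works (the dataset names below are placeholders; check configs/inference/ for the exact file suffixes):

```bash
# Sample with image+graph and graph-only conditioning for each dataset.
# Replace the dataset list with the actual <dataset_name> suffixes in configs/inference/.
for ds in cataracts cataract1k cholec80; do
  for mode in img_graph ximg_graph; do
    python sample.py --inference_config ./configs/inference/inference_${mode}_${ds}.yaml
  done
done
```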
Training SG2VID
Step 1: Train Image VQGAN and Segmentation VQGAN (For Graph Encoders)
python sg2vid/taming/main.py --base configs/vae/config_image_autoencoder_vqgan_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
python sg2vid/taming/main.py --base configs/vae/config_segmentation_autoencoder_vqgan_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
Step 2: Train Another VAE (For Video Diffusion Model)
python sg2vid/ldm/main.py --base configs/vae/config_autoencoderkl_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
# Converting a CompVis VAE to Diffusers VAE Format
# IMPORTANT: First update Diffusers to version 0.31.0, then downgrade back to 0.21.2
python scripts/ae_compvis_to_diffuser.py \
--vae_pt_path /path/to/checkpoints/last.ckpt \
--dump_path /path/to/save/vae_vid_diffusion
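One way to handle the version switch noted above, assuming a pip-managed environment (the checkpoint and output paths are placeholders, exactly as in the command above):

```bash
# Temporarily upgrade Diffusers for the conversion, then restore the pinned version.
pip install diffusers==0.31.0
python scripts/ae_compvis_to_diffuser.py \
    --vae_pt_path /path/to/checkpoints/last.ckpt \
    --dump_path /path/to/save/vae_vid_diffusion
pip install diffusers==0.21.2
```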
Step 3: Train Both Graph Encoders
python train_graph.py --name masked --config configs/graph/graph_<dataset_name>.yaml
python train_graph.py --name segclip --config configs/graph/graph_<dataset_name>.yaml
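Since the two runs differ only in the --name flag, they can be chained in a single call (a convenience sketch of the two commands above):

```bash
# Train the masked and segclip graph encoders back to back.
for name in masked segclip; do
  python train_graph.py --name ${name} --config configs/graph/graph_<dataset_name>.yaml
done
```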
Step 4: Train Video Diffusion Model
Single-GPU Setup
python train.py --config configs/training/training_<cond_type>_<dataset_name>.yaml -n sg2vid_training
Multi-GPU Setup (Single Node)
python -m torch.distributed.run \
--nproc_per_node=${GPU_PER_NODE} \
--master_addr=127.0.0.1 \
--master_port=29501 \
--nnodes=1 \
--node_rank=0 \
train.py \
--config configs/training/training_<cond_type>_<dataset_name>.yaml \
-n sg2vid_training
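GPU_PER_NODE is not set by the launch command itself; export it to the number of GPUs available on the node before running the command above (4 is just an example):

```bash
# Number of GPUs to use on this node for the torch.distributed.run launch above.
export GPU_PER_NODE=4
```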
Training Unconditional Video Diffusion Model
Single-GPU Setup
python train.py --config configs/training/training_unconditional_<dataset_name>.yaml -n sg2vid_training
Citations
If you use SG2VID in your work, please cite the following paper:
@article{sivakumar2025sg2vid,
title={SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis},
author={Sivakumar, Ssharvien Kumar and Frisch, Yannik and Ghazaei, Ghazal and Mukhopadhyay, Anirban},
journal={arXiv preprint arXiv:2506.03082},
year={2025}
}
Acknowledgement
Thanks to the following projects and theoretical works that we have either used or drawn inspiration from: