Emu3.5: Native Multimodal Models are World Learners 🌍
Emu3.5 Team, BAAI
Project Page | 🤗HF Models | Paper | Code
| 🔹 | Core Concept | Description |
|---|---|---|
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. |
| 📚 | 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. |
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss (see the sketch after this table). |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration, and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing, and outperforms on interleaved generation tasks. |
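To make the DiDA row above concrete, below is a minimal, purely conceptual sketch contrasting standard autoregressive decoding with masked parallel bidirectional prediction. This is not the Emu3.5 implementation: `model` is a stand-in callable returning logits, and the step count and confidence-based re-masking schedule are illustrative assumptions.

```python
import torch

def autoregressive_decode(model, prefix, num_tokens):
    """Baseline: one forward pass per generated token (sequential)."""
    tokens = prefix
    for _ in range(num_tokens):
        logits = model(tokens)                      # [1, seq_len, vocab]
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

def parallel_bidirectional_decode(model, prefix, num_tokens, steps=8, mask_id=0):
    """DiDA-style idea (illustrative): start from fully masked image tokens and
    refine them all at once over a small, fixed number of bidirectional passes,
    keeping high-confidence predictions and re-masking the rest."""
    canvas = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(torch.cat([prefix, canvas], dim=1))[:, -num_tokens:]
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence
        keep = max(1, (num_tokens * (step + 1)) // steps)
        threshold = conf.topk(keep, dim=-1).values[:, -1:]
        # Keep the most confident predictions, re-mask the rest for the next pass.
        canvas = torch.where(conf >= threshold, pred, torch.full_like(pred, mask_id))
    return torch.cat([prefix, canvas], dim=1)
```

The contrast is the cost model: the autoregressive loop needs one forward pass per image token, whereas the parallel loop needs only `steps` passes regardless of `num_tokens`, which is where a speedup on the order of 20× comes from.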
Emu3.5 handles general tasks (including interleaved generation and image generation/editing), while Emu3.5-Image focuses on high-quality image generation/editing.
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
Edit configs/config.py to set:
- model_path, vq_path
- task_type in {t2i, x2i, howto, story, explore, vla}
- use_image (True to provide reference images; controls the <|IMAGE|> token); set reference_image in each prompt to specify the image path.
- sampling_params (classifier_free_guidance, temperature, top_k/top_p, etc.)

Then run:
python inference.py --cfg configs/config.py
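For orientation, a minimal configs/config.py might look roughly like the sketch below. The field names follow the list above, while the concrete values (paths, guidance scale, top_k/top_p) and the prompt structure are placeholders rather than recommended settings.

```python
# Illustrative sketch of configs/config.py -- all values are placeholders.
model_path = "path/to/Emu3.5"          # or path/to/Emu3.5-Image for image generation/editing
vq_path = "path/to/vision_tokenizer"   # visual tokenizer checkpoint
task_type = "t2i"                      # one of: t2i, x2i, howto, story, explore, vla
use_image = False                      # True to feed reference images via the <|IMAGE|> token
sampling_params = dict(
    classifier_free_guidance=3.0,      # placeholder value
    temperature=1.0,
    top_k=2048,
    top_p=0.9,
)
prompts = [
    dict(
        text="a watercolor painting of a lighthouse at dusk",
        reference_image=None,          # set an image path here when use_image=True
    ),
]
```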
Protobuf outputs are written to outputs/<exp_name>/proto/. For better throughput, we recommend ≥2 GPUs.
To visualize generated protobuf files:
python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
@misc{cui2025emu35nativemultimodalmodels,
title={Emu3.5: Native Multimodal Models are World Learners},
author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
year={2025},
eprint={2510.26583},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.26583},
}