MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
Catnip AI Team
| Project | mainecoon.tech |
| Experience | Try it live |
| Paper (arXiv) | arXiv:2606.17800 |
| Blog | Technical Blog |
| GitHub | catnip-ai-tech/MaineCoon |
Abstract
As video content is increasingly consumed on social platforms, video generation models built for social worlds are important but largely overlooked. We present MaineCoon, the first real-time audio-visual autoregressive model. With 22B parameters, it is capable of real-time streaming generation and sub-second interaction, achieving a record-breaking frame rate of up to 47.5 FPS on a single GPU.
MaineCoon is optimized for social-interactive applications using several novel techniques: self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also introduce an agentic streaming inference framework that supports thousand-second-scale generation while mitigating drift.
Highlights
- Real-time on a single GPU. Capable of streaming generation and sub-second interaction on a single H100.
- A new paradigm: social world models. Serves as the first generative core for social world models, a foundation for next-generation AI-native social platforms.
- Forcing-free streaming training. Multi-stage training enabling native, efficient streaming audio-visual training at 22B scale.
- Agentic streaming inference. Supports long-horizon generation through agentic cache management and prompt planning.
- SocialVideo-Bench. A new benchmark where MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed.
Benchmark: SocialVideo-Bench
Main quantitative results on SocialVideo-Bench. 🐱 MaineCoon (Ours) achieves the best average score and wins most metrics, including Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS).
| Type | Model | Vis↑ | Mot↑ | Aud↑ | AV-Al↑ | AVH↑ | JAVIS↑ | Average↑ |
|---|---|---|---|---|---|---|---|---|
| Bidirectional T2AV | LTX-2.3 | 4.10 | 0.99 | 4.06 | 0.334 | 0.287 | 0.247 | 0.848 |
| Streaming TA2V | SoulX-FlashTalk | 4.65 | 1.99 | 4.07 | 0.279 | 0.283 | 0.238 | 0.895 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 4.71 | 1.62 | 4.35 | 0.334 | 0.308 | 0.272 | 0.934 🥇 |
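The claim that MaineCoon "wins most metrics" can be checked directly against the rows shown above. The snippet below is a minimal sketch that assumes higher is better in every column and only covers the three rows listed here; the helper name `best_models` is illustrative and not part of any MaineCoon tooling.

```python
# Scores copied from the SocialVideo-Bench rows above.
# Assumption: higher is better for every metric.
metrics = ["Vis", "Mot", "Aud", "AV-Al", "AVH", "JAVIS", "Average"]
rows = {
    "LTX-2.3": [4.10, 0.99, 4.06, 0.334, 0.287, 0.247, 0.848],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.279, 0.283, 0.238, 0.895],
    "MaineCoon": [4.71, 1.62, 4.35, 0.334, 0.308, 0.272, 0.934],
}


def best_models(metric: str) -> list[str]:
    """Return every model that reaches the top score on `metric` (ties included)."""
    i = metrics.index(metric)
    top = max(scores[i] for scores in rows.values())
    return [name for name, scores in rows.items() if scores[i] == top]


# Metrics on which MaineCoon is best or tied for best among the listed rows.
wins = [m for m in metrics if "MaineCoon" in best_models(m)]
print(wins)
```

Among these rows, MaineCoon leads or ties on every column except Mot, where SoulX-FlashTalk scores higher, and it ties LTX-2.3 on AV-Al.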
Latency and model size comparison. Sampling throughput (FPS) measured for 480P 20-second generation on a single H100 GPU.
| Type | Model | Params | FPS↑ |
|---|---|---|---|
| Bidirectional T2AV | LTX-2.3-Distilled | 22B | 20.7 |
| Streaming T2V | Causal-Forcing | 1.3B | 19.1 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 22B | 47.5 🥇 |
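To put the throughput numbers in perspective, the sketch below derives relative speedups from the table and estimates the wall-clock time for a clip. The playback frame rate is not stated in this card, so the `playback_fps` argument (25 in the usage line) is a purely hypothetical value, and both helper names are illustrative.

```python
# Sampling throughput copied from the table above (480P, 20-second clip, one H100).
fps = {
    "LTX-2.3-Distilled": 20.7,
    "Causal-Forcing": 19.1,
    "MaineCoon": 47.5,
}


def speedup_over(baseline: str, model: str = "MaineCoon") -> float:
    """How many times faster `model` samples frames than `baseline`."""
    return fps[model] / fps[baseline]


def seconds_to_generate(model: str, clip_seconds: float, playback_fps: float) -> float:
    """Wall-clock seconds to sample a clip; `playback_fps` is an assumed value."""
    total_frames = clip_seconds * playback_fps
    return total_frames / fps[model]


for name in ("LTX-2.3-Distilled", "Causal-Forcing"):
    print(f"MaineCoon vs {name}: {speedup_over(name):.2f}x")

# Hypothetical 25 FPS playback: generation keeps up with playback whenever
# the sampling throughput is at least the playback rate.
print(f"{seconds_to_generate('MaineCoon', 20, 25):.1f} s for a 20 s clip")
```

Under that assumed playback rate, any model whose sampling FPS exceeds 25 finishes a clip faster than it plays back, which is the sense of "real-time" used in this sketch.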
Acknowledgements
MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:
- 🎬 LTX-2.3 & the LTX series (Lightricks). MaineCoon's audio-visual backbone builds on the excellent open LTX-2.3 model.
- ⚡ DMD series. Our reinforced online-policy distillation (ROPD) builds on the Distribution Matching Distillation (DMD / DMD2) line of work.
Citation
@article{catnip2026mainecoon,
title = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
author = {Lichen Bai and Tianhao Zhang and Shitong Shao and Dingwei Tan and Qiyu Zhong and Zhengpeng Xie and Haopeng Li and Qinghao Huang and Dandan Shen and Tengjiao Ji and Wei Wang and Peicheng Wu and Yuxuan Zhao and Xiangyu Zhu and Welly Luo and Shurui Yang and Zeke Xie},
year = {2026},
journal = {arXiv preprint arXiv:2606.17800},
url = {https://arxiv.org/abs/2606.17800}
}