MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Catnip AI Team

๐ŸŒ Project mainecoon.tech
๐Ÿ•น๏ธ Experience Try it live
๐Ÿ“„ Paper (arXiv) arXiv:2606.17800
๐Ÿ“ Blog Technical Blog
๐Ÿ’ป GitHub catnip-ai-tech/MaineCoon

Abstract

As video content is increasingly consumed on social platforms, video generation models built for social worlds are important but largely overlooked. We present MaineCoon, the first real-time audio-visual autoregressive model. With 22B parameters, it is capable of real-time streaming generation and sub-second interaction, achieving a record-breaking frame rate of up to 47.5 FPS on a single GPU.

MaineCoon is optimized for social-interactive applications using several novel techniques: self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also introduce an agentic streaming inference framework that supports thousand-second-scale generation while mitigating drift.

Highlights

  • ⚡ Real-time on a single GPU. Capable of streaming generation and sub-second interaction on a single H100.
  • 🌐 A new paradigm: social world models. Serves as the first generative core for social world models, a foundation for next-generation AI-native social platforms.
  • 🎓 Forcing-free streaming training. Multi-stage training enabling native, efficient streaming audio-visual training at 22B scale.
  • 🧠 Agentic streaming inference. Supports long-horizon generation through agentic cache management and prompt planning.
  • 📊 SocialVideo-Bench. A new benchmark where MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed.

Benchmark: SocialVideo-Bench

Main quantitative results on SocialVideo-Bench. 🐱 MaineCoon (Ours) achieves the best average score and wins most metrics, including Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS).

| Type | Model | Vis↑ | Mot↑ | Aud↑ | AV-Al↑ | AVH↑ | JAVIS↑ | Average↑ |
|---|---|---|---|---|---|---|---|---|
| Bidirectional T2AV | LTX-2.3 | 4.10 | 0.99 | 4.06 | 0.334 | 0.287 | 0.247 | 0.848 |
| Streaming TA2V | SoulX-FlashTalk | 4.65 | 1.99 | 4.07 | 0.279 | 0.283 | 0.238 | 0.895 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 4.71 | 1.62 | 4.35 | 0.334 | 0.308 | 0.272 | 0.934 🥇 |
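
As a quick sanity check on the caption, the rows listed above can be compared column by column. This is a minimal Python sketch, not part of the benchmark tooling: the dictionary layout and the `best_models` helper are illustrative assumptions, and only the three rows shown here are included, so it says nothing about the other baselines in SocialVideo-Bench.

```python
# Scores copied from the rows listed above (higher is better in every column).
METRICS = ["Vis", "Mot", "Aud", "AV-Al", "AVH", "JAVIS", "Average"]
SCORES = {
    "LTX-2.3":         [4.10, 0.99, 4.06, 0.334, 0.287, 0.247, 0.848],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.279, 0.283, 0.238, 0.895],
    "MaineCoon":       [4.71, 1.62, 4.35, 0.334, 0.308, 0.272, 0.934],
}

def best_models(metric: str) -> list[str]:
    """Return every listed model that reaches the top score on `metric` (ties included)."""
    col = METRICS.index(metric)
    top = max(row[col] for row in SCORES.values())
    return [name for name, row in SCORES.items() if row[col] == top]

# Columns where MaineCoon is the best listed model or shares the top score.
wins = [m for m in METRICS if "MaineCoon" in best_models(m)]
print(f"MaineCoon is best or tied on {len(wins)} of {len(METRICS)} columns: {wins}")
```

Among these three rows, Mot is the only column led by another model (SoulX-FlashTalk), and AV-Al is a tie with LTX-2.3.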

Latency and model size comparison. Sampling throughput (FPS) is measured for 20-second 480p generation on a single H100 GPU.

| Type | Model | Params | FPS↑ |
|---|---|---|---|
| Bidirectional T2AV | LTX-2.3-Distilled | 22B | 20.7 |
| Streaming T2V | Causal-Forcing | 1.3B | 19.1 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 22B | 47.5 🥇 |
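
To put the throughput numbers in more familiar terms, the short Python sketch below converts each FPS figure into an average time per generated frame and a speed ratio relative to MaineCoon. The helper names `ms_per_frame` and `speedup` are illustrative assumptions; the FPS values are the ones listed above, and the per-frame time is a simple average that ignores any startup cost.

```python
# Sampling throughput (FPS) copied from the rows listed above.
FPS = {
    "LTX-2.3-Distilled": 20.7,
    "Causal-Forcing": 19.1,
    "MaineCoon": 47.5,
}

def ms_per_frame(model: str) -> float:
    """Average wall-clock time per generated frame, in milliseconds."""
    return 1000.0 / FPS[model]

def speedup(model: str, baseline: str) -> float:
    """How many times higher `model`'s throughput is than `baseline`'s."""
    return FPS[model] / FPS[baseline]

for name in FPS:
    print(f"{name}: {ms_per_frame(name):.1f} ms/frame; "
          f"MaineCoon throughput is {speedup('MaineCoon', name):.2f}x this model's")
```

On these figures, MaineCoon spends roughly 21 ms per frame, which is about 2.3x the throughput of LTX-2.3-Distilled and about 2.5x that of Causal-Forcing.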

Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

  • 🎬 LTX-2.3 & the LTX series (Lightricks). MaineCoon's audio-visual backbone builds on the excellent open LTX-2.3 model.
  • ⚡ DMD series. Our reinforced online-policy distillation (ROPD) builds on the Distribution Matching Distillation (DMD / DMD2) line of work.

Citation

@article{catnip2026mainecoon,
  title        = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author       = {Lichen Bai and Tianhao Zhang and Shitong Shao and Dingwei Tan and Qiyu Zhong and Zhengpeng Xie and Haopeng Li and Qinghao Huang and Dandan Shen and Tengjiao Ji and Wei Wang and Peicheng Wu and Yuxuan Zhao and Xiangyu Zhu and Welly Luo and Shurui Yang and Zeke Xie},
  year         = {2026},
  journal      = {arXiv preprint arXiv:2606.17800},
  url          = {https://arxiv.org/abs/2606.17800}
}