MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Catnip AI Team


🌐 Project	mainecoon.tech
🕹️ Experience	Try it live
📄 Paper (arXiv)	arXiv:2606.17800
📝 Blog	Technical Blog
💻 GitHub	catnip-ai-tech/MaineCoon

Abstract

As video content is increasingly consumed on social platforms, video generation models built for social worlds are important but largely overlooked. We present MaineCoon, the first real-time audio-visual autoregressive model. With 22B parameters, it supports real-time streaming generation and sub-second interaction, reaching up to 47.5 FPS on a single H100 GPU (measured for 480P, 20-second generation).

MaineCoon is optimized for social-interactive applications using several novel techniques: self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also introduce an agentic streaming inference framework that supports thousand-second-scale generation while mitigating drift.

Highlights

⚡ Real-time on a single GPU. Capable of streaming generation and sub-second interaction on a single H100.
🌍 A new paradigm: social world models. Serves as the first generative core for social world models, a foundation for next-generation AI-native social platforms.
🎓 Forcing-free streaming training. A multi-stage recipe that enables native, efficient streaming audio-visual training at 22B scale.
🧠 Agentic streaming inference. Supports long-horizon generation through agentic cache management and prompt planning.
📊 SocialVideo-Bench. A new benchmark where MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed.
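
This card does not describe how the agentic cache management or prompt planning works internally. As a rough illustration of why a bounded cache keeps long-horizon streaming tractable, the sketch below uses a generic "anchor chunks plus sliding window" pattern with a time-based prompt schedule. It is not MaineCoon's method: `RollingCache`, `plan_prompt`, `stream`, the window sizes, and the prompts are all hypothetical names and values chosen for the example, and no model is actually run.

```python
from collections import deque


class RollingCache:
    """Toy stand-in for a key/value cache (hypothetical, not MaineCoon's API)."""

    def __init__(self, num_anchor, num_recent):
        self.num_anchor = num_anchor            # earliest chunks kept permanently
        self.anchors = []
        self.recent = deque(maxlen=num_recent)  # sliding window of latest chunks

    def append(self, chunk_id):
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(chunk_id)
        else:
            self.recent.append(chunk_id)  # the deque evicts its oldest entry

    def context(self):
        return self.anchors + list(self.recent)


def plan_prompt(schedule, t):
    """Pick the prompt with the latest start time <= t (schedule sorted by start)."""
    current = schedule[0][1]
    for start, prompt in schedule:
        if start <= t:
            current = prompt
    return current


def stream(num_chunks, chunk_seconds, schedule, cache):
    """Yield (chunk_id, prompt, context size) for each generated chunk."""
    for chunk_id in range(num_chunks):
        prompt = plan_prompt(schedule, chunk_id * chunk_seconds)
        cache.append(chunk_id)
        yield chunk_id, prompt, len(cache.context())


# Illustrative values only: 1000 one-second chunks, prompt switch at t = 600 s.
SCHEDULE = [(0, "two friends greet each other"), (600, "they start cooking")]
CACHE = RollingCache(num_anchor=2, num_recent=6)
LOG = list(stream(num_chunks=1000, chunk_seconds=1, schedule=SCHEDULE, cache=CACHE))
print(LOG[-1], CACHE.context())
```

In this toy setup the context never grows beyond `num_anchor + num_recent` chunks, so per-chunk cost stays constant however long the stream runs. A real system would also need to decide which chunks to keep and when to replan prompts.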

Benchmark — SocialVideo-Bench

Main quantitative results on SocialVideo-Bench. 🐱 MaineCoon (Ours) achieves the best average score and wins most metrics, including Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS).

Type	Model	Vis↑	Mot↑	Aud↑	AV-Al↑	AVH↑	JAVIS↑	Average↑
Bidirectional T2AV	LTX-2.3	4.10	0.99	4.06	0.334	0.287	0.247	0.848
Streaming TA2V	SoulX-FlashTalk	4.65	1.99	4.07	0.279	0.283	0.238	0.895
Streaming T2AV	🐱 MaineCoon (Ours)	4.71	1.62	4.35	0.334	0.308	0.272	0.934 🥇
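
To make the per-metric comparison easy to check, the sketch below copies the three rows shown above and reports which model reaches the top score in each column, keeping ties. The helper name `leaders` is an assumption introduced for this example, and only the rows listed in this card are covered.

```python
# Rows copied from the SocialVideo-Bench table above (higher is better everywhere).
METRICS = ["Vis", "Mot", "Aud", "AV-Al", "AVH", "JAVIS", "Average"]
SCORES = {
    "LTX-2.3": [4.10, 0.99, 4.06, 0.334, 0.287, 0.247, 0.848],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.279, 0.283, 0.238, 0.895],
    "MaineCoon": [4.71, 1.62, 4.35, 0.334, 0.308, 0.272, 0.934],
}


def leaders(scores, metrics):
    """Map each metric to the models that reach its top score (ties kept)."""
    result = {}
    for i, metric in enumerate(metrics):
        best = max(row[i] for row in scores.values())
        result[metric] = [name for name, row in scores.items() if row[i] == best]
    return result


LEADERS = leaders(SCORES, METRICS)
for metric, names in LEADERS.items():
    print(f"{metric:8s} {', '.join(names)}")
```

Among these rows, MaineCoon leads or ties on every column except Mot, where SoulX-FlashTalk scores higher, and it ties LTX-2.3 on AV-Al.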

Latency and model size comparison. Sampling throughput (FPS) measured for 480P 20-second generation on a single H100 GPU.

Type	Model	Params	FPS↑
Bidirectional T2AV	LTX-2.3-Distilled	22B	20.7
Streaming T2V	Causal-Forcing	1.3B	19.1
Streaming T2AV	🐱 MaineCoon (Ours)	22B	47.5 🥇
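
The throughput figures can be turned into relative speedups with simple arithmetic, sketched below. The FPS values come from the table above. The helpers `speedup` and `wall_clock_seconds` are names introduced for this example, and the 24 FPS playback rate is a hypothetical value used only for illustration, since this card does not state the output frame rate.

```python
# Throughput copied from the latency table above (480P, 20-second clips, one H100).
FPS = {
    "LTX-2.3-Distilled": 20.7,
    "Causal-Forcing": 19.1,
    "MaineCoon": 47.5,
}


def speedup(fps, target="MaineCoon"):
    """Return the target's throughput divided by each other model's throughput."""
    return {name: fps[target] / value for name, value in fps.items() if name != target}


def wall_clock_seconds(clip_seconds, playback_fps, sampling_fps):
    """Time to sample a clip; playback_fps is an assumed parameter, not from the card."""
    return clip_seconds * playback_fps / sampling_fps


SPEEDUP = speedup(FPS)
for name, ratio in SPEEDUP.items():
    print(f"MaineCoon vs {name}: {ratio:.2f}x")

# Hypothetical 24 FPS playback rate, chosen only to make the arithmetic concrete.
CLIP_TIME = wall_clock_seconds(20, 24, FPS["MaineCoon"])
print(f"{CLIP_TIME:.1f} s to sample a 20 s clip")
```

The ratios compare raw sampling throughput only. The models differ in size and task (Causal-Forcing is a 1.3B text-to-video model without audio), so they are not a like-for-like efficiency measure.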

Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

🎬 LTX-2.3 & the LTX series — Lightricks. MaineCoon's audio-visual backbone builds on the excellent open LTX-2.3 model.
⚡ DMD series. Our reinforced online-policy distillation (ROPD) builds on the Distribution Matching Distillation (DMD / DMD2) line of work.

Citation

@article{catnip2026mainecoon,
  title        = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author       = {Lichen Bai and Tianhao Zhang and Shitong Shao and Dingwei Tan and Qiyu Zhong and Zhengpeng Xie and Haopeng Li and Qinghao Huang and Dandan Shen and Tengjiao Ji and Wei Wang and Peicheng Wu and Yuxuan Zhao and Xiangyu Zhu and Welly Luo and Shurui Yang and Zeke Xie},
  year         = {2026},
  journal      = {arXiv preprint arXiv:2606.17800},
  url          = {https://arxiv.org/abs/2606.17800}
}
