Small Vision-Language Models are Smart Compressors for Long Video Understanding • Paper • 2604.08120 • Published 19 days ago • 20
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web • Paper • 2604.08516 • Published 19 days ago • 42
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models • Paper • 2604.08546 • Published 19 days ago • 115
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver • Paper • 2604.08377 • Published 19 days ago • 286
ClawBench: Can AI Agents Complete Everyday Online Tasks? • Paper • 2604.08523 • Published 19 days ago • 260
ELT: Elastic Looped Transformers for Visual Generation • Paper • 2604.09168 • Published 18 days ago • 20
WildDet3D: Scaling Promptable 3D Detection in the Wild • Paper • 2604.08626 • Published 19 days ago • 242
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music • Paper • 2604.10905 • Published 15 days ago • 28
Strips as Tokens: Artist Mesh Generation with Native UV Segmentation • Paper • 2604.09132 • Published 18 days ago • 53
The Ultra-Scale Playbook 🌌 • 3.82k • The ultimate guide to training LLMs on large GPU clusters