Collections
Discover the best community collections!
Collections including paper arxiv:2509.16197
- Reconstruction Alignment Improves Unified Multimodal Models
  Paper • 2509.07295 • Published • 40
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  Paper • 2509.06951 • Published • 31
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
  Paper • 2509.06818 • Published • 29
- Interleaving Reasoning for Better Text-to-Image Generation
  Paper • 2509.06945 • Published • 14

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 18
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 54
- Qwen/Qwen3-Omni-30B-A3B-Instruct
  Any-to-Any • 35B • Updated • 307k • 696
- facebook/dinov3-vitb16-pretrain-lvd1689m
  Image Feature Extraction • 85.7M • Updated • 553k • 68
- nvidia/NV-Embed-v2
  Feature Extraction • 8B • Updated • 151k • 477

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 54
- InternRobotics/VLAC
  Robotics • 2B • Updated • 35 • 37
- LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
  Paper • 2509.12203 • Published • 19
- A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
  Paper • 2509.15937 • Published • 20

- R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
  Paper • 2508.21113 • Published • 109
- Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
  Paper • 2508.16949 • Published • 22
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
  Paper • 2508.21112 • Published • 75
- UItron: Foundational GUI Agent with Advanced Perception and Planning
  Paper • 2508.21767 • Published • 12

- lusxvr/nanoVLM-222M
  Image-Text-to-Text • 0.2B • Updated • 260 • 97
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
  Paper • 2503.09516 • Published • 36
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
  Paper • 2505.24863 • Published • 97
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
  Paper • 2505.17667 • Published • 88

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23