-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 28 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2506.22434
-
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Paper • 2505.15966 • Published • 53 -
GRIT: Teaching MLLMs to Think with Images
Paper • 2505.15879 • Published • 12 -
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Paper • 2505.16854 • Published • 11 -
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Paper • 2505.16192 • Published • 12
-
Scalable Pre-training of Large Autoregressive Image Models
Paper • 2401.08541 • Published • 38 -
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
Paper • 2403.03194 • Published • 15 -
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10
-
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10 -
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 75 -
RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 72 -
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 35
-
RL + Transformer = A General-Purpose Problem Solver
Paper • 2501.14176 • Published • 28 -
Towards General-Purpose Model-Free Reinforcement Learning
Paper • 2501.16142 • Published • 30 -
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 123 -
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Paper • 2412.12098 • Published • 4
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 28 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10 -
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 75 -
RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 72 -
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 35
-
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Paper • 2505.15966 • Published • 53 -
GRIT: Teaching MLLMs to Think with Images
Paper • 2505.15879 • Published • 12 -
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Paper • 2505.16854 • Published • 11 -
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Paper • 2505.16192 • Published • 12
-
RL + Transformer = A General-Purpose Problem Solver
Paper • 2501.14176 • Published • 28 -
Towards General-Purpose Model-Free Reinforcement Learning
Paper • 2501.16142 • Published • 30 -
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 123 -
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Paper • 2412.12098 • Published • 4
-
Scalable Pre-training of Large Autoregressive Image Models
Paper • 2401.08541 • Published • 38 -
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
Paper • 2403.03194 • Published • 15 -
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10