Collections
Discover the best community collections!
Collections including paper arxiv:2412.02611

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
  Paper • 2505.15957 • Published • 3
- Roadmap towards Superhuman Speech Understanding using Large Language Models
  Paper • 2410.13268 • Published • 34
- StressTest: Can YOUR Speech LM Handle the Stress?
  Paper • 2505.22765 • Published • 17
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
  Paper • 2411.05361 • Published • 3

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 29
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 36
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 95

- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
  Paper • 2407.07053 • Published • 47
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
  Paper • 2407.12772 • Published • 35
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
  Paper • 2407.11691 • Published • 15
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
  Paper • 2408.02718 • Published • 62

- GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
  Paper • 2411.18499 • Published • 18
- VLSBench: Unveiling Visual Leakage in Multimodal Safety
  Paper • 2411.19939 • Published • 10
- AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
  Paper • 2412.02611 • Published • 24
- U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
  Paper • 2412.03205 • Published • 18