- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

Collections
Collections including paper arxiv:2507.21045

- Describe Anything: Detailed Localized Image and Video Captioning
  Paper • 2504.16072 • Published • 63
- EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment
  Paper • 2410.09604 • Published
- Geospatial Mechanistic Interpretability of Large Language Models
  Paper • 2505.03368 • Published • 10
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
  Paper • 2505.02836 • Published • 8

- MotionLLM: Understanding Human Behaviors from Human Motions and Videos
  Paper • 2405.20340 • Published • 20
- Spectrally Pruned Gaussian Fields with Neural Compensation
  Paper • 2405.00676 • Published • 10
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First
  Paper • 2404.18212 • Published • 29
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
  Paper • 2405.00732 • Published • 121

- UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
  Paper • 2506.23219 • Published • 7
- CityGPT: Empowering Urban Spatial Cognition of Large Language Models
  Paper • 2406.13948 • Published • 1
- CityBench: Evaluating the Capabilities of Large Language Model as World Model
  Paper • 2406.13945 • Published • 1
- A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science
  Paper • 2504.09848 • Published

- 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
  Paper • 2503.10437 • Published • 32
- Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
  Paper • 2503.09642 • Published • 19
- VGGT: Visual Geometry Grounded Transformer
  Paper • 2503.11651 • Published • 32
- 1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering
  Paper • 2503.16422 • Published • 14