- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
  Paper • 2506.20920 • Published • 74
- SmolVLM: Redefining small and efficient multimodal models
  Paper • 2504.05299 • Published • 200
- YourBench: Easy Custom Evaluation Sets for Everyone
  Paper • 2504.01833 • Published • 22
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  Paper • 2502.02737 • Published • 246

Collections including paper arxiv:2504.01833

- Training Software Engineering Agents and Verifiers with SWE-Gym
  Paper • 2412.21139 • Published • 24
- Evaluating Language Models as Synthetic Data Generators
  Paper • 2412.03679 • Published • 48
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- Self-Discover: Large Language Models Self-Compose Reasoning Structures
  Paper • 2402.03620 • Published • 117

- LoRA+: Efficient Low Rank Adaptation of Large Models
  Paper • 2402.12354 • Published • 6
- The FinBen: An Holistic Financial Benchmark for Large Language Models
  Paper • 2402.12659 • Published • 23
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
  Paper • 2402.13249 • Published • 13
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69

- CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
  Paper • 2406.08587 • Published • 16
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
  Paper • 2406.09170 • Published • 27
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 35
- Benchmarking Agentic Workflow Generation
  Paper • 2410.07869 • Published • 28

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 238
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 37
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 39