RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models Paper • 2310.00746 • Published Oct 1, 2023 • 1
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models Paper • 2410.11710 • Published Oct 15, 2024 • 20
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model Paper • 2410.13639 • Published Oct 17, 2024 • 19
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20, 2025 • 104
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? Paper • 2502.19361 • Published Feb 26, 2025 • 28
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5, 2025 • 33
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization Paper • 2507.06181 • Published Jul 8, 2025 • 43
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs Paper • 2504.15415 • Published Apr 21, 2025 • 22
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models Paper • 2502.16614 • Published Feb 23, 2025 • 27