Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation Paper • 2606.12594 • Published 11 days ago • 16
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models Paper • 2603.06148 • Published Mar 6 • 2
Do Composed Image Retrieval Benchmarks Require Multimodal Composition? Paper • 2605.14787 • Published May 15
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Paper • 2605.31433 • Published 23 days ago • 28
Learning GUI Grounding with Spatial Reasoning from Visual Feedback Paper • 2509.21552 • Published Sep 25, 2025 • 11
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent Paper • 2508.06600 • Published Aug 8, 2025 • 42
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency Paper • 2504.18589 • Published Apr 24, 2025 • 13 • 3