On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4 • 4