Is AIME 2026 Benchmark evaluated with Tool Call enabled?

#14
Opened by ghostplant

Without tool calls, I cannot reproduce the 99% score for GLM-5.2; I get ~90% instead.

Z.ai org

Hi, the system prompt and sampling parameters should be applied as described in the footnote.

