Establishing Best Practices for Rigorous Agentic Benchmarks
Yuxuan Zhu1, Tengjun Jin1, Yada Pruksachatkun, Andy Zhang2, Shu Liu3, Sasha Cui4, Sayash Kapoor5, Shayne Longpre6, Kevin Meng7, Rebecca Weiss8, Fazl Barez8, Rahul Gupta9, Jwala Dhamala9, Jacob Merizian10, Mario Giulianelli10, Harry Coppock10, Cozmin Ududec10, Jasjeet Sekhon4, Jacob Steinhardt7, Sarah Schwettmann7, Matei Zaharia3, Ion Stoica3, Percy Liang2, Daniel Kang1
Problem
As AI agents move from research demos to real-world assistants, the only way to know what they can (and cannot) do is to test them. Benchmarks measure the high-level capabilities and shortcomings of agentic frameworks and base models, and they are crucial for steering research, shaping product roadmaps, and helping customers pick the right model. However, these benchmarks often contain flaws that misrepresent agent performance by up to 40% on popular benchmarks such as SWE-bench-Verified and τ-bench.
Taxonomy
We identify two major challenges in creating rigorous agentic benchmarks:
- Task Validity: a task should be solvable if and only if the agent possesses the target capability.
- Outcome Validity: the evaluation method (e.g., tests or checks) should correctly indicate whether the task has been solved (see the toy sketch after this list for how this can fail).
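As a toy illustration of how outcome validity can fail (a made-up example of ours, not one of the benchmark flaws reported below), consider grading the task "count the data rows in a CSV file". The weak checker accepts any answer that mentions "rows", so an agent that never opens the file can still pass; the strict checker recomputes the ground truth from the fixture and requires the exact count. The fixture and function names are hypothetical.

```python
import csv
import io

# Toy task fixture: a CSV with a header row and 2 data rows.
TASK_CSV = "a,b\n1,2\n3,4\n"
EXPECTED_ROWS = 2

def weak_checker(agent_output: str) -> bool:
    # Outcome-invalid: any answer containing "rows" passes, even "0 rows".
    return "rows" in agent_output.lower()

def strict_checker(agent_output: str) -> bool:
    # Recompute the ground truth and require the exact count in the answer.
    truth = sum(1 for _ in csv.reader(io.StringIO(TASK_CSV))) - 1  # minus header
    numbers = [int(tok) for tok in agent_output.split() if tok.isdigit()]
    return truth == EXPECTED_ROWS and truth in numbers

print(weak_checker("There are 0 rows"))       # True  -> false positive
print(strict_checker("There are 0 rows"))     # False
print(strict_checker("The file has 2 rows"))  # True
```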
Checklist Assessment
We develop the Agentic Benchmark Checklist (ABC), a set of concrete and actionable guidelines for ensuring outcome and task validity. For cases where perfect guarantees of outcome and task validity are particularly challenging or impossible, we also provide guidelines to ensure the quality and rigor of benchmark reporting.
We apply ABC to ten widely used agentic benchmarks:
| Benchmark | Outcome Validity (%) | Task Validity (%) | Benchmark Reporting (%) | Overall (%) | Contributor |
|---|---|---|---|---|---|
| MLE-Bench | 100 | 90 | 92.3 | 94.1 | AgenticBenchmarkChecklist Team |
| CyBench | 100 | 100 | 69.2 | 89.7 | AgenticBenchmarkChecklist Team |
| GAIA | 100 | 60 | 53.8 | 71.3 | AgenticBenchmarkChecklist Team |
| τ-bench | 50 | 100 | 46.2 | 65.4 | AgenticBenchmarkChecklist Team |
| OSWorld | 66.7 | 80 | 46.2 | 64.3 | AgenticBenchmarkChecklist Team |
| SWE-Lancer | 50 | 80 | 53.8 | 61.3 | AgenticBenchmarkChecklist Team |
| SWE-bench-Verified | 50 | 100 | 30.8 | 60.3 | AgenticBenchmarkChecklist Team |
| Bird-Bench | 50 | 60 | 46.2 | 52.1 | AgenticBenchmarkChecklist Team |
| WebArena | 50 | 40 | 46.2 | 45.4 | AgenticBenchmarkChecklist Team |
| KernelBench | 0 | 80 | 53.8 | 44.6 | AgenticBenchmarkChecklist Team |
Based on our analysis, we suggest the following best practices for benchmark developers:
- Use process-based evaluation metrics alongside outcome-based metrics.
- Benchmark your LLM-as-a-judge in a reproducible manner (see the sketch after this list). Tools such as AlignEval can help with evaluating your LLM evaluator.
- If possible, use frozen websites for tasks that require navigation and website reading.
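As a rough sketch of the second point, the snippet below benchmarks a judge against a small hand-labeled set and reports its agreement and error rates before it is trusted inside a harness. `call_judge_model` is a hypothetical placeholder for your actual LLM call (ideally pinned to a fixed model version with temperature 0 so the run is reproducible), and the labeled examples are made up for illustration.

```python
import json
import random
from typing import Callable

def call_judge_model(task: str, answer: str) -> bool:
    """Hypothetical placeholder: swap in your real LLM judge call
    (pin the model version and set temperature=0 for reproducibility)."""
    return "correct" in answer.lower()  # stand-in logic for this sketch

def benchmark_judge(labeled: list[dict], judge: Callable[[str, str], bool]) -> dict:
    """Compare judge verdicts against human labels and report agreement stats."""
    tp = fp = tn = fn = 0
    for ex in labeled:
        verdict = judge(ex["task"], ex["answer"])
        if verdict and ex["human_label"]:
            tp += 1
        elif verdict and not ex["human_label"]:
            fp += 1  # judge accepts a wrong answer -> inflates benchmark scores
        elif not verdict and ex["human_label"]:
            fn += 1  # judge rejects a right answer -> deflates benchmark scores
        else:
            tn += 1
    n = len(labeled)
    return {
        "agreement": (tp + tn) / n,
        "false_positive_rate": fp / max(fp + tn, 1),
        "false_negative_rate": fn / max(fn + tp, 1),
        "n": n,
    }

if __name__ == "__main__":
    random.seed(0)  # fix any sampling so the judge benchmark is reproducible
    labeled_set = [  # made-up labeled examples for illustration
        {"task": "2+2?", "answer": "4, which is correct", "human_label": True},
        {"task": "2+2?", "answer": "5", "human_label": False},
        {"task": "Capital of France?", "answer": "Paris", "human_label": True},
    ]
    print(json.dumps(benchmark_judge(labeled_set, call_judge_model), indent=2))
```

Reporting these numbers alongside benchmark results makes it clear how much of an agent's score could be an artifact of judge error rather than actual capability.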
Contribute to the Agent Benchmark Checklist
Upholding the validity of agentic benchmarks requires effort from the broader scientific community. If you’re passionate about reliable evaluation in AI, we’d love your help.
Here are some ways to get involved: