Establishing Best Practices for Rigorous Agentic Benchmarks
Yuxuan Zhu1, Tengjun Jin1, Yada Pruksachatkun, Andy Zhang2, Shu Liu3, Sasha Cui4, Sayash Kapoor5, Shayne Longpre6, Kevin Meng7, Rebecca Weiss8, Fazl Barez8, Rahul Gupta9, Jwala Dhamala9, Jacob Merizian10, Mario Giulianelli10, Harry Coppock10, Cozmin Ududec10, Jasjeet Sekhon4, Jacob Steinhardt7, Sarah Schwettmann7, Matei Zaharia3, Ion Stoica3, Percy Liang2, Daniel Kang1
Problem
As AI agents move from research demos to real-world assistants, the only way to know what they can (and cannot) do is to test them. Benchmarks measure the high-level capabilities and shortcomings of agentic frameworks and base models, and they are crucial for steering research, shaping product roadmaps, and helping customers pick the right model. However, these benchmarks often contain flaws that misrepresent agent performance by up to 40% on popular benchmarks such as SWE-bench-Verified and τ-bench.
Taxonomy
We identify two major challenges in creating rigorous agentic benchmarks:
- Task Validity: a task should be solvable if and only if the agent possesses the target capability.
- Outcome Validity: the evaluation method (e.g., tests or checks) should correctly indicate whether the task has been solved (see the toy sketch after this list for how this can fail).
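As a toy illustration of how outcome validity can fail (a made-up example of ours, not one of the benchmark flaws reported below), consider grading the task "count the data rows in a CSV file". The weak checker accepts any answer that mentions "rows", so an agent that never opens the file can still pass; the strict checker recomputes the ground truth from the fixture and requires the exact count. The fixture and function names are hypothetical.

```python
import csv
import io

# Toy task fixture: a CSV with a header row and 2 data rows.
TASK_CSV = "a,b\n1,2\n3,4\n"
EXPECTED_ROWS = 2

def weak_checker(agent_output: str) -> bool:
    # Outcome-invalid: any answer containing "rows" passes, even "0 rows".
    return "rows" in agent_output.lower()

def strict_checker(agent_output: str) -> bool:
    # Recompute the ground truth and require the exact count in the answer.
    truth = sum(1 for _ in csv.reader(io.StringIO(TASK_CSV))) - 1  # minus header
    numbers = [int(tok) for tok in agent_output.split() if tok.isdigit()]
    return truth == EXPECTED_ROWS and truth in numbers

print(weak_checker("There are 0 rows"))       # True  -> false positive
print(strict_checker("There are 0 rows"))     # False
print(strict_checker("The file has 2 rows"))  # True
```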
Checklist Assessment
We develop the Agentic Benchmark Checklist (ABC), a set of concrete and actionable guidelines for ensuring outcome and task validity. For cases where perfect guarantees of outcome and task validity are particularly challenging or impossible, we also provide guidelines to ensure the quality and rigor of benchmark reporting.
We apply ABC to ten widely used agentic benchmarks:
| Benchmark | Outcome Validity (%) | Task Validity (%) | Benchmark Reporting (%) | Overall (%) | Contributor |
|---|---|---|---|---|---|
| MLE-Bench | 100 | 90 | 92.3 | 94.1 | AgenticBenchmarkChecklist Team |
| CyBench | 100 | 100 | 69.2 | 89.7 | AgenticBenchmarkChecklist Team |
| GAIA | 100 | 60 | 53.8 | 71.3 | AgenticBenchmarkChecklist Team |
| τ-bench | 50 | 100 | 46.2 | 65.4 | AgenticBenchmarkChecklist Team |
| OSWorld | 66.7 | 80 | 46.2 | 64.3 | AgenticBenchmarkChecklist Team |
| SWE-Lancer | 50 | 80 | 53.8 | 61.3 | AgenticBenchmarkChecklist Team |
| SWE-bench-Verified | 50 | 100 | 30.8 | 60.3 | AgenticBenchmarkChecklist Team |
| Bird-Bench | 50 | 60 | 46.2 | 52.1 | AgenticBenchmarkChecklist Team |
| WebArena | 50 | 40 | 46.2 | 45.4 | AgenticBenchmarkChecklist Team |
| KernelBench | 0 | 80 | 53.8 | 44.6 | AgenticBenchmarkChecklist Team |
Based on our analysis, we suggest the following best practices for benchmark developers:
- Use process-based evaluation metrics alongside outcome-based metrics.
- Benchmark your LLM-as-a-judge in a reproducible manner (see the sketch after this list). Tools such as AlignEval can help with evaluating your LLM evaluator.
- If possible, use frozen websites for tasks that require navigation and website reading.
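As a rough sketch of the second point, the snippet below benchmarks a judge against a small hand-labeled set and reports its agreement and error rates before it is trusted inside a harness. `call_judge_model` is a hypothetical placeholder for your actual LLM call (ideally pinned to a fixed model version with temperature 0 so the run is reproducible), and the labeled examples are made up for illustration.

```python
import json
import random
from typing import Callable

def call_judge_model(task: str, answer: str) -> bool:
    """Hypothetical placeholder: swap in your real LLM judge call
    (pin the model version and set temperature=0 for reproducibility)."""
    return "correct" in answer.lower()  # stand-in logic for this sketch

def benchmark_judge(labeled: list[dict], judge: Callable[[str, str], bool]) -> dict:
    """Compare judge verdicts against human labels and report agreement stats."""
    tp = fp = tn = fn = 0
    for ex in labeled:
        verdict = judge(ex["task"], ex["answer"])
        if verdict and ex["human_label"]:
            tp += 1
        elif verdict and not ex["human_label"]:
            fp += 1  # judge accepts a wrong answer -> inflates benchmark scores
        elif not verdict and ex["human_label"]:
            fn += 1  # judge rejects a right answer -> deflates benchmark scores
        else:
            tn += 1
    n = len(labeled)
    return {
        "agreement": (tp + tn) / n,
        "false_positive_rate": fp / max(fp + tn, 1),
        "false_negative_rate": fn / max(fn + tp, 1),
        "n": n,
    }

if __name__ == "__main__":
    random.seed(0)  # fix any sampling so the judge benchmark is reproducible
    labeled_set = [  # made-up labeled examples for illustration
        {"task": "2+2?", "answer": "4, which is correct", "human_label": True},
        {"task": "2+2?", "answer": "5", "human_label": False},
        {"task": "Capital of France?", "answer": "Paris", "human_label": True},
    ]
    print(json.dumps(benchmark_judge(labeled_set, call_judge_model), indent=2))
```

Reporting these numbers alongside benchmark results makes it clear how much of an agent's score could be an artifact of judge error rather than actual capability.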
Contribute to the Agent Benchmark Checklist
Upholding the validity of agentic benchmarks requires effort from the broader scientific community. If you’re passionate about reliable evaluation in AI, we’d love your help.
Here are some ways to get involved: