Contribute to the Agentic Benchmark Checklist

Upholding the validity of agentic benchmarks requires effort from the broader scientific community. If you’re passionate about reliable evaluation in AI, we’d love your help.

Here are some ways to get involved:

  1. Apply the checklist to an existing benchmark and submit your results here.

  2. Contribute proof-of-concept exploits, along with fixes for them, to our repo.

  3. Give feedback on the checklist itself here.