Guides · 50 benchmarks

Benchmark guides

Deep dives on the AI evaluations that actually move procurement decisions. Each guide covers methodology, scoring, common gaming patterns, and how to ship a 50-page-attested score nobody can challenge.

code agent
HumanEval
164 Python programming problems, pass@1 via deterministic test execution.
code agent
HumanEval+
EvalPlus hardening — ~80× more test cases than HumanEval.
code agent
MBPP
974 entry-level Python problems with auto-graded tests.
code agent
MBPP+
EvalPlus hardening over the sanitized MBPP subset.
code agent
BigCodeBench
1,140 realistic Python tasks invoking complex libraries (pandas, matplotlib, sklearn).
code agent
LiveCodeBench
Contest-style coding problems with monthly time windows — contamination-resistant.
code agent
SWE-bench Verified
500 real GitHub issues, human-vetted, Docker test harness.
code agent
SWE-bench Lite
300-issue lightweight subset of SWE-bench.
reasoning
MMLU
57 subjects, 4-choice MCQ. The classic knowledge benchmark.
reasoning
MMLU-Pro
MMLU's hardened successor — 10 options, harder reasoning.
reasoning
GPQA Diamond
PhD-level physics, chemistry, biology. Google-proof.
reasoning
BIG-Bench Hard
23 challenging reasoning tasks pulled from BIG-Bench.
reasoning
ARC-Challenge
Grade-school science MCQ, harder subset.
reasoning
HellaSwag
Commonsense sentence-completion MCQ.
reasoning
WinoGrande
Commonsense coreference, binary choice (Winograd-style).
reasoning
PIQA
Physical commonsense reasoning, binary choice.
reasoning
MuSR
Multistep soft reasoning: murder mysteries, team allocation.
reasoning
AGIEval
Human-exam questions: SAT, GRE, Chinese civil service.
reasoning
GSM8K
8.5k grade-school arithmetic word problems.
reasoning
MATH
Competition math problems, LaTeX boxed answers.
reasoning
AIME 2024
American Invitational Mathematics Examination. Integer answers from 000 to 999.
reasoning
MGSM
Multilingual GSM8K across 11 languages.
reasoning
TheoremQA
Theorem-grounded STEM problems requiring numeric/expression answers.
agent framework
τ-Bench
Multi-turn tool-use trajectories in retail and airline environments.
agent framework
GAIA
General AI assistants — 466 real-world tasks, exact-match scoring.
agent framework
WebArena
Realistic web navigation tasks in hosted shopping / CMS / GitLab environments.
code agent
SWE-rebench
Monthly-refreshed SWE-bench variant, contamination-resistant.
safety
TruthfulQA MC1
817 tricky questions, single-correct multiple choice.
safety
SimpleQA
OpenAI's factuality benchmark. 4,326 short-answer questions, GPT-4o judged.
reasoning
CommonsenseQA
5-way MCQ over ConceptNet-derived commonsense.
memory
RULER
NVIDIA's 13-task long-context suite at configurable window sizes.
memory
NIAH · Needle in a Haystack
Retrieve a seeded fact from 4k → 128k token contexts. Deterministic generator.
memory
LongMemEval
Long-conversation memory Q&A. GPT-4o canonical judge.
benchmark
InfiniteBench
reasoning
MMMU
11.5k multimodal questions across 30 subjects in 6 disciplines.
reasoning
MathVista
Math reasoning with visual context (1k testmini).
code agent
SWE-bench Multimodal
JavaScript issues with visual context. Docker harness required.
benchmark
Creative Writing
benchmark
WritingBench
benchmark
AlpacaEval
reasoning
LiveBench
Monthly-refreshed benchmark across math, coding, reasoning, data analysis, language.
benchmark
Scalable Agentic Bench
benchmark
MLE-bench
benchmark
FinBench
reasoning
MedQA
USMLE-style medical licensing exam questions.
benchmark
LawBench
benchmark
ChemBench
benchmark
FrontierMath
benchmark
Humanity's Last Exam
reasoning
ARC-AGI
Chollet's abstraction and reasoning corpus. Grid-puzzle transformations.