Benchmark guides

Research-grade guides on the AI evaluations that move procurement decisions. Each guide covers history, methodology, contamination notes, and common gaming patterns, plus a live leaderboard populated from the Benchlist registry.

24 deep guides · 26 stubs in progress · 50 articles total

code agent · deep

164-problem code-generation benchmark · pass@1 · MIT

code agent · deep

EvalPlus hardening of HumanEval · 80× more unit tests

code agent · deep

Mostly Basic Python Problems · 974 entry-level tasks

code agent · stub

EvalPlus hardening over the sanitized MBPP subset.

code agent · deep

1,140 hard programming problems · function calls into 139 libraries

code agent · deep

Continuously updated competitive programming · contamination-free by design

code agent · deep

SWE-bench Verified

500 hand-verified real-world GitHub issues · pip-installable patch

code agent · deep

300 lighter SWE-bench tasks · Princeton subset for fast evaluation

reasoning · deep

57 academic subjects · 14k multiple-choice questions

reasoning · deep

MMLU's harder, less-contaminated successor · 12k problems

reasoning · deep

Graduate-level Google-Proof Q&A · 198 hard PhD-level questions

reasoning · stub

23 challenging reasoning tasks pulled from BIG-Bench.

reasoning · deep

AI2 Reasoning Challenge · 7,787 grade-school science MC questions

reasoning · deep

Sentence completion stress test · 10k validation examples

reasoning · deep

44k commonsense pronoun-resolution problems

reasoning · stub

Physical commonsense reasoning, binary choice.

reasoning · stub

Multistep soft reasoning: murder mysteries, team allocation.

reasoning · stub

Human-exam questions: SAT, GRE, Chinese civil service.

reasoning · deep

8.5K grade-school math word problems · numeric exact match

reasoning · deep

12.5K competition math problems · the parent of MATH-500

reasoning · deep

American Invitational Mathematics Examination · 30 hard problems per year

reasoning · deep

Multilingual Grade School Math · 250 GSM8K problems translated into 10 languages

reasoning · stub

Theorem-grounded STEM problems requiring numeric/expression answers.

agent framework · deep

Tool-using agent evaluation · realistic multi-turn customer-service tasks

agent framework · stub

General AI assistants, 466 real-world tasks, exact-match scoring.

agent framework · stub

Realistic web navigation tasks in hosted shopping / CMS / GitLab environments.

code agent · stub

Monthly-refreshed SWE-bench variant, contamination-resistant.

817 questions designed to elicit common falsehoods

4k factual questions designed to surface hallucination

reasoning · deep

12k commonsense multiple-choice questions · 5 options each

NVIDIA's 13-task long-context suite at configurable window sizes.

NIAH · Needle in a Haystack

Retrieve a seeded fact from 4k → 128k token contexts. Deterministic generator.

Long-conversation memory Q&A. GPT-4o canonical judge.

benchmark · stub

AI evaluation benchmark

reasoning · deep

Massive Multi-discipline Multi-modal Understanding · 11.5k expert-level questions

reasoning · stub

Math reasoning with visual context (1k testmini).

code agent · stub

SWE-bench Multimodal

JavaScript issues with visual context. Docker harness required.

benchmark · stub

Creative Writing

AI evaluation benchmark

benchmark · stub

AI evaluation benchmark

benchmark · stub

AI evaluation benchmark

reasoning · stub

Monthly-refreshed benchmark across math, coding, reasoning, data analysis, language.

benchmark · stub

Scalable Agentic Bench

AI evaluation benchmark

benchmark · stub

AI evaluation benchmark

benchmark · stub

AI evaluation benchmark

reasoning · stub

USMLE-style medical licensing exam questions.

benchmark · stub

AI evaluation benchmark

benchmark · stub

AI evaluation benchmark

benchmark · deep

Hidden expert-curated math problems · designed to be unsolved

benchmark · stub

Humanity's Last Exam

AI evaluation benchmark

reasoning · deep

Abstraction & Reasoning Corpus · the few-shot pattern test that won't die