MATH: methodology, history, and how to verify a published score

History

MATH was released by Hendrycks et al. in 2021 in the paper "Measuring Mathematical Problem Solving With the MATH Dataset". It contains 12,500 problems from high-school competitions (AMC, AIME, HMMT, etc.), each with a full LaTeX solution and a final boxed answer.

MATH was the canonical hard-math benchmark from 2021 to 2024. Frontier reasoning models now score 80%+, while the pre-o1 frontier (GPT-4) scored around 50%. The 500-problem subset MATH-500 is the version OpenAI uses for o1-class evaluations.

How MATH is graded

The final boxed answer is extracted from the model output via regex. Answer normalisation then handles equivalent forms (1/2 = 0.5 = \frac{1}{2}). Tolerant text-match grading is the field standard.
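The extract-then-normalise flow described above can be sketched as follows. This is a minimal illustration, not the grader any particular harness ships: the function names are hypothetical, the extraction uses brace matching rather than a single regex so that nested braces such as \boxed{\frac{1}{2}} survive, and the normalisation only covers simple numeric fractions and decimals.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


# Matches \frac{a}{b} and \dfrac{a}{b} with integer a, b (assumption: no nesting).
FRAC = re.compile(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}")


def normalise(answer):
    """Map an answer to an exact Fraction when possible, else a cleaned string."""
    s = answer.strip().strip("$").replace(" ", "")
    s = FRAC.sub(r"\1/\2", s)
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return s


def is_equivalent(predicted, reference):
    """Tolerant comparison: equal after normalisation."""
    return normalise(predicted) == normalise(reference)
```

Real graders handle many more cases (units, intervals, symbolic expressions), which is part of why two graders can disagree on the same output.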

Each problem is tagged with one of five difficulty levels (1–5) and one of seven subjects (algebra, geometry, etc.). Score breakdowns by subject and level reveal more than aggregate scores.
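A per-subject and per-level breakdown is a simple group-by over graded results. The sketch below assumes each result is a dict with hypothetical "subject", "level", and "correct" fields; the toy rows are invented for illustration and are not real scores.

```python
from collections import defaultdict


def breakdown(results, key):
    """Accuracy per value of `key` (for example "subject" or "level")."""
    totals = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for r in results:
        bucket = totals[r[key]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {k: c / n for k, (c, n) in sorted(totals.items())}


# Toy graded results, invented for illustration.
results = [
    {"subject": "algebra", "level": 1, "correct": True},
    {"subject": "algebra", "level": 5, "correct": False},
    {"subject": "geometry", "level": 3, "correct": True},
    {"subject": "geometry", "level": 5, "correct": False},
]

aggregate = sum(r["correct"] for r in results) / len(results)
by_level = breakdown(results, "level")
by_subject = breakdown(results, "subject")
```

In this toy set the aggregate and both subjects sit at 50%, while the level view shows every level-5 problem failing, which is the kind of signal an aggregate hides.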

Common pitfalls when reporting MATH

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Tool use is everything. MATH with a Python calculator is a different benchmark from MATH with a chain-of-thought scratchpad. Disclose.
Grading-script lottery. Different equivalence checkers produce 2–4pp drift. Same model outputs, different graders, different scores.
Cross-subset comparisons. Full MATH ≠ MATH-500 ≠ MATH-Hard. Always specify the subset.
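The grading-script lottery is easy to reproduce on a toy scale. The sketch below scores the same predictions with two hypothetical graders, one strict and one tolerant. The pairs are invented, and the gap here is far larger than the 2–4pp seen in practice because the toy set is tiny and chosen to disagree.

```python
from fractions import Fraction


def strict_grader(predicted, reference):
    """Exact string match after trimming whitespace."""
    return predicted.strip() == reference.strip()


def tolerant_grader(predicted, reference):
    """Compare as exact rationals when both parse, else fall back to strict."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return strict_grader(predicted, reference)


def score(pairs, grader):
    """Fraction of (predicted, reference) pairs the grader accepts."""
    return sum(grader(p, r) for p, r in pairs) / len(pairs)


# Toy (predicted, reference) pairs, invented for illustration.
pairs = [("0.5", "1/2"), ("3", "3"), ("2/4", "1/2"), ("7", "8")]

strict_score = score(pairs, strict_grader)
tolerant_score = score(pairs, tolerant_grader)
```

Publishing the grader alongside the score lets a reader tell which of these two numbers they are looking at.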

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship a MATH score that nobody can challenge

Run MATH on Benchlist

Benchlist runs the canonical MATH sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
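To make the idea of a Merkle commitment over transcripts concrete, here is a minimal sketch using only the standard library. It is not Benchlist's construction: the source does not specify the leaf encoding, hash function, or how odd levels are padded, so SHA-256, the 0x00/0x01 prefixes, and last-node duplication are all assumptions chosen for illustration. The Ed25519 signing step is omitted because it needs a third-party library.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(transcripts):
    """Hex root hash over a list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves and inner nodes differently so a leaf cannot pose as a node.
    level = [sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing any single transcript changes the root, so a signature over the root commits to every transcript at once. Note that duplicating the last node lets some lists of different lengths share a root, so a production scheme would also commit to the leaf count or pad differently.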

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "math",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run math --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

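What is MATH?

MATH is a benchmark released by Hendrycks et al. in 2021 in the paper "Measuring Mathematical Problem Solving With the MATH Dataset". It contains 12,500 problems from high-school competitions (AMC, AIME, HMMT, etc.), each with a full LaTeX solution and a final boxed answer.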

How is MATH scored?

The final boxed answer is extracted from the model output via regex. Answer normalisation then handles equivalent forms (1/2 = 0.5 = \frac{1}{2}). Tolerant text-match grading is the field standard.

What's the biggest pitfall when reporting MATH?

Tool use is everything. MATH with a Python calculator is a different benchmark from MATH with a chain-of-thought scratchpad. Disclose.

How do I verify a published MATH score?

Use Benchlist. Run via benchlist run math or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

What are the canonical decoding parameters for MATH?

Per the catalog, MATH runs at temperature 0.0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.
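A small check like the following can flag undisclosed deviations before a score is published. It is a sketch only: the two canonical values come from the answer above, while the dict layout and the function name are hypothetical.

```python
# Canonical decoding parameters as stated above.
CANONICAL = {"temperature": 0.0, "max_tokens": 1024}


def deviations(run_config):
    """Return {param: (canonical, actual)} for every parameter that differs."""
    return {
        name: (expected, run_config.get(name))
        for name, expected in CANONICAL.items()
        if run_config.get(name) != expected
    }
```

An empty result means the run used the canonical settings; anything else should be disclosed next to the score.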