AIME 2024: methodology, history, and how to verify a published score

Q: What's the biggest pitfall when reporting AIME 2024?

Tiny denominator. 30 problems means a 1-problem swing is ~3.3pp. Confidence intervals on AIME scores are wide. Don't read 33% vs 37% as a real difference without n≥3 runs.

Q: How do I verify a published AIME 2024 score?

Use Benchlist. Run the benchmark with benchlist run aime or POST /v1/run; the result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in their browser.

History

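AIME (American Invitational Mathematics Examination) is a 15-question, three-hour exam taken by top-scoring AMC participants. It is offered in two versions each year (AIME I and AIME II), so the AIME 2024 benchmark set contains 30 problems. Answers are integers 0–999.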

AIME 2024 became a canonical frontier-reasoning evaluation after o1's release: solving even 10 of 30 problems was a non-trivial signal in 2024. By 2026, frontier models reach 50%+ on AIME 2024, but performance on the unleaked 2025 problems is much lower (a contamination signal in itself).

How AIME 2024 is graded

Single integer answer per problem in [0, 999]. Grading is exact match. Models typically solve via chain-of-thought followed by a final answer.
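The exact-match rule above can be sketched in a few lines of Python. This is a minimal illustration, not Benchlist's grader: the "Final answer: N" convention, the regular expression, and the function names extract_final_answer and grade are all assumptions made for the example.

```python
import re

# Assumed convention: the model ends its chain-of-thought with "Final answer: <integer>".
# The lookahead rejects decimals such as "12.5" but allows a sentence-ending period.
FINAL_RE = re.compile(r"final answer\s*[:=]\s*(\d+)\b(?!\.\d)", re.IGNORECASE)

def extract_final_answer(transcript):
    """Return the last stated final answer as an int in [0, 999], or None."""
    matches = FINAL_RE.findall(transcript)
    if not matches:
        return None
    value = int(matches[-1])  # "073" and "73" are the same AIME answer
    return value if 0 <= value <= 999 else None

def grade(transcript, expected):
    """Exact match: no partial credit, and a missing answer counts as wrong."""
    return extract_final_answer(transcript) == expected
```

Taking the last match rather than the first means intermediate guesses in the scratchpad do not count; only the answer the model finally commits to is graded.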

Always specify the year. AIME 2023 problems are in many training corpora; AIME 2025 problems generally are not. The same model can score 60% on 2023 and 25% on 2025. That gap is the contamination delta.

Common pitfalls when reporting AIME 2024

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Tiny denominator. 30 problems means a 1-problem swing is ~3.3pp. Confidence intervals on AIME scores are wide. Don't read 33% vs 37% as a real difference without n≥3 runs.
Year leakage. AIME problems older than the training cutoff are heavily memorised. Compare current-year performance, not aggregate.
Chain-of-thought is the model. AIME without scratchpad is a different benchmark, and one most models fail at.
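To make the tiny-denominator point concrete, here is a short sketch that computes a 95% Wilson score interval for a single 30-problem run. It assumes each problem is an independent pass/fail trial, which is only an approximation for a benchmark, and the function name wilson_interval is chosen for this example.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# 10/30 is about 33% and 11/30 is about 37%: the two scores from the pitfall above.
for correct in (10, 11):
    lo, hi = wilson_interval(correct, 30)
    print(f"{correct}/30 = {correct / 30:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Both intervals span roughly 30 percentage points and overlap almost entirely, which is why a one-problem difference from a single run says little on its own.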

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.

How to ship an AIME 2024 score that nobody can challenge

Run AIME 2024 on Benchlist

Benchlist runs the canonical AIME 2024 sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
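As a rough illustration of what a Merkle commitment over transcripts means, the sketch below hashes each transcript into a leaf and folds pairs of hashes up to a single root. The source does not specify Benchlist's leaf encoding, hash function, or tree layout, so SHA-256, the 0x00/0x01 domain-separation prefixes, and the duplicate-last-node rule are assumptions made for this example. Ed25519 signing of the root is omitted because it needs a third-party library.

```python
import hashlib

def _sha256(data):
    return hashlib.sha256(data).digest()

def merkle_root(transcripts):
    """Return the hex Merkle root over a non-empty list of transcript strings."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaf hashes are prefixed with 0x00 and inner nodes with 0x01 (assumed scheme)
    # so that a leaf can never be confused with an inner node.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: pair the last node with itself
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Because the root depends on every leaf, changing a single character in any one transcript changes the root, so a signature over the root commits to all transcripts at once.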

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "aime",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
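The same job can be built from Python with the standard library. This sketch mirrors the URL, headers, and fields of the curl call above and only constructs the request; it does not send it. Setting runs to 3 follows the n≥3 advice from the pitfalls section, on the assumption that the runs field controls the number of repeated runs. The helper name build_run_request is hypothetical, and the response format is not documented here, so no response parsing is shown.

```python
import json
import urllib.request

def build_run_request(api_key, runs=3):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "aime",
        "runs": runs,  # assumed to mean repeated runs of the benchmark
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To submit: urllib.request.urlopen(build_run_request(key)), with key read from BENCHLIST_KEY.
```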

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run aime --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What are the canonical decoding parameters for AIME 2024?

Per the catalog, AIME 2024 runs at temperature 0.0 with max_tokens 4096. Deviating without disclosure makes scores incomparable.
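A small check like the following can flag undisclosed deviations before a score is published. It is a sketch only: the two canonical values come from the sentence above, while the dictionary layout and the function name decoding_deviations are assumptions for illustration.

```python
# Canonical decoding parameters as stated above.
CANONICAL = {"temperature": 0.0, "max_tokens": 4096}

def decoding_deviations(params):
    """Map each deviating parameter to (used_value, canonical_value).

    A parameter that is missing from params is reported with used_value None.
    """
    return {
        name: (params.get(name), canonical)
        for name, canonical in CANONICAL.items()
        if params.get(name) != canonical
    }
```

An empty result means the run used the canonical settings; anything else should be disclosed alongside the score.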