GPQA Diamond: methodology, history, and how to verify a published score

Q: What is GPQA Diamond?

GPQA Diamond is the 198-question high-quality subset of GPQA, the benchmark introduced by Rein et al. in GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023). The full GPQA set has 448 PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts and validated against multi-hour expert efforts.

Q: What's the biggest pitfall when reporting GPQA Diamond?

Tiny denominator. 198 problems means individual questions are ~0.5pp each. Score gaps under 2pp are within sampling noise.

Q: How do I verify a published GPQA Diamond score?

Use Benchlist. Run the benchmark with benchlist run gpqa or POST /v1/run. The result includes a Merkle commitment over every transcript, an Ed25519 signature, and an optional Aligned Layer ZK anchor. Anyone can check the signature in their browser.

History

GPQA was introduced by Rein et al. in GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023). The full benchmark contains 448 PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts and validated against multi-hour expert efforts.

GPQA Diamond is the 198-question high-quality subset. The 'Google-proof' framing means the questions are designed so that even with internet access, a non-expert cannot easily find the answer. By 2026, frontier models score 75–82% on Diamond.

How GPQA Diamond is graded

Four-option multiple-choice. Letter-match grading. Most papers use 5-shot prompting; some use zero-shot to avoid in-context contamination.
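Letter-match grading can be sketched in a few lines. This is a minimal illustration, not the harness used by any particular paper or by Benchlist: the "Answer: X" output convention, the regular expression, and the function names extract_letter and grade are all assumptions made for the example.

```python
import re
from typing import Optional, Sequence

# Assumed convention: the prompt asks the model to finish with "Answer: X".
# The lookahead rejects words such as "Answer: Because ...".
ANSWER_RE = re.compile(r"answer:\s*\(?([A-D])\)?(?![A-Za-z])", re.IGNORECASE)


def extract_letter(completion: str) -> Optional[str]:
    """Return the last answer letter stated in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None


def grade(completions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter.

    A completion with no parseable letter counts as wrong.
    """
    if len(completions) != len(gold):
        raise ValueError("completions and gold must have the same length")
    correct = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Taking the last match rather than the first matters for chain-of-thought outputs, where a model may mention several options before committing to one. How unparseable outputs are treated is a reporting choice that should be disclosed alongside the score.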

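Human baselines from the paper: experts who have or are pursuing PhDs in the relevant domain achieved 65% on average; skilled non-experts with unrestricted internet access achieved 34%. These are the paper's headline figures for the broader question pool, not for Diamond alone. Random guessing on four options gives 25%, so a model scoring 70% is well above the non-expert baseline and above the reported expert average.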

Common pitfalls when reporting GPQA Diamond

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

Tiny denominator. 198 problems means individual questions are ~0.5pp each. Score gaps under 2pp are within sampling noise.
Subject distribution. Diamond is not an even three-way split: chemistry and physics make up most of the 198 questions, and biology only a small share. Models often have domain skews, so a 5pp gap can come entirely from one subject.
Reasoning-mode dependency. GPQA is the canonical chain-of-thought benchmark. Without scratchpad, top models drop 20+ pp.
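The sampling-noise point can be made concrete with a short calculation. The sketch below uses the normal approximation to the binomial, which is an assumption of this example rather than a method prescribed by the benchmark, and the 80% accuracy is an illustrative value, not a reported score.

```python
import math

N_DIAMOND = 198  # number of questions in GPQA Diamond


def per_question_pp(n: int = N_DIAMOND) -> float:
    """Percentage points contributed by a single question."""
    return 100.0 / n


def standard_error_pp(accuracy: float, n: int = N_DIAMOND) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


def ci95_half_width_pp(accuracy: float, n: int = N_DIAMOND) -> float:
    """Half-width of an approximate 95% confidence interval, in percentage points."""
    return 1.96 * standard_error_pp(accuracy, n)


if __name__ == "__main__":
    acc = 0.80  # illustrative accuracy
    print(f"one question: {per_question_pp():.2f} pp")
    print(f"standard error: {standard_error_pp(acc):.2f} pp")
    print(f"95% CI half-width: {ci95_half_width_pp(acc):.2f} pp")
```

At 80% accuracy on 198 questions the standard error is about 2.8pp and the 95% interval is roughly ±5.6pp, so the 2pp rule of thumb is, if anything, lenient. Running on a 50-question subset makes each question worth 2pp and widens the interval further.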

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
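The ranking rule described above amounts to a two-key sort. The sketch below is illustrative only: the record fields (model, score, attested) and the model names are hypothetical, since the actual schema of /api/runs.json is not shown here.

```python
from typing import Dict, List


def rank_runs(runs: List[Dict]) -> List[Dict]:
    """Order runs so attested results come first, then by score descending."""
    # `not r["attested"]` is False (sorts first) for attested runs.
    return sorted(runs, key=lambda r: (not r["attested"], -r["score"]))


# Made-up records to show the ordering; these are not real results.
example = [
    {"model": "model-a", "score": 81.0, "attested": False},
    {"model": "model-b", "score": 76.5, "attested": True},
    {"model": "model-c", "score": 78.0, "attested": True},
]
ranked = [r["model"] for r in rank_runs(example)]
```

Here ranked is ["model-c", "model-b", "model-a"]: the self-reported 81.0 falls below both attested runs despite being the highest number.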

How to ship a GPQA Diamond score that nobody can challenge

Run GPQA Diamond on Benchlist

Benchlist runs the canonical GPQA Diamond sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
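To show what a Merkle commitment over transcripts looks like, here is a minimal sketch using only the standard library. It is not Benchlist's construction: the hash function (SHA-256), the leaf and node prefixes, and the duplicate-last-node rule for odd levels are assumptions chosen for the example. Signing is omitted; in a scheme like the one described, the Ed25519 signature would cover the resulting root.

```python
import hashlib
from typing import Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: Sequence[str]) -> str:
    """Return a hex Merkle root committing to every transcript, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so a leaf can
    # never be confused with an interior node.
    level = [_sha256(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Simplification: duplicate the last node on odd-sized levels.
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing a single character in any transcript changes the root, which is what lets a verifier detect edited or swapped transcripts after the run has been signed.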

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "gpqa",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
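For callers who prefer Python to curl, the same request can be assembled with the standard library. The sketch mirrors the JSON body above field for field. The function name build_run_request is made up for the example, and the response format is not handled because it is not documented here.

```python
import json
import urllib.request

RUN_URL = "https://benchlist.ai/api/v1/run"  # endpoint from the curl example


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "gpqa",
        "runs": 1,
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        RUN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To submit the job (needs network access and a valid key):
#   import os
#   with urllib.request.urlopen(build_run_request(os.environ["BENCHLIST_KEY"])) as resp:
#       print(resp.read().decode("utf-8"))
```

Note that "limit": 50 runs a 50-question subset rather than all 198 Diamond questions, so a score from this exact request is noisier than a full run.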

Self-hosted. Install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run gpqa --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What are the canonical decoding parameters for GPQA Diamond?

Per the catalog, GPQA Diamond runs at temperature 0.0 with max_tokens 1024. Deviating without disclosure makes scores incomparable.
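A reporting pipeline can flag undisclosed deviations from these settings mechanically. The helper below is a hypothetical sketch: only the two canonical values (temperature 0.0, max_tokens 1024) come from the text, while the function name and the shape of the config dictionary are assumptions.

```python
from typing import Any, Dict, Tuple

# Canonical decoding parameters stated for GPQA Diamond.
CANONICAL: Dict[str, Any] = {"temperature": 0.0, "max_tokens": 1024}


def decoding_deviations(config: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing or missing parameter to (canonical, actual)."""
    return {
        name: (expected, config.get(name))
        for name, expected in CANONICAL.items()
        if config.get(name) != expected
    }
```

An empty result means the run used the canonical settings. Anything else should be disclosed next to the score, for example a run at temperature 0.7 yields {"temperature": (0.0, 0.7)}.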