Marketing number. Attested number. The gap.

For every tracked model × benchmark pair, we record (a) the number the publisher marketed, (b) the number that landed on-chain when we re-ran the benchmark under the publisher's own methodology, and (c) the delta between the two. Deltas of more than 2 pp in either direction get a red pill. The methodology is open source. Publishers are encouraged to dispute.
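As a rough illustration of the delta and the red-pill rule, here is a minimal sketch. The function names are hypothetical, and the sign convention (attested minus marketed, so a negative delta means the marketed number was higher) is an assumption; only the 2 pp threshold comes from the text above.

```python
# Threshold from the text: deltas beyond 2 percentage points get a red pill.
RED_PILL_THRESHOLD_PP = 2.0


def delta_pp(marketed: float, attested: float) -> float:
    """Delta in percentage points.

    Assumed convention: attested minus marketed, rounded to 2 decimals.
    """
    return round(attested - marketed, 2)


def is_red_pill(delta: float, threshold: float = RED_PILL_THRESHOLD_PP) -> bool:
    """True when the delta exceeds the threshold in either direction."""
    return abs(delta) > threshold


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark results.
    d = delta_pp(marketed=72.5, attested=69.1)
    print(d, is_red_pill(d))
```

A delta of exactly 2 pp is not flagged in this sketch, because the text says "over" rather than "at or over".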

First Contamination Index is being prepared.

We're compiling the initial sample of publisher-claimed numbers across SWE-Bench Verified, MBPP, GSM8K, and LongMemEval. The first report drops the week after our 10th signed SWE-Bench run lands. Sign up to get it in your inbox.

Subscribe to the Index

Four sources → one delta.

Step 1

Scrape marketed numbers

Model-card page, release post, product page, arXiv abstract. We record URL + date + number. Multiple sources per claim when available.
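One way to picture a scraped claim is a small record holding several sources. This is only a sketch: the class and field names are hypothetical, the URLs are placeholders, and whether the date is the publication date or the scrape date is not specified above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Set, Tuple


@dataclass(frozen=True)
class ClaimSource:
    """One place where a marketed number appeared (hypothetical schema)."""
    url: str
    source_date: date   # assumption: could be publication or scrape date
    score: float        # marketed score, in percent


@dataclass(frozen=True)
class MarketedClaim:
    """All recorded sources for one model x benchmark pair."""
    model: str
    benchmark: str
    sources: Tuple[ClaimSource, ...]

    def distinct_scores(self) -> Set[float]:
        """Scores seen across sources; more than one means the sources disagree."""
        return {s.score for s in self.sources}


if __name__ == "__main__":
    # Placeholder model name, URLs, and numbers for illustration only.
    claim = MarketedClaim(
        model="example-model",
        benchmark="GSM8K",
        sources=(
            ClaimSource("https://example.com/model-card", date(2025, 1, 6), 91.0),
            ClaimSource("https://example.com/release-post", date(2025, 1, 7), 91.0),
        ),
    )
    print(claim.distinct_scores())
```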

Step 2

Re-run with our signed adapter

Same dataset hash the publisher referenced. The publisher's recommended decoding config if one is published; otherwise temp=0 and the default max_tokens. Three runs, median reported.
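The config fallback and the three-run median could look roughly like this. It is a sketch under assumptions: `run_once` stands in for whatever the signed adapter actually does, the config keys are hypothetical, and leaving `max_tokens` out of the fallback is how this sketch models "default".

```python
from statistics import median
from typing import Callable, Dict, Optional

# Fallback from the text: temp=0, max_tokens left at its default (omitted here).
FALLBACK_CONFIG: Dict[str, float] = {"temperature": 0.0}


def resolve_config(published: Optional[Dict[str, float]]) -> Dict[str, float]:
    """Use the publisher's decoding config when one exists, else the fallback."""
    return dict(published) if published else dict(FALLBACK_CONFIG)


def attested_score(
    run_once: Callable[[Dict[str, float]], float],
    published: Optional[Dict[str, float]] = None,
    n_runs: int = 3,
) -> float:
    """Run the benchmark n_runs times with one config and report the median."""
    config = resolve_config(published)
    scores = [run_once(config) for _ in range(n_runs)]
    return median(scores)


if __name__ == "__main__":
    # Stand-in for a real benchmark run; the scores are made up.
    fake_scores = iter([70.1, 69.4, 69.9])
    print(attested_score(lambda cfg: next(fake_scores)))
```

With an odd number of runs the median is always one of the observed scores, so a single outlier run cannot drag the reported number.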

Step 3

Sign, anchor, publish

Our signed attestation goes on-chain. The delta row links to the verify page. Publishers can re-run on their own infra and file a dispute if they think ours is wrong.

Step 4

Publish weekly

Dropped every Monday to the subscriber list and posted to /contamination. Historical snapshots are kept, with no silent edits.

To publishers

Not on the Index yet? You don't want to be. Sign your own score before we do: start here. Self-attested runs always trump our re-runs (because you know your stack better), and they keep the delta row at 0.