For every tracked model × benchmark pair, we record (a) the number the publisher marketed, (b) the number that landed on-chain when we re-ran the benchmark under the publisher's own methodology, and (c) the delta between the two. Deltas larger than 2 pp in either direction get a red pill. The methodology is open source. Publishers are encouraged to dispute.
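As a rough illustration of the (a)/(b)/(c) row and the red-pill rule, here is a minimal Python sketch. The class name `IndexRow`, its field names, and the sign convention (re-run minus claimed) are assumptions made for the example, not the Index's actual schema; only the 2 pp threshold comes from the text above.

```python
from dataclasses import dataclass

# From the text: deltas over ±2 percentage points get a red pill.
RED_PILL_THRESHOLD_PP = 2.0


@dataclass(frozen=True)
class IndexRow:
    """One model × benchmark row (hypothetical schema)."""
    model: str
    benchmark: str
    claimed: float     # (a) publisher's marketed score, in percent
    reproduced: float  # (b) score from the re-run, in percent

    @property
    def delta_pp(self) -> float:
        # (c) Assumed sign convention: re-run minus claimed,
        # so a negative delta means the claim was not reproduced.
        return round(self.reproduced - self.claimed, 2)

    @property
    def red_pill(self) -> bool:
        # Strictly greater than the threshold, in either direction.
        return abs(self.delta_pp) > RED_PILL_THRESHOLD_PP


row = IndexRow("example-model", "GSM8K", claimed=92.0, reproduced=88.5)
print(row.delta_pp, row.red_pill)  # -3.5 True
```

Under this reading, a delta of exactly 2.0 pp is not flagged, since the rule says "over".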
We're compiling the initial sample of publisher-claimed numbers across SWE-Bench Verified, MBPP, GSM8K, and LongMemEval. The first report drops the week after our 10th signed SWE-Bench run lands. Sign up to get it in your inbox.
Subscribe to the Index
Model-card page, release post, product page, arXiv abstract. We record URL + date + number. Multiple sources per claim when available.
Same dataset hash the publisher referenced. Publisher's recommended decoding config if published; otherwise temp=0, max_tokens default. Three runs, median reported.
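The re-run rule above (publisher config if published, otherwise temp=0 with default max_tokens; three runs; median reported) could be sketched roughly as follows. The function names, the config keys, and the `run_benchmark` callable are hypothetical stand-ins for illustration, not the actual harness.

```python
from statistics import median
from typing import Callable, Optional

# Fallback from the text: temp=0, max_tokens left at its default.
# None is used here to mean "do not override the default" (an assumption).
FALLBACK_CONFIG = {"temperature": 0.0, "max_tokens": None}


def resolve_decoding_config(published: Optional[dict]) -> dict:
    """Use the publisher's recommended config if one was published."""
    return dict(published) if published else dict(FALLBACK_CONFIG)


def reproduce_score(
    run_benchmark: Callable[[dict], float],
    published_config: Optional[dict] = None,
    n_runs: int = 3,
) -> float:
    """Run the benchmark n_runs times and report the median score."""
    config = resolve_decoding_config(published_config)
    scores = [run_benchmark(config) for _ in range(n_runs)]
    return median(scores)


# Usage with a fake runner that returns three canned scores.
fake_scores = iter([71.2, 69.8, 70.5])
result = reproduce_score(lambda cfg: next(fake_scores))
print(result)  # 70.5
```

Reporting the median rather than the mean keeps a single outlier run from moving the reported number.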
Our signed attestation goes on-chain. The delta row links to the verify page. Publishers can re-run on their own infra and file a dispute if they think ours is wrong.
Dropped every Monday to the subscriber list and posted to /contamination. Historical snapshots are kept; no silent edits.
Not on the Index yet? You don't want to be. Sign your own score before we do: start here. Self-attested runs always trump our re-runs (because you know your stack better), and they keep the delta row at 0.