The honest AI benchmark leaderboard

Don't trust us. Re-run it for $0.50.

Every score on Benchlist is a fresh re-run of a public benchmark, signed, with confidence intervals shown inline and contamination flagged honestly. Anyone can pay fifty cents to challenge any number, and the result lands as an independent receipt. Vendor blog posts shouldn't be the source of truth for which model to ship.

signed re-runs across 11 benchmarks · every one replayable for $0.50.
No card · one-click login from inbox · single test from $5 · try free, no signup → · browse leaderboard · live attested run
Wilson 95% CIs shown · Real HuggingFace samples · Contamination flagged
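For readers who want to check an interval by hand, here is a minimal sketch of the standard Wilson score interval for a pass/fail benchmark. It assumes each sample is scored as correct or incorrect and that z = 1.96 approximates the 95% level; the function name and the example counts are illustrative, not taken from Benchlist's code.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    # The interval is centered on a value shrunk toward 0.5, not on p itself.
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative numbers only: 80 correct answers out of 100 samples.
low, high = wilson_interval(80, 100)
print(f"score 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples or scores near 0% and 100%, which is why it suits short benchmark runs.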
Services
Benchmark suites
Attested runs
Publishers
The replay challenge
Pay $0.50. Run it on a different attestor. Compare receipts.

No other leaderboard service lets you do this. Pick any signed run, queue an independent re-run on a fresh canonical sample, and get a second Ed25519 receipt from a different attestor in under five minutes. Disagreements are public. The protocol does the trust work; we don't ask for it.

Try a replay →
We flag contaminated benchmarks. Free.

GSM8K is saturated, MMLU is everywhere in training corpora, HumanEval predates most modern training cutoffs. Every leaderboard row carries a contamination tier. Read why GSM8K headlines are noise →

For buyers
Find a model.

Top attested score per benchmark, sorted by trust tier first. Self-reported numbers rank below cryptographically signed ones.

Browse /best →
For vendors
Get verified.

Email gets you an API key. POST to /v1/run with your benchmark and model. We sign the run, store it, and email back the verify URL.
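A submission might look roughly like the sketch below. Only the POST method and the /v1/run path come from this page; the host name, the JSON field names, and the Bearer-token header are assumptions made for illustration, so check the API docs for the real request shape. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical host: the real API base URL is not stated on this page.
BASE_URL = "https://benchlist.example"


def build_run_request(api_key: str, benchmark: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/run request."""
    # Field names "benchmark" and "model" are assumed, not documented here.
    body = json.dumps({"benchmark": benchmark, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/run",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the page only says you receive an API key.
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_run_request("YOUR_API_KEY", "gsm8k", "example-model-v1")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it; that step is left out so the sketch runs without network access.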

Get an API key →
For researchers
Audit a score.

Paste any run ID. The Ed25519 signature replays in your browser. No server round-trip, no trust required of us. Real cryptography.

Open the proof viewer →
Settled on
Ethereum mainnet · Aligned Layer
Last verified
All →
Verified ⛓: ZK proof on Ethereum L1 via Aligned Layer
Attested: Ed25519 signed Merkle commitment, browser-replayable
Local: Run on the publisher's hardware, signed locally · browse →
Self-reported: Vendor disclosure, not verified by Benchlist
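The tier list above, combined with the "sorted by trust tier first" rule on /best, can be sketched as a two-key sort. This is a minimal illustration, not Benchlist's ranking code: the lowercase tier strings, the Run record, and the sample scores are all hypothetical, and placing Attested ahead of Local simply follows the order in which the tiers are listed.

```python
from dataclasses import dataclass

# Lower rank sorts first. The relative order of the tiers is an assumption
# based on the order they are listed in.
TIER_RANK = {"verified": 0, "attested": 1, "local": 2, "self-reported": 3}


@dataclass(frozen=True)
class Run:
    model: str
    tier: str
    score: float


def rank_runs(runs: list[Run]) -> list[Run]:
    """Sort by trust tier first, then by score (highest first) within a tier."""
    return sorted(runs, key=lambda r: (TIER_RANK[r.tier], -r.score))


# Made-up rows for illustration only.
runs = [
    Run("model-a", "self-reported", 0.95),
    Run("model-b", "attested", 0.88),
    Run("model-c", "verified", 0.81),
    Run("model-d", "attested", 0.90),
]
ranked = rank_runs(runs)
for r in ranked:
    print(r.tier, r.model, r.score)
```

Note that the self-reported 0.95 lands last even though it is the highest number: under this rule a weaker proof never outranks a stronger one.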
Top scores across cloud + local on real Hugging Face datasets
Full leaderboard →
A certificate of attestation sealed with an emerald wax-seal, threaded to smaller proof cards, Benchlist's commitment chain visualized
Fig. 1. Every score, sealed.
The thesis, briefly
Self-reported numbers are a race to the bottom. Pick a favorable subset, tune to the eval, publish a blog post. Benchlist puts every score behind a cryptographic proof anyone can re-check, on Ethereum, forever.
From the about page

One request. End-to-end.

Watch the complete lifecycle (queue, run, commit, prove, batch, settle on mainnet) in under five seconds. Real SHA-256 commitment computed in your browser.


      
Pipeline
Real API · Post your own

This week on Benchlist.

A rolling seven-day digest of every attestation that landed on-chain. Unedited, unspun, computed live from the same JSON the registry serves.

Full leaderboard →
Attested: runs, 7 days
Gas burned: USD, Ethereum L1
Publishers: unique, this week
Median proof: minutes, commit→chain
Biggest scores, last seven days: top 5
Leader per benchmark: live

Vendor announcements are a starting point. Replayable signed runs are the proof.

Four reasons every benchmark claim needs one.

Contamination
Training data leaks into the test set.
A signed receipt binds a score to a dataset hash. Swapping the set later changes the hash, so the swap is detectable.
Gaming
Leaderboard votes get manipulated.
Cryptographically signed runs cost $5 to fake and anyone can challenge them on-chain.
Audit
Procurement asks how you measured.
Signed dataset hash + methodology hash + Merkle root. You hand over a URL, not a PDF.
Reproducibility
Re-run the claim, bit-for-bit.
Every receipt ships with its replay command. Docker image pinned, seeds pinned, adapter pinned.

We don't replace leaderboards. We sign them.

Every other board runs on trust-me. Benchlist is the cryptographic signature on top.

Board | What they do well | Where they break
HuggingFace Open LLM | Discovery, 85k models indexed | No proof a score is honest
LMSys Chatbot Arena | Vibes, 6M anonymous pairwise votes | Votes are gameable, identity unverified
Artificial Analysis | Specs, 328 models, standard hardware | They run it, you trust them
Benchlist | Every score Ed25519 signed · replayable for $0.50 · optional ZK anchor | Nothing, if you pay the $5
Full comparison →

Two ways to use Benchlist.

Shopping for AI? Get signed quotes. Selling AI? Wear the Certified seal.

For buyers · free
Get signed quotes from 3-5 vendors.

Describe your use case. We match you to vendors whose signed scores on your must-have benchmarks are freshest. Free forever for buyers; vendors pay us per qualified intro.

Request a quote →
For vendors · $499/year
Benchlist Certified.

Quarterly re-attestation on a canonical suite. Seal + embed badge. Priority matching on /quotes. Free dispute coverage. Buyers look for the seal.

Get certified →

Sixteen categories, one standard.

From frontier LLMs to vector search, every listing comes with attested benchmark results.

Recently verified.

View all

Benchmark → attest → publish.

The whole chain is open. You can replay any run bit-for-bit on your own hardware.

Step 1
Run
A trusted attestor runs the benchmark against the service. Full transcripts are stored.
Step 2
Commit
The runner computes a Merkle root over every (prompt, response, judge) tuple plus dataset and methodology hashes.
Step 3
Prove
A ZK proof of the scoring function over the commitment is submitted to Aligned Layer.
Step 4
Verify
The signed receipt lands at /verify/<id>, and anyone can replay it for $0.50. Publishers who opt in also get an Aligned Layer ZK anchor on Ethereum L1.
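The Commit step above can be sketched as follows. This is an illustration under stated assumptions, not Benchlist's actual commitment format: the JSON leaf encoding, the 0x00/0x01 domain-separation prefixes (borrowed from RFC 6962), duplicating the last node on odd levels, and appending the dataset and methodology hashes as two extra leaves are all choices made here for the example.

```python
import hashlib
import json


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(prompt: str, response: str, judge: str) -> bytes:
    """Hash one (prompt, response, judge) tuple as a Merkle leaf."""
    # Assumed encoding: compact JSON array, UTF-8, with a 0x00 leaf prefix.
    encoded = json.dumps(
        [prompt, response, judge], separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return sha256(b"\x00" + encoded)


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until a single root remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate the last node
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def commitment(transcript, dataset_hash: bytes, methodology_hash: bytes) -> str:
    """Merkle root over all tuples plus the dataset and methodology hashes."""
    leaves = [leaf_hash(p, r, j) for (p, r, j) in transcript]
    # Assumption: the two hashes are committed as additional leaves.
    leaves.append(sha256(b"\x00" + dataset_hash))
    leaves.append(sha256(b"\x00" + methodology_hash))
    return merkle_root(leaves).hex()


# Made-up transcript and hashes for illustration only.
transcript = [("2+2?", "4", "correct"), ("3+5?", "9", "incorrect")]
root = commitment(transcript, sha256(b"dataset-v1"), sha256(b"methodology-v1"))
print(root)
```

Because every tuple and both hashes feed into the root, changing a single response, the dataset, or the methodology produces a different root, which is what lets a signed receipt pin a score to exactly one run.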
Aligned Layer

Benchlist uses Aligned Layer, a proof aggregation network on Ethereum, so any claim on this site is a signed, on-chain attestation. Read the integration spec →

Top attested runs.

All benchmarks

Publish a listing buyers actually trust.

Run any benchmark. Get an on-chain proof. Post with a single API call, or fill out a form if you’d rather we do it for you.