How do I get an API key?

POST to /api/v1/submit with {"kind":"signup","contact":"you@email"}. Free, email-only, key arrives within 60 seconds.
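
As a rough illustration of the signup call above, here is a minimal Python sketch that builds the request without sending it. The host `benchlist.example` is a placeholder assumption, since the FAQ gives only the path, and the `Content-Type` header is assumed rather than documented.

```python
import json
import urllib.request

# Assumption: the FAQ gives only the path; replace this placeholder host.
BASE_URL = "https://benchlist.example"


def build_signup_request(contact: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the signup POST described in the FAQ."""
    body = json.dumps({"kind": "signup", "contact": contact}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/submit",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


signup_req = build_signup_request("you@example.com")
# To actually send it: urllib.request.urlopen(signup_req, timeout=10)
```

The response format is not described here, so the sketch stops at building the request.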

Can I try without an API key?

Yes. Hit POST /api/v1/probe with no auth, body {"benchmark":"gsm8k","model":"openrouter/auto","n":3}. Rate-limited to 1 per IP per hour.
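
A hedged sketch of the unauthenticated probe call, again in Python. The host is a placeholder, and two things are assumptions not stated in the FAQ: that the response body is JSON, and that the one-per-hour limit is signalled with HTTP 429.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://benchlist.example"  # assumption: placeholder host


def build_probe_request(benchmark: str = "gsm8k", model: str = "openrouter/auto",
                        n: int = 3, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the no-auth probe POST with the body shown in the FAQ."""
    body = json.dumps({"benchmark": benchmark, "model": model, "n": n}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/probe",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_probe(req: urllib.request.Request):
    """Send the probe; return parsed JSON, or None when rate-limited.

    Assumes a JSON response and HTTP 429 for the rate limit.
    """
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 429:
            return None
        raise


probe_req = build_probe_request()
```

Calling `send_probe(probe_req)` performs the network request; a `None` result would mean waiting before retrying from the same IP.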

Are scores really on-chain?

Every score is Ed25519 signed by our attestor — that's the trust floor. Optional ZK anchor on Ethereum L1 via Aligned Layer is available; today those proofs are queued, not yet anchored.

How do I dispute a number?

Hit /replay with the run URL — anyone can re-run any attestation for $0.50 to challenge. Or file a formal dispute at /disputes with a 0.1 ETH bond.

What benchmarks are supported?

GSM8K, MMLU-Pro, GPQA, ARC-Challenge, HellaSwag, Winogrande, OpenBookQA, MATH-500, TruthfulQA, CommonsenseQA, HumanEval, BigCodeBench, SWE-bench Lite, LongMemEval, MTEB. Full list at /api/benchmarks.json.

The honest AI benchmark leaderboard

Don't trust us. Re-run it for $0.50.

Every score on Benchlist is a fresh re-run of a public benchmark, signed, with confidence intervals shown inline and contamination flagged honestly. Anyone can pay fifty cents to challenge any number, and the result lands as an independent receipt. Vendor blog posts shouldn't be the source of truth for which model to ship.

… signed re-runs across 11 benchmarks · every one replayable for $0.50.

No card · one-click login from inbox · single test from $5 · try free, no signup → · browse leaderboard · live attested run

Wilson 95% CIs shown · Real HuggingFace samples · Contamination flagged
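
For readers unfamiliar with the Wilson interval, here is a minimal Python sketch of the standard Wilson score formula for a benchmark accuracy. It shows the textbook calculation only; how Benchlist computes its intervals internally is not specified here.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(50, 100)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples, which matters for a three-sample probe.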

… Services · … Benchmark suites · … Attested runs · … Publishers

The replay challenge

Pay $0.50. Run it on a different attestor. Compare receipts.

No leaderboard service lets you do this. Pick any signed run, queue an independent re-run on a fresh canonical sample, and get a second Ed25519 receipt from a different attestor in under five minutes. Disagreements are public. The protocol does the trust work; we don't ask for it.

Try a replay →

We flag contaminated benchmarks. Free.

GSM8K is saturated, MMLU is everywhere in training corpora, and HumanEval predates most modern training cutoffs. Every leaderboard row carries a contamination tier. Read why GSM8K headlines are noise →

For buyers

Find a model.

Top attested score per benchmark, sorted by trust tier first. Self-reported numbers ranked below cryptographically-signed ones.

Browse /best →
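
The "trust tier first" ordering for buyers can be sketched as a two-key sort. This is an illustration only: the `Run` record, the lowercase tier names, and the exact tier order (in particular where "local" sits) are assumptions, and the sample rows are made up.

```python
from dataclasses import dataclass

# Assumed ordering: lower rank means more trusted.
TIER_RANK = {"verified": 0, "attested": 1, "local": 2, "self-reported": 3}


@dataclass(frozen=True)
class Run:
    model: str   # hypothetical field names
    score: float
    tier: str


def rank_runs(runs: list[Run]) -> list[Run]:
    """Sort by trust tier first, then by score (highest first) within a tier."""
    return sorted(runs, key=lambda r: (TIER_RANK[r.tier], -r.score))


sample = [
    Run("model-a", 0.91, "self-reported"),
    Run("model-b", 0.80, "attested"),
    Run("model-c", 0.85, "attested"),
    Run("model-d", 0.70, "verified"),
]
ranked = rank_runs(sample)
```

With this ordering a self-reported 0.91 still ranks below a signed 0.70, which is the behavior the paragraph above describes.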

For vendors

Get verified.

Email gets you an API key. POST /v1/run + your benchmark + model. We sign, store, email back the verify URL.

Get an API key →

For researchers

Audit a score.

Paste any run ID. The Ed25519 signature replays in your browser. No server round-trip, no trust required of us. Real cryptography.

Open the proof viewer →

Settled on

Ethereum mainnet · Aligned Layer

Last verified

All →

Verified ⛓: ZK proof on Ethereum L1 via Aligned Layer

Attested: Ed25519 signed Merkle commitment, browser-replayable

Local: Run on the publisher's hardware, signed locally · browse →

Self-reported: Vendor disclosure, not verified by Benchlist

Top scores across cloud + local on real Hugging Face datasets

Full leaderboard →

Fig. 1: Every score, sealed. A certificate of attestation sealed with an emerald wax seal and threaded to smaller proof cards, visualizing Benchlist's commitment chain.

The thesis, briefly

“Self-reported numbers are a race to the bottom. Pick a favorable subset, tune to the eval, publish a blog post. Benchlist puts every score behind a cryptographic proof anyone can re-check, on Ethereum, forever.”

From the about page

One request. End-to-end.

Watch the complete lifecycle (queue, run, commit, prove, batch, settle on mainnet) in under five seconds. Real SHA-256 commitment computed in your browser.

Pipeline

Real API · Post your own

This week on Benchlist.

A rolling seven-day digest of every attestation that landed on-chain. Unedited, unspun, computed live from the same JSON the registry serves.

Full leaderboard →

Attested: … runs, 7 days

Gas burned: … USD, Ethereum L1

Publishers: … unique, this week

Median proof: … minutes, commit→chain

Biggest scores, last seven days · top 5

Leader per benchmark · live

Vendor announcements are a starting point. Replayable signed runs are the proof.

Four reasons every benchmark claim needs one.

Contamination

Training data leaks into the test set.

A signed receipt binds a score to a dataset hash. Swapping the set later is impossible.

Gaming

Leaderboard votes get manipulated.

Cryptographically signed runs cost $5 to fake and anyone can challenge them on-chain.

Audit

Procurement asks how you measured.

Signed dataset hash + methodology hash + Merkle root. You hand over a URL, not a PDF.

Reproducibility

Re-run the claim, bit-for-bit.

Every receipt ships with its replay command. Docker image pinned, seeds pinned, adapter pinned.
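
The idea of binding a score to a dataset hash, mentioned under Contamination and Audit above, can be illustrated with a short Python sketch. The receipt layout and the `dataset_sha256` field name are hypothetical; the point is only that any change to the dataset bytes changes the SHA-256 digest, so a swapped set no longer matches the signed receipt.

```python
import hashlib
import hmac


def dataset_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_receipt(data: bytes, receipt: dict) -> bool:
    """True if the dataset bytes hash to the value recorded in the receipt.

    'dataset_sha256' is a hypothetical field name used for illustration.
    """
    return hmac.compare_digest(dataset_hash(data), receipt["dataset_sha256"])


original = b"question,answer\n2+2,4\n"
receipt = {"dataset_sha256": dataset_hash(original), "score": 1.0}
tampered = original + b"3+3,6\n"
```

Here `matches_receipt(original, receipt)` holds while `matches_receipt(tampered, receipt)` does not. The hash alone only detects a swap; the guarantee also depends on the receipt itself being signed.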

We don't replace leaderboards. We sign them.

Every other board runs on trust-me. Benchlist is the cryptographic signature on top.

Board | What they do well | Where they break
HuggingFace Open LLM | Discovery, 85k models indexed | No proof a score is honest
LMSys Chatbot Arena | Vibes, 6M anonymous pairwise votes | Votes are gameable, identity unverified
Artificial Analysis | Specs, 328 models, standard hardware | They run it, you trust them
Benchlist | Every score Ed25519 signed · replayable for $0.50 · optional ZK anchor | Nothing, if you pay the $5

Full comparison →

Two ways to use Benchlist.

Shopping for AI? Get signed quotes. Selling AI? Wear the Certified seal.

For buyers · free

Get signed quotes from 3-5 vendors.

Describe your use case. We match you to vendors whose signed scores on your must-have benchmarks are freshest. Free forever for buyers; vendors pay us per qualified intro.

Request a quote →

For vendors · $499/year

Benchlist Certified.

Quarterly re-attestation on a canonical suite. Seal + embed badge. Priority matching on /quotes. Free dispute coverage. Buyers look for the seal.

Get certified →

Sixteen categories, one standard.

From frontier LLMs to vector search, every listing comes with attested benchmark results.

Recently verified.

View all

Benchmark → attest → publish.

The whole chain is open. You can replay any run bit-for-bit on your own hardware.

Step 1

Run

A trusted attestor runs the benchmark against the service. Full transcripts are stored.

Step 2

Commit

The runner computes a Merkle root over every (prompt, response, judge) tuple plus dataset and methodology hashes.

Step 3

Prove

A ZK proof of the scoring function over the commitment is submitted to Aligned Layer.

Step 4

Verify

The signed receipt lands at /verify/<id>; anyone replays it for $0.50. Optional Aligned Layer ZK anchor on Ethereum L1 for publishers who opt in.

Aligned Layer

Benchlist uses Aligned Layer, a proof aggregation network on Ethereum, so any claim on this site is a signed, on-chain attestation. Read the integration spec →

Top attested runs.

All benchmarks

Publish a listing buyers actually trust.

Run any benchmark. Get an on-chain proof. Post with a single API call, or fill out a form if you’d rather we do it for you.

Submit service → · Read the docs
