The only one with a proof.

Four other leaderboards exist in AI. Each asks you to take someone's word for it, whether the vendor's or the leaderboard's own. We're the only one where every score is tamper-evident, replayable, and on-chain.

Benchlist

Artificial Analysis

Papers With Code

HuggingFace LB

Chatbot Arena

On-chain proof

✓

…

Pinned dataset hash

✓

…

author-pinned

per-suite

…

Pinned methodology

✓

docs

paper

✓

opaque

Replay command

✓

…

repo link

✓

…

Independent attestor

✓

self

self-reported

HF runs

user votes

Economic security

ETH stake

…

Dispute protocol

✓

…

GitHub issues

discuss tab

…

Category coverage

1 (LLMs)

~4000 papers

1 (OSS LLMs)

1 (chat)

Multi-category services

LLMs · memory · agents · vector DBs · voice · MCP

LLMs only

research

OSS LLMs

LLMs only

Contamination flags

✓

…

occasional

resistant by design

Open API (free)

CORS-open

CSV export

HF hub

occasional

Embed badge

live-updating

…

static SVG

…

Revenue model

verification fees

ads + subs

acquired by Meta

free

research grant

Takes cut of vendor revenue

never

Where each one shines

Best for

Artificial Analysis

Great UX for comparing LLM price vs. speed vs. quality at a glance. Their calibration benchmarks are respected. But everything is self-reported by labs: no replay, no proof.

Best for

HuggingFace Open LLM Leaderboard

They run the evals themselves on HF hardware, which makes them the one trust anchor in the space. But the scope is narrow (open-weight LLMs only), there is no tamper-evidence beyond "trust HF," and there is no formal dispute protocol.

Best for

Chatbot Arena / LMSys

Contamination-resistant by construction (humans vote). But the methodology is opaque and subjective, votes can't be replayed, and it is impossible to certify any specific model's rank.

What only Benchlist does

Cryptographic trust

Every score on our leaderboard corresponds to a verified Ethereum transaction. Click the batch ID to see the proof on-chain. If a score is faked, the proof fails and the listing is never published. No one else in AI evaluation has this.
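
To make "fake a score and the proof fails" concrete, here is a minimal sketch of a tamper-evidence check. It is an illustration, not Benchlist's actual proof format: the record fields, the canonical-JSON encoding, and the use of SHA-256 are all assumptions made for the example, and the "anchored" digest stands in for whatever value the on-chain transaction commits to.

```python
import hashlib
import json


def score_digest(record: dict) -> str:
    """Hash a score record in a canonical form (sorted keys, compact separators).

    Assumption: SHA-256 over canonical JSON. The real commitment scheme may differ.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_anchor(record: dict, anchored_digest: str) -> bool:
    """Return True only if the record hashes to the digest that was anchored."""
    return score_digest(record) == anchored_digest


# Hypothetical record; every field name and value here is illustrative.
published = {
    "service": "example-llm",
    "benchmark": "example-suite",
    "dataset_hash": "abc123",
    "score": 0.812,
}
anchored = score_digest(published)  # stands in for the on-chain value

# Changing any field, including the score, changes the digest.
tampered = dict(published, score=0.95)

print(matches_anchor(published, anchored))  # the honest record verifies
print(matches_anchor(tampered, anchored))   # the edited record does not
```

The point of the sketch is the asymmetry: anyone can recompute the digest from the published record, but nobody can change the record without the digest changing too.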

16 categories, one standard

LLMs aren't the whole stack. Memory layers, code agents, vector DBs, RAG, voice, sandboxes, MCP servers: we attest them all. A buyer of a modern AI product is choosing across all sixteen.

Slashable attestors

Six independent runners, each with posted ETH stake. If an upheld dispute shows they ran a benchmark wrong, the stake is slashed. Economic accountability that no one else has.
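
The slashing rule above can be sketched in a few lines. This is a hypothetical model, not the deployed contract: the `Attestor` type, the `resolve_dispute` function, and the basis-point slash fraction (defaulting to the full stake) are assumptions for illustration. Stake is kept as an integer number of wei to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class Attestor:
    name: str
    stake_wei: int  # posted stake, in wei (1 ETH = 10**18 wei)


def resolve_dispute(attestor: Attestor, upheld: bool, slash_bps: int = 10_000) -> int:
    """Slash an attestor's stake if a dispute against them is upheld.

    slash_bps is the slashed share in basis points (10_000 = 100%).
    Assumption: the actual slash fraction is not stated in the text above.
    Returns the amount slashed, in wei.
    """
    if not 0 <= slash_bps <= 10_000:
        raise ValueError("slash_bps must be between 0 and 10_000")
    if not upheld:
        return 0  # rejected disputes leave the stake untouched
    penalty = attestor.stake_wei * slash_bps // 10_000
    attestor.stake_wei -= penalty
    return penalty


runner = Attestor(name="runner-1", stake_wei=10**18)   # hypothetical 1 ETH stake
resolve_dispute(runner, upheld=False)                  # no change
resolve_dispute(runner, upheld=True, slash_bps=2_500)  # slashes a quarter of the stake
```

Only an upheld dispute moves money, which is what makes the stake a cost of running a benchmark wrong rather than a cost of being challenged.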

Replay, not trust

Every run publishes a one-liner that reproduces it on your hardware. We don't ask you to trust us; we hand you the commands to check us.
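
Once you have replayed a run, the last step is comparing your number with the published one. A minimal sketch of that comparison follows; the function name and the tolerance parameter are assumptions, since the text above does not say how much run-to-run variation is accepted.

```python
import math


def replay_agrees(published: float, replayed: float, tolerance: float = 0.0) -> bool:
    """Check whether a replayed score matches the published score.

    tolerance is an absolute margin for nondeterministic runs
    (assumption: the accepted margin, if any, is set per benchmark).
    """
    return math.isclose(published, replayed, rel_tol=0.0, abs_tol=tolerance)


# Hypothetical scores for illustration.
replay_agrees(0.812, 0.812)                  # exact match passes with zero tolerance
replay_agrees(0.812, 0.815, tolerance=0.005) # within the margin
replay_agrees(0.812, 0.830, tolerance=0.005) # outside the margin: grounds for a dispute
```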

Honest note

HuggingFace runs the evals themselves; that's a legitimate trust model for OSS LLMs. For proprietary APIs, memory, agents, and vector DBs, it falls short. That's the gap we close.

Adjacent categories: eval SaaS + zkML

Eval platforms verify your internal pipeline but publish no leaderboard. zkML startups prove inference ran correctly but don't attest benchmark scores. Neither competes head-on; both are good neighbours.

| Player | What they ship | Overlap with Benchlist |
|---|---|---|
| Braintrust | Offline eval + prod monitoring SaaS, 10+ F50 customers. $100–500/mo. | None. They test. We sign scores. |
| Galileo | Real-time eval guardrails via Luna-2. Cisco acquiring Q4'26. | None. Runtime guards, not benchmarks. |
| Stanford HELM | Holistic eval, 50+ benchmarks, open-source, reproducible. | Upstream dataset partner. We sign their runs. |
| Scale SEAL | Private prompt evals by domain experts. Contamination-resistant. | None. Closed source; we're open. |
| MLCommons MLPerf | Hardware-vendor inference benchmarks, peer-reviewed submissions. | Speed, not reasoning quality. |
| Modulus Labs | zkML proofs of ML inference (up to 1B params). | Proves execution, not score claims. |
| EZKL | Halo2 circuits for model inference. Production REST API. | Proves inference runs. Adjacent primitive. |
| Gensyn | Decentralised compute verification; mainnet live April 2026. | Solves training, not evals. |
| Bittensor SN121 | Validator-scored agent evals on Bittensor consensus. | Adjacent; no Ethereum anchor. |

Sharp line

zkML proves a model ran. Benchlist proves a model scored. Every leaderboard number in AI needs the second proof, and we're the only ones shipping it.

Ready to switch?

If you ship an AI service, a verified Benchlist listing is worth more than any self-reported blog post.

Get verified · Browse services · How it works