The only one with a proof.

We compare against four other AI leaderboards below. Each asks you to trust someone: the vendor, the host, or the crowd. We're the only one where every score is tamper-evident, replayable, and on-chain.

Benchlist
Artificial Analysis
Papers With Code
HuggingFace LB
Chatbot Arena
On-chain proof
Pinned dataset hash
author-pinned
per-suite
Pinned methodology
docs
paper
opaque
Replay command
repo link
Independent attestor
self
self-reported
HF runs
user votes
Economic security
ETH stake
Dispute protocol
GitHub issues
discuss tab
Category coverage
16
1 (LLMs)
~4000 papers
1 (OSS LLMs)
1 (chat)
Multi-category services
LLMs · memory · agents · vector DBs · voice · MCP
LLMs only
research
OSS LLMs
LLMs only
Contamination flags
occasional
resistant by design
Open API (free)
CORS-open
login
CSV export
HF hub
occasional
Embed badge
live-updating
static SVG
static SVG
Revenue model
verification fees
ads + subs
acquired by Meta
free
research grant
Takes cut of vendor revenue
never
never
never
never
never

Where each one shines

Best for
Artificial Analysis

Great UX for comparing LLM price vs. speed vs. quality at a glance. Their calibration benchmarks are respected. But everything is self-reported by labs, with no replay and no proof.

Best for
HuggingFace Open LLM Leaderboard

They run the evals themselves on HF hardware, which is the one trust anchor in the space. But the scope is narrow (open-weight LLMs only), and there is no tamper-evidence beyond "trust HF" and no dispute protocol.

Best for
Chatbot Arena / LMSys

Contamination-resistant by construction (humans vote). But the methodology is opaque, the votes are subjective, results can't be replayed, and no specific model's rank can be certified.

What only Benchlist does

Cryptographic trust

Every score on our leaderboard corresponds to a verified Ethereum transaction. Click the batch ID to see the proof on-chain. If a score is faked, the proof fails and the listing is never published. No one else in AI evaluation has this.
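
As a rough illustration of why a faked score fails verification, the sketch below hashes a score record and compares it with a digest assumed to have been anchored on-chain. The record fields, the canonical JSON encoding, and the choice of SHA-256 are assumptions made for illustration; this page does not specify Benchlist's actual proof format.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Hash a score record via canonical JSON (sorted keys, compact separators)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_anchor(record: dict, anchored_digest: str) -> bool:
    """Return True only if the record hashes to the anchored digest."""
    return record_digest(record) == anchored_digest


# Hypothetical record; field names and values are illustrative only.
published = {"service": "example-llm", "suite": "example-suite", "score": 0.812}
anchored = record_digest(published)  # stands in for the digest stored on-chain

# Changing any field, here the score, changes the digest.
tampered = dict(published, score=0.95)

print(matches_anchor(published, anchored))  # True
print(matches_anchor(tampered, anchored))   # False
```

Any change to the record, however small, produces a different digest, so an edited score no longer matches what was anchored.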

16 categories, one standard

LLMs aren't the whole stack. Memory layers, code agents, vector DBs, RAG, voice, sandboxes, MCP servers: we attest them all. A buyer of a modern AI product is choosing across all sixteen.

Slashable attestors

Six independent runners, each with posted ETH stake. If an upheld dispute shows they ran a benchmark wrong, the stake is slashed. Economic accountability that no one else has.
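
The slashing rule can be sketched as simple integer arithmetic. The page states that runners post ETH stake and that an upheld dispute slashes it, but not the stake size or the slashed fraction, so both numbers below are placeholders, and the function name is hypothetical.

```python
def resolve_dispute(stake_wei: int, upheld: bool, slash_bps: int = 5000):
    """Return (remaining_stake_wei, slashed_wei) after a dispute is resolved.

    slash_bps is the slashed fraction in basis points (5000 = 50%).
    It is an assumed parameter, not a documented Benchlist value.
    """
    if not upheld:
        # A rejected dispute leaves the stake untouched.
        return stake_wei, 0
    # Integer math in wei avoids floating-point rounding on balances.
    slashed = stake_wei * slash_bps // 10_000
    return stake_wei - slashed, slashed


one_eth = 10**18  # hypothetical stake of 1 ETH, expressed in wei

print(resolve_dispute(one_eth, upheld=False))  # (1000000000000000000, 0)
print(resolve_dispute(one_eth, upheld=True))   # (500000000000000000, 500000000000000000)
```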

Replay, not trust

Every run publishes a one-liner that reproduces it on your hardware. We don't ask you to trust us; we hand you the commands to not trust us.
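
A replay check might look like the sketch below: first confirm that the local dataset matches the pinned hash, then compare the replayed score with the published one. The function names, the tolerance, and the idea of allowing a small numeric difference between runs are all assumptions; the actual replay command is published per run and is not reproduced here.

```python
import hashlib
import math


def dataset_sha256(data: bytes) -> str:
    """Hex SHA-256 digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def replay_agrees(published_score: float, replayed_score: float,
                  pinned_hash: str, local_data: bytes,
                  tol: float = 1e-3) -> bool:
    """True if the dataset matches the pinned hash and the scores agree within tol."""
    if dataset_sha256(local_data) != pinned_hash:
        # A different dataset means the replay is not comparable at all.
        return False
    return math.isclose(published_score, replayed_score, rel_tol=0.0, abs_tol=tol)


# Hypothetical dataset and scores, for illustration only.
dataset = b"question,answer\n2+2,4\n"
pinned = dataset_sha256(dataset)

print(replay_agrees(0.812, 0.8124, pinned, dataset))         # True
print(replay_agrees(0.812, 0.8124, pinned, dataset + b"x"))  # False
print(replay_agrees(0.812, 0.830, pinned, dataset))          # False
```

Checking the dataset hash before the score matters: a matching score on different data would not be a reproduction of the published run.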

Honest note

HuggingFace runs the evals itself, and that's a legitimate trust model for OSS LLMs. It falls short for proprietary APIs, memory, agents, and vector DBs. That's the gap we close.

Adjacent categories: eval SaaS + zkML

Eval platforms verify your internal pipeline but publish no leaderboard. zkML startups prove inference ran correctly but don't attest benchmark scores. Neither competes head-on; both are good neighbours.

| Player | What they ship | Overlap with Benchlist |
| --- | --- | --- |
| Braintrust | Offline eval + prod monitoring SaaS, 10+ F50 customers. $100–500/mo. | None. They test. We sign scores. |
| Galileo | Real-time eval guardrails via Luna-2. Cisco acquiring Q4'26. | None. Runtime guards, not benchmarks. |
| Stanford HELM | Holistic eval, 50+ benchmarks, open-source, reproducible. | Upstream dataset partner. We sign their runs. |
| Scale SEAL | Private prompt evals by domain experts. Contamination-resistant. | None. Closed source; we're open. |
| MLCommons MLPerf | Hardware-vendor inference benchmarks, peer-reviewed submissions. | Speed, not reasoning quality. |
| Modulus Labs | zkML proofs of ML inference (up to 1B params). | Proves execution, not score claims. |
| EZKL | Halo2 circuits for model inference. Production REST API. | Proves inference runs. Adjacent primitive. |
| Gensyn | Decentralised compute verification; mainnet live April 2026. | Solves training, not evals. |
| Bittensor SN121 | Validator-scored agent evals on Bittensor consensus. | Adjacent; no Ethereum anchor. |

Sharp line

zkML proves a model ran. Benchlist proves a model scored. Every leaderboard number in AI needs the second proof, and we're the only ones shipping it.

Ready to switch?

If you ship an AI service, a verified Benchlist listing is worth more than any self-reported blog post.

Get verified · Browse services · How it works