Every score on this site is a composite of three orthogonal guarantees — cryptographic integrity, economic accountability, and social replay. This page spells out exactly what each one covers, and where each one ends. No handwaving.
A Benchlist attestation is not one claim. It’s three layered claims, each covering what the others can’t. The composition is what makes a number trustworthy.
- Cryptographic integrity. A ZK proof that the scoring math ran correctly on the committed inputs.
- Economic accountability. An attestor signature backed by stake that can be slashed.
- Social replay. A published replay.command that anyone can rerun bit-for-bit. Divergence > 2σ opens a dispute and risks slashing the original attestor.

No single layer is enough alone. ZK proofs can't know whether an attestor faked transcripts. Attestor stake can't prevent math errors in the scorer. Replay can't happen on private test sets. Together they cover the space.
A Benchlist ZK proof is generated inside an SP1 (or Risc0) zkVM. The zkVM runs the scoring function — the real one, bit-for-bit — and emits a proof that the output is correct given the committed inputs.
The proof's public outputs commit to three values:

- methodologyHash: the exact scorer that was applied
- datasetHash: the exact dataset it was applied to
- transcriptMerkleRoot: the transcripts it consumed

This is the point. ZK gives you "no math errors, no silent substitution." It does not give you "this number is a good measure of intelligence." That last question is semantic; the middle question is cryptographic; we only promise the middle.
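As a sketch of what verification means here (the class and field names are illustrative, not Benchlist's actual API), a verifier only has to check that the proof's public outputs equal the commitments in the attested run record:

```python
from dataclasses import dataclass

@dataclass
class PublicValues:
    """Hypothetical shape of the zkVM proof's public outputs."""
    score: float
    methodology_hash: str
    dataset_hash: str
    transcript_merkle_root: str

def commitments_match(proof_values: PublicValues, run_record: PublicValues) -> bool:
    # The proof binds the score to three commitments; verification reduces
    # to field-by-field equality against the attested record.
    return (
        proof_values.methodology_hash == run_record.methodology_hash
        and proof_values.dataset_hash == run_record.dataset_hash
        and proof_values.transcript_merkle_root == run_record.transcript_merkle_root
        and proof_values.score == run_record.score
    )
```

Any mismatch means the score was not produced by the committed scorer on the committed data, regardless of what the page displays.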
The ZK proof assumes transcripts are real. An attestor signature + stake is what makes that assumption costly to break.
Every run is signed by a registered attestor’s Ed25519 key. The signing payload is the full Merkle root, so the attestor can’t post-hoc swap transcripts. The attestor has ≥1 ETH staked in StakeVault; a dispute upheld by community replay slashes the stake.
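Because the signing payload is the Merkle root, a single transcript swap changes the signed value. A minimal stdlib sketch of such a root (Benchlist's exact leaf encoding and odd-node rule are not specified here; this version promotes an odd node unchanged):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary SHA-256 Merkle root over raw transcript bytes (sketch)."""
    if not leaves:
        raise ValueError("empty transcript set")
    level = [_h(leaf) for leaf in leaves]       # hash each transcript
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):   # pair adjacent nodes
            nxt.append(_h(level[i] + level[i + 1]))
        if len(level) % 2:                      # odd node carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Signing this single 32-byte value covers every transcript: altering any leaf after the fact produces a root the attestor never signed.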
Every run publishes a replay.command. Anyone with access to the service API and the pinned dataset can rerun it. If someone’s fresh run diverges from the attested score by more than 2σ, that’s a dispute-worthy signal.
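The divergence test itself is simple. A sketch, assuming σ is estimated as the sample standard deviation across the attestor's own runs (how σ is defined for single-run attestations is not specified on this page):

```python
import statistics

def diverges(attested_scores: list[float], replay_score: float) -> bool:
    """Dispute-worthy if a fresh replay falls more than 2σ from the
    attested mean. Requires at least two attested runs for stdev."""
    mean = statistics.mean(attested_scores)
    sigma = statistics.stdev(attested_scores)
    return abs(replay_score - mean) > 2 * sigma
```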
Disputes cost 0.1 ETH to file (anti-spam bond, refunded on valid disputes). An accepted dispute slashes up to 100% of the original attestor’s stake, annuls the score, and flags the service listing.
For a Benchlist score to be sufficient — meaning a reasonable buyer can treat it as dispositive — the following conditions compose:
1. datasetHash resolves to a public IPFS object. You can re-download and re-hash.
2. methodologyHash is a specific git commit of the runner repo. The repo is MIT-licensed and forkable.
3. The proof batch verifies on-chain (ServiceManager.verifyBatchInclusion at 0xeF2A…606c).
4. The attestor holds active stake in StakeVault with no pending disputes.

Conditions 1-4 are cryptographic/economic. Condition 5 is temporal. Condition 6 is editorial. The union is what we mean when we show a Verified ⛓ badge.
A truthful methodology page lists its own holes. Ours:
When a buyer asks “should I trust this number?” the answer is: to the extent these gaps matter for your use case. A research-grade bench comparison tolerates them. A compliance audit may not; for that we offer Private benchmarks + TEE attestation (see Enterprise).
Upstream benchmark repos get re-versioned constantly: questions are added, judges get fixed, labels drift. We snapshot, serialize canonically (JSON-Lines, sorted keys, UTF-8 NFC, LF line endings), compute SHA-256 over the concatenation, and pin. Every run references that hash. Raw bytes live on IPFS.
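The canonical serialization can be sketched in a few lines of stdlib Python (the compact separators and trailing newline are assumptions; the page specifies only JSON-Lines, sorted keys, UTF-8 NFC, and LF endings):

```python
import hashlib
import json
import unicodedata

def canonical_dataset_hash(records: list[dict]) -> str:
    """Pin a dataset snapshot: JSON-Lines, sorted keys, UTF-8 NFC,
    LF line endings, SHA-256 over the concatenation."""
    lines = []
    for rec in records:
        line = json.dumps(rec, sort_keys=True, ensure_ascii=False,
                          separators=(",", ":"))
        lines.append(unicodedata.normalize("NFC", line))
    # LF endings; a trailing newline is assumed here
    payload = "\n".join(lines).encode("utf-8") + b"\n"
    return "sha256:" + hashlib.sha256(payload).hexdigest()
```

Because keys are sorted and Unicode is normalized before hashing, two snapshots of the same records always produce the same hash, while any content change produces a different one.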
Changing a single character in the canonical serialization changes the hash. Any downstream proof that references the old hash fails verification automatically.
The runner is a pipx-installable Python package (benchlist-runner). Each run pins:
- runnerRepo: URL of the source repo
- runnerCommit: seven-character short SHA of the commit
- runnerVersion: semver tag for human reference

The committed Merkle root includes a hash of the compiled runner binary too, so a compile-time substitution is caught.
Every benchmark declares canonical decoding: temperature, top-p, max tokens, stop sequences, presence/frequency penalties, system prompt. Runs that deviate carry a non-canonical flag and do not show on the default leaderboard.
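The flag check reduces to comparing a run's declared parameters against the benchmark's canon. A sketch with illustrative canonical values (not any real benchmark's declaration):

```python
# Illustrative canonical decoding declaration, not a real benchmark's.
CANONICAL = {
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 2048,
    "stop": [],
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "system_prompt": "",
}

def non_canonical(run_params: dict) -> bool:
    """Flag a run if any declared parameter deviates from canon.
    Missing parameters default to the canonical value."""
    return any(run_params.get(k, v) != v for k, v in CANONICAL.items())
```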
Judge-required benchmarks (LongMemEval, FRAMES) pin the judge model by exact fingerprint and the judge prompt by hash. The same judge is used by every attestor for that benchmark version; if the upstream judge model rotates (e.g. OpenAI updates gpt-4o), we fork the benchmark to a new methodologyHash rather than silently accepting drift.
Benchmarks leak into training sets over time. We flag runs as contaminated where public evidence (model release notes, third-party analysis) suggests this. HumanEval and MBPP are flagged for GPT-4-class models and newer. Contaminated scores are shown as lower bounds, not rank-order signals.
Benchmark gaming includes: cherry-picking runs, tuning to the test set, prompt engineering against a specific leaderboard. Our defenses:
Attestors post 1-5 ETH stake in StakeVault. At ETH ≈ $3,600 that’s a $3,600-$18,000 sybil barrier per attestor identity. Quorum mode requires 3-of-5 independent attestors, raising the bar to collusion across multiple staked identities. For compliance-grade work (regulatory, insurance), Benchlist Enterprise operates a vetted pool of KYC’d attestors with additional legal recourse.
Every run exposes a single command that reproduces it:
benchlist run longmemeval \
--service rem-labs \
--model claude-opus-4-7 \
--runs 3 \
--dataset-hash sha256:a1b3… \
--methodology-hash sha256:c3d5…
If your score diverges from the attested one by > 2σ, file a dispute. Your 0.1 ETH bond is refunded when the dispute is upheld.
Dispute resolution is on-chain via DisputeManager. The resolution function runs a re-scoring inside SP1 using the disputant's fresh transcripts. If the re-score diverges, the original attestor's stake is slashed proportionally to the deviation. The bond system follows Optimism's fault-proof model: adversarial replay as the ultimate cheap gatekeeper.
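"Proportional to the deviation" could take many shapes; the contract's actual curve is not specified on this page. One plausible sketch, with an assumed linear ramp from the 2σ tolerance up to a full slash at 10σ:

```python
def slash_fraction(deviation_sigma: float) -> float:
    """Fraction of the attestor's stake to slash, as a function of the
    re-score deviation in sigmas. No slash inside the 2σ tolerance;
    linear ramp above it; capped at 100%. The 10σ full-slash point is
    an illustrative assumption, not the contract's actual parameter."""
    if deviation_sigma <= 2.0:
        return 0.0
    return min(1.0, (deviation_sigma - 2.0) / 8.0)
```

A curve like this keeps small, plausibly honest discrepancies cheap while making large fabrications economically fatal.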
Every benchmark page carries its exact datasetHash, methodologyHash, runner repo, canonical decoding, judge config, and contamination flag. Every run page also shows the client-side recomputation of the commitment — in your browser, using SubtleCrypto, so you don’t have to trust us to display the numbers.