Every run pins 7 hashes.
Most "verified" benchmark sites just sign a number. We sign the code that produced the number. Every run.json embeds a 7-field provenance block, nearly all of it SHA-256 hashes, with each field individually re-derivable by anyone with our public source. The on-chain ZK batch (when anchored on Aligned) commits to a single digest over all seven, so a single on-chain commitment pins the entire test setup.
The 7 hashes
datasetHash: SHA-256 of the canonical HuggingFace sample set. If anyone changes which problems we ran against, this hash changes and the proof breaks. Verify by re-fetching the dataset and hashing it.
methodologyHash: SHA-256 of the scoring rules (e.g. "exact match on stripped output," "pass@1 on Python AST equivalence"). Pinning this means we can't quietly soften the grader between runs.
transcriptMerkleRoot: Root of a Merkle tree over every (prompt, response, judge_verdict) tuple. The SP1 program in our zkVM re-derives this root from the leaves and asserts equality, so a counterfeit transcript would fail to prove.
score: The value that becomes the public number. Pinned to 6 decimal places inside the proof so floating-point drift can't hide a different result.
runner_provenance.runner_commit: Git SHA of the runner repo commit that produced this run. git checkout $runner_commit reproduces the exact source tree.
runner_provenance.adapter_hash: SHA-256 of the adapter source file (e.g. runner/adapters/humaneval.py) that loaded the dataset, queried the model, and graded responses. Adapter source is MIT-licensed and on GitHub.
SHA-256 hashes of the judge function (where applicable), the requirements.txt lockfile, the system prompt, and the chat template. All are bundled into the runner_provenance.digest, which becomes a public input to the SP1 proof.
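To make the transcriptMerkleRoot entry concrete, here is a minimal Python sketch of a Merkle root over (prompt, response, judge_verdict) tuples. The leaf encoding (compact JSON), the 0x00/0x01 prefixes that separate leaves from inner nodes, and the rule of duplicating the last node on odd levels are all assumptions made for illustration. The real SP1 program may encode and combine leaves differently, so this sketch is not expected to reproduce the root of an actual run.

```python
import hashlib
import json


def leaf_hash(prompt: str, response: str, judge_verdict: str) -> bytes:
    """Hash one transcript tuple (the compact-JSON encoding is an assumption)."""
    encoded = json.dumps(
        [prompt, response, judge_verdict],
        ensure_ascii=False,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(b"\x00" + encoded).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until a single root remains."""
    if not leaves:
        raise ValueError("cannot build a Merkle tree with no leaves")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last node when a level has odd length.
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical transcript, for illustration only.
transcript = [
    ("2+2?", "4", "pass"),
    ("Capital of France?", "Paris", "pass"),
    ("Reverse 'abc'", "cab", "fail"),
]
root = merkle_root([leaf_hash(*t) for t in transcript])

# Flipping a single verdict produces a different root.
tampered = list(transcript)
tampered[2] = ("Reverse 'abc'", "cab", "pass")
tampered_root = merkle_root([leaf_hash(*t) for t in tampered])

print(root.hex())
print(root != tampered_root)
```

Because every leaf feeds into the root, editing one verdict changes the root, which is the property the proof relies on when it asserts equality.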
The on-chain commitment
When a run is anchored via Aligned Layer, the SP1 zkVM proof commits to one final digest derived from all 7 hashes:
final_digest = SHA-256( dataset_hash | methodology_hash | merkle_root | claimed_score | runner_provenance.digest )
That single digest is what Aligned's BatcherPaymentService records on Ethereum L1. To pass off a forged run, an adversary would have to either: (a) break the SP1 zkVM (hard), (b) find a SHA-256 collision (essentially impossible), or (c) fabricate a different runner repo that hashes to the same commit (infeasible without breaking the hash function behind Git's commit IDs).
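The final_digest formula can be sketched in Python as follows. This is an illustration under stated assumptions, not the code that runs inside the zkVM: it treats "|" as a literal separator between hex strings (mirroring the digest script in the re-verification steps), and it renders the score as a fixed 6-decimal string. The function name, the argument order, and the input hashes are hypothetical placeholders.

```python
import hashlib


def final_digest(dataset_hash: str, methodology_hash: str, merkle_root: str,
                 claimed_score: float, runner_digest: str) -> str:
    """Combine the pinned values into one SHA-256 digest (encoding is assumed)."""
    # Assumption: the score is serialized with exactly 6 decimal places.
    score_text = f"{claimed_score:.6f}"
    preimage = "|".join(
        [dataset_hash, methodology_hash, merkle_root, score_text, runner_digest]
    )
    return hashlib.sha256(preimage.encode("utf-8")).hexdigest()


# Placeholder inputs: real values would come from run.json.
dataset_hash = hashlib.sha256(b"dataset").hexdigest()
methodology_hash = hashlib.sha256(b"methodology").hexdigest()
root_hex = hashlib.sha256(b"transcript").hexdigest()
runner_digest = hashlib.sha256(b"runner").hexdigest()

base = final_digest(dataset_hash, methodology_hash, root_hex, 0.8537, runner_digest)
drifted = final_digest(dataset_hash, methodology_hash, root_hex, 0.85370000001, runner_digest)
changed = final_digest(dataset_hash, methodology_hash, root_hex, 0.853701, runner_digest)

print(base)
print(base == drifted)  # noise below the 6th decimal place does not change the digest
print(base == changed)  # a difference at the 6th decimal place does
```

Fixing the score to a 6-decimal string before hashing is what lets two machines with slightly different floating-point results agree on the digest, while any change at or above the sixth decimal place yields a different one.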
How to re-verify any run
git clone https://github.com/benchlist/runner
cd runner
git checkout <runner_commit_from_run.json>
# 1. Re-hash the adapter
sha256sum adapters/<benchmark_id>.py
# → must match runner_provenance.adapter_hash
# 2. Re-hash the lockfile
sha256sum requirements.txt
# → must match runner_provenance.lockfile_hash
# 3. Re-derive the digest (set the shell variables below from run.json first)
python -c "
import hashlib
fields = ['$RUNNER_VERSION', '$RUNNER_COMMIT', '$ADAPTER_HASH', '$JUDGE_HASH',
'$LOCKFILE_HASH', '$SYSPROMPT_HASH', '$TEMPLATE_HASH']
print('sha256:' + hashlib.sha256('|'.join(fields).encode()).hexdigest())
"
# → must match runner_provenance.digest
If any of the above doesn't match, the run is invalid. File a dispute at /disputes; if it is upheld, you earn a 0.02 ETH bounty.
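The same three checks can be scripted in Python. The sketch below is hedged in several ways: the key names runner_version, judge_hash, sysprompt_hash, and template_hash are inferred from the shell variable names above and may not match run.json exactly, the helper names are hypothetical, and the demo runs against a throwaway directory with made-up values rather than a real checkout.

```python
import hashlib
import tempfile
from pathlib import Path

# Assumed field order, mirroring the shell snippet above.
DIGEST_FIELDS = [
    "runner_version", "runner_commit", "adapter_hash", "judge_hash",
    "lockfile_hash", "sysprompt_hash", "template_hash",
]


def sha256_file(path) -> str:
    """Hex SHA-256 of a file's bytes, like sha256sum."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def bare_hex(value: str) -> str:
    """Drop an optional 'sha256:' prefix before comparing."""
    return value.removeprefix("sha256:")


def derive_digest(provenance: dict) -> str:
    """Join the recorded fields with '|' and hash them, as in step 3."""
    joined = "|".join(provenance[name] for name in DIGEST_FIELDS)
    return "sha256:" + hashlib.sha256(joined.encode()).hexdigest()


def verify_run(provenance: dict, repo_root, benchmark_id: str) -> dict:
    """Return a pass/fail flag for each of the three checks."""
    repo = Path(repo_root)
    adapter = repo / "adapters" / f"{benchmark_id}.py"
    lockfile = repo / "requirements.txt"
    return {
        "adapter_hash": sha256_file(adapter) == bare_hex(provenance["adapter_hash"]),
        "lockfile_hash": sha256_file(lockfile) == bare_hex(provenance["lockfile_hash"]),
        "digest": derive_digest(provenance) == provenance["digest"],
    }


# Demo against a temporary stand-in for the checked-out runner repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "adapters").mkdir()
    (repo / "adapters" / "demo.py").write_text("print('adapter')\n")
    (repo / "requirements.txt").write_text("example==1.0\n")

    provenance = {
        "runner_version": "0.0.0",                      # made-up value
        "runner_commit": "0" * 40,                      # made-up value
        "adapter_hash": sha256_file(repo / "adapters" / "demo.py"),
        "judge_hash": "sha256:" + "1" * 64,             # made-up value
        "lockfile_hash": "sha256:" + sha256_file(repo / "requirements.txt"),
        "sysprompt_hash": "sha256:" + "2" * 64,         # made-up value
        "template_hash": "sha256:" + "3" * 64,          # made-up value
    }
    provenance["digest"] = derive_digest(provenance)

    clean = verify_run(provenance, repo, "demo")

    # Modify the adapter on disk and check again.
    (repo / "adapters" / "demo.py").write_text("print('tampered')\n")
    tampered = verify_run(provenance, repo, "demo")

print(clean)
print(tampered)
```

In this sketch the digest check only covers the values recorded in the provenance block, so a modified adapter file is caught by the adapter_hash check rather than the digest check. That is why the file hashes and the digest are verified separately.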
We don't sign the model weights themselves. Open-weight models are reproducible by their HuggingFace commit. Closed models (Claude, GPT) are pinned by the API string only, so the provider could quietly swap the underlying weights without us noticing. That is exactly what Provider Verified is designed to detect, via continuous canonical-vs-host drift attestation.