Where every dollar goes,
broken out per test.

We charge $5 per attested test on standard benchmarks. This page shows the actual cost columns: inference, attestor compute, ZK proof generation, mainnet gas, hosting, margin. Per-benchmark where it varies. Iteration levers where we can squeeze it down.

01 What your $5 pays for.

You bring the inference bill.

Benchlist runs the benchmark against your model provider key (Anthropic, OpenAI, OpenRouter, etc.). The inference cost is billed directly by that provider to your account; we never touch it. Our $5 covers everything else: attestor execution, Merkle commit, ZK proof generation, Aligned Layer submission, mainnet gas, IPFS hosting, platform ops.

So the $5 breakdown covers our operational stack only, not your inference spend. A “standard” test means ~1,000 problems, an SP1 proof, and an Aligned Layer batch of 32.

Line item | Cost | Paid to | Notes
Attestor compute (CPU + I/O + scoring) | $0.40 | Attestor operator | Amortized over 50 runs/mo per node
ZK proof generation (SP1) | $1.50 | Attestor operator | SP1 prover, 100M cycles · RTX 5090-class
Ethereum L1 gas (Aligned batch) | $1.30 | Ethereum validators | $42 batch / 32 runs · scales w/ base fee
IPFS + edge hosting | $0.10 | Pinata / Vercel | Transcripts + dataset mirror, amortized
Platform margin | $1.70 | Benchlist (Slopshop Inc.) | Team, support, runner + SDK maintenance
Total Benchlist fee | $5.00 | | Your inference bill is separate

Example: You run HumanEval against Claude Sonnet 4.5 with your own API key. Anthropic bills you ~$0.05 for the inference (their rate, their bill). We bill you $5 for the attested execution + proof + on-chain submission. Total out of pocket: $5.05.
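As a quick check on the arithmetic, the line items above can be summed in a few lines. This is only an illustrative sketch: the dictionary keys and function names are made up for this example, and the amounts are copied from the table.

```python
from decimal import Decimal

# Line items copied from the cost table above (USD per standard test).
# The key names are illustrative, not identifiers from any Benchlist API.
BENCHLIST_FEE_ITEMS = {
    "attestor_compute": Decimal("0.40"),
    "zk_proof_generation_sp1": Decimal("1.50"),
    "ethereum_l1_gas": Decimal("1.30"),
    "ipfs_edge_hosting": Decimal("0.10"),
    "platform_margin": Decimal("1.70"),
}


def benchlist_fee() -> Decimal:
    """Sum of the operational line items (excludes inference)."""
    return sum(BENCHLIST_FEE_ITEMS.values(), Decimal("0"))


def total_out_of_pocket(inference_usd: Decimal) -> Decimal:
    """Benchlist fee plus whatever the model provider bills separately."""
    return benchlist_fee() + inference_usd


print(benchlist_fee())                       # 5.00
print(total_out_of_pocket(Decimal("0.05")))  # 5.05
```

`Decimal` is used instead of `float` so that cent amounts add up exactly.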

02 Per-benchmark pricing tiers.

Since inference is billed by your own model provider, our fee varies only with attestor compute complexity and proof size. Three tiers:

Tier 1
Standard · $5

Any benchmark of up to ~20k problems with simple scoring and no external harness. Covers 68 of our 82 suites, including HumanEval, MBPP, MMLU, GPQA, GSM8K, IFEval, and FRAMES: all $5 regardless of dataset size.

Tier 2
Long-context · $10

Benchmarks with ≥32k-token contexts, which need more proof cycles to Merkle-commit large transcripts: NIAH, RULER, LongBench, ∞Bench, LongMemEval. Higher proof-generation cost, same Benchlist margin.

Tier 3
Agent · $25

Multi-step Docker or browser harness. SWE-bench family, τ-Bench, WebArena, OSWorld. Attestor runs Docker / Playwright / VMs; hosting cost is real, so is the fee.

We used to quote $15-$50 for complex suites to cover inference too; no longer. With your-own-key inference, complex suites are a flat $25 (Tier 3) from us, plus whatever your model provider charges separately.
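The three tiers amount to a lookup from suite to flat fee. The sketch below assumes a hand-written mapping limited to suites named on this page; the `Tier` enum and `benchlist_fee_usd` function are hypothetical names, not part of any published SDK.

```python
from enum import Enum


class Tier(Enum):
    """Flat Benchlist fee in USD per tier, as listed above."""
    STANDARD = 5
    LONG_CONTEXT = 10
    AGENT = 25


# Suites named on this page, mapped to their tier. Not exhaustive.
SUITE_TIER = {
    "HumanEval": Tier.STANDARD,
    "MBPP": Tier.STANDARD,
    "GSM8K": Tier.STANDARD,
    "IFEval": Tier.STANDARD,
    "FRAMES": Tier.STANDARD,
    "NIAH": Tier.LONG_CONTEXT,
    "RULER": Tier.LONG_CONTEXT,
    "LongMemEval": Tier.LONG_CONTEXT,
    "τ-Bench": Tier.AGENT,
    "SWE-bench Verified": Tier.AGENT,
    "WebArena": Tier.AGENT,
}


def benchlist_fee_usd(suite: str) -> int:
    """Return the flat fee for a suite; raises KeyError for unknown suites."""
    return SUITE_TIER[suite].value


print(benchlist_fee_usd("RULER"))  # 10
```

The fee depends only on the tier, never on the model, which is why the function takes no model argument.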

Illustrative matrix (your cost vs. ours)

The “Your inference” column is a rough estimate at Sonnet 4.5 rates; your actual bill depends on which model you pick. “Benchlist fee” is what we charge.

Benchmark | Problems | Your inference (~Sonnet) | Benchlist fee | Total out-of-pocket | Tier
HumanEval | 164 | ~$0.05 | $5 | ~$5.05 | Standard
MBPP | 974 | ~$0.30 | $5 | ~$5.30 | Standard
MMLU-Pro | 12,032 | ~$2.40 | $5 | ~$7.40 | Standard
GSM8K | 1,319 | ~$0.40 | $5 | ~$5.40 | Standard
GPQA Diamond | 448 | ~$0.50 | $5 | ~$5.50 | Standard
IFEval | 541 | ~$0.35 | $5 | ~$5.35 | Standard
FRAMES | 824 | ~$3.60 | $5 | ~$8.60 | Standard
LongMemEval | 500 | ~$1.60 | $10 | ~$11.60 | Long-ctx
NIAH (128k ctx) | 20 | ~$0.80 | $10 | ~$10.80 | Long-ctx
RULER | 2,600 | ~$4.00 | $10 | ~$14.00 | Long-ctx
τ-Bench | 230 trajectories | ~$12.00 | $25 | ~$37.00 | Agent
SWE-bench Lite | 300 | ~$14.00 | $25 | ~$39.00 | Agent
SWE-bench Verified | 500 | ~$28.00 | $25 | ~$53.00 | Agent
WebArena | 812 | ~$38.00 | $25 | ~$63.00 | Agent

Your inference estimate is at Claude Sonnet 4.5 rates ($3 in / $15 out per 1M tokens). Opus or o1-class roughly 4×; GPT-4o-mini roughly 1/6×. You see the exact charge on your provider's dashboard after the run. Our fee is fixed per tier regardless of which model you pick.
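To estimate your own inference bill before a run, a rough calculation from token counts is enough. The sketch below uses the Sonnet 4.5 rates quoted above; the per-problem token counts in the example are placeholder assumptions, not measurements of any benchmark, so the result is not meant to reproduce the matrix.

```python
# Rates quoted above for Claude Sonnet 4.5 (USD per 1M tokens).
SONNET_INPUT_USD_PER_MTOK = 3.0
SONNET_OUTPUT_USD_PER_MTOK = 15.0


def estimate_inference_usd(
    problems: int,
    input_tokens_per_problem: float,
    output_tokens_per_problem: float,
    input_rate: float = SONNET_INPUT_USD_PER_MTOK,
    output_rate: float = SONNET_OUTPUT_USD_PER_MTOK,
) -> float:
    """Rough provider bill: tokens in each direction times the per-1M rate."""
    input_cost = problems * input_tokens_per_problem / 1_000_000 * input_rate
    output_cost = problems * output_tokens_per_problem / 1_000_000 * output_rate
    return input_cost + output_cost


# Hypothetical workload: 1,000 problems, ~500 input and ~200 output tokens each.
estimate = estimate_inference_usd(1_000, 500, 200)
print(f"~${estimate:.2f} inference + $5 Benchlist fee")
```

Swap in your provider's rates through `input_rate` and `output_rate` to compare models; the Benchlist fee does not change.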

03 Per-proof-system cost.

Publishers can pick a proof system. The tradeoff is prove-time cost vs. on-chain verification cost:

Proof system | Prove time | Prove cost | Proof size | L1 verify gas | Best for
SP1 (default) | 8-18 min | $1.50 | ~1 KB | ~300k | Complex eval code, unmodified Python
Risc0 | 6-14 min | $1.30 | ~900 B | ~280k | GPU-heavy batching
Halo2 (KZG) | 25-60 min | $3.20 | ~750 B | ~220k | Post-quantum, long-horizon claims
Groth16-BN254 | 2-5 min | $0.80 | ~200 B | ~150k | Simple threshold/mean scoring
Plonk (kimchi) | 10-30 min | $2.10 | ~400 B | ~200k | Custom circuits
Signed attestation (fallback) | <1 s | $0.05 | 64 B | ~60k | LLM-judged benchmarks (no ZK-friendly score fn)

Signed attestations carry no ZK guarantee but still get the attestor-stake + community-replay layers. We mark them “Attested” instead of “Verified ⛓” in the UI.
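One way to reason about the tradeoff is to add the prove cost to the amortized verification gas for each system. The sketch below copies prove cost and verify gas from the table and assumes the gas price (25 gwei) and ETH price ($3,600) quoted in the batching section. It ignores any batcher fee, and the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofSystem:
    name: str
    prove_cost_usd: float  # "Prove cost" column
    verify_gas: int        # "L1 verify gas" column
    zero_knowledge: bool   # False only for the signed-attestation fallback


SYSTEMS = [
    ProofSystem("SP1", 1.50, 300_000, True),
    ProofSystem("Risc0", 1.30, 280_000, True),
    ProofSystem("Halo2 (KZG)", 3.20, 220_000, True),
    ProofSystem("Groth16-BN254", 0.80, 150_000, True),
    ProofSystem("Plonk (kimchi)", 2.10, 200_000, True),
    ProofSystem("Signed attestation", 0.05, 60_000, False),
]


def per_run_cost_usd(
    system: ProofSystem,
    batch_size: int = 32,
    gas_price_gwei: float = 25.0,  # assumed, from the batching section
    eth_usd: float = 3600.0,       # assumed, from the batching section
) -> float:
    """Prove cost plus this run's share of verification gas (no batcher fee)."""
    verify_usd = system.verify_gas * gas_price_gwei * 1e-9 * eth_usd
    return system.prove_cost_usd + verify_usd / batch_size


def ui_badge(system: ProofSystem) -> str:
    """Label shown in the UI, per the note above."""
    return "Verified ⛓" if system.zero_knowledge else "Attested"


for s in sorted(SYSTEMS, key=per_run_cost_usd):
    print(f"{s.name:<20} ${per_run_cost_usd(s):.2f}  {ui_badge(s)}")
```

Cost is only one axis; the “Best for” column still decides which systems can express a given scoring function at all.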

04 Iteration levers: how we bring the price down.

Three things move the needle, in order of leverage:

Lever 1 · biggest
Batch size

Gas amortizes per batch. Going from 32 to 128 runs per batch drops gas per run from $1.30 to $0.38. This requires more queued volume, so it comes online as publisher demand grows.

Lever 2
Prover hardware

SP1 + Risc0 have aggressive GPU paths. Moving from 4090-class to H100-class cuts prove time ~40% and per-proof cost ~25%. Capital-intensive but linear.

Lever 3
L2 settlement path

Aligned batches already compress to one L1 proof. Future: an L2 receipt path for dashboards that don’t need mainnet directness. Would drop the gas column to ~$0.10 at the cost of a longer trust path. Not yet live; we prefer L1 honesty.

We publish these numbers internally every month and update this page when the stack shifts. No surprise billing for Ethereum gas: if the base fee triples, we eat it for in-flight runs and adjust new quotes.

04a Prove locally vs. remote.

Proof generation (SP1 or Risc0, both supported, picked per run via --system) is the single most capital-intensive line in the stack. Attestors have three viable paths; the cryptographic output is identical.

Path | Setup cost | Per-proof cost | Break-even | Best for
Local GPU (RTX 4090 / 5090 / A100) | $1,600 – $8,000 hardware | ~$0.20 (power) | ~600 proofs | Dedicated attestors, steady volume, founder-operated
Succinct Prover Network (remote) | $0 | ~$1.50 | n/a | Third-party attestors without hardware; bursty load
Risc0 Bonsai (remote Risc0) | $0 | ~$1.80 | n/a | Publishers preferring Risc0 proof system

The Benchlist reference attestor runs locally on a consumer RTX 5090; SP1 prove time is ≈ 5-12 minutes per standard benchmark, and the marginal cost is electricity only. Third-party attestors who don’t own a GPU set SP1_PROVER_URL + SP1_API_KEY to outsource proving; the runner auto-detects these and routes without code changes. The $1.50 SP1 line item in the main cost table assumes remote proving as a conservative upper bound; local-prove attestors keep that margin.
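The auto-detection described above could look roughly like the following. This is a guess at the behavior, not the runner's actual code: only the two environment variable names come from this page, while `select_prover` and the returned dictionary shape are hypothetical.

```python
import os
from typing import Mapping, Optional


def select_prover(env: Optional[Mapping[str, str]] = None) -> dict:
    """Route to remote proving only when both variables are set and non-empty.

    Hypothetical helper: the real runner's detection logic is not shown here.
    """
    env = os.environ if env is None else env
    url = env.get("SP1_PROVER_URL", "").strip()
    api_key = env.get("SP1_API_KEY", "").strip()
    if url and api_key:
        # The key itself is deliberately not returned, to keep it out of logs.
        return {"mode": "remote", "url": url}
    return {"mode": "local"}


print(select_prover({}))  # {'mode': 'local'}
print(select_prover({"SP1_PROVER_URL": "https://prover.example", "SP1_API_KEY": "k"}))
```

Falling back to local proving when only one variable is set is a design choice made for this sketch; failing loudly on a half-configured environment would be equally defensible.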

05 Batching economics.

Aligned Layer aggregates proofs into a single on-chain verification. The per-run gas cost is:

gas_per_run = (L1_verify_gas × gas_price + batcher_fee) / batch_size

At current mainnet pricing (~25 gwei base fee, ETH ≈ $3,600):

  • Batch of 8 runs: ~$4.20 per run
  • Batch of 32 runs: ~$1.30 per run
  • Batch of 128 runs: ~$0.38 per run
  • Batch of 512 runs: ~$0.12 per run

We default to batches of 32 during launch. The system automatically increases batch size as volume grows; users see their effective price drop accordingly (packs get cheaper per run, pay-as-you-go price stays $5 but margin improves).
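The per-run formula above can be written out with explicit unit conversions. In this sketch the $15 batcher fee is an assumption, chosen only so that a batch at the quoted gas and ETH prices totals the $42 cited in the cost table. The actual fee may vary with batch size, so this sketch is not expected to reproduce every figure in the list above.

```python
GWEI_PER_ETH = 1_000_000_000


def gas_per_run_usd(
    l1_verify_gas: int,
    gas_price_gwei: float,
    eth_usd: float,
    batcher_fee_usd: float,
    batch_size: int,
) -> float:
    """(L1_verify_gas x gas_price + batcher_fee) / batch_size, in USD."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    verify_usd = l1_verify_gas * gas_price_gwei / GWEI_PER_ETH * eth_usd
    return (verify_usd + batcher_fee_usd) / batch_size


# ~300k verify gas (SP1 row), 25 gwei, ETH at $3,600, assumed $15 batcher fee.
per_run = gas_per_run_usd(300_000, 25.0, 3600.0, 15.0, 32)
print(f"${per_run:.2f} per run in a batch of 32")
```

Because the numerator is fixed per batch in this model, doubling the batch size halves the per-run gas share.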

06 Attestor economics.

Attestors earn a share of each run they process. At $5/test, the split is approximately:

  • $1.90 → attestor (compute reimbursement + margin)
  • $1.30 → Ethereum gas
  • $1.80 → Benchlist (platform + hosting + team)

At current pricing, an attestor breaks even at ~50 runs/month per node, assuming a GPU amortized over 36 months. Once fleet demand pushes an attestor above 200 runs/month, it becomes meaningfully profitable at these rates.
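The break-even reasoning can be sketched as fixed monthly hardware cost divided by the margin earned per run. The $1.90 attestor share and the ~$0.20 power cost come from this page; the $3,000 GPU price in the example is an assumed value inside the hardware range quoted earlier, and the function name is illustrative.

```python
import math


def break_even_runs_per_month(
    hardware_usd: float,
    amortization_months: int,
    revenue_per_run_usd: float,
    marginal_cost_per_run_usd: float,
) -> int:
    """Smallest whole number of monthly runs that covers amortized hardware."""
    margin = revenue_per_run_usd - marginal_cost_per_run_usd
    if margin <= 0:
        raise ValueError("each run must earn more than it costs to prove")
    monthly_fixed = hardware_usd / amortization_months
    return math.ceil(monthly_fixed / margin)


# Assumed $3,000 GPU over 36 months; $1.90 attestor share; ~$0.20 power per proof.
print(break_even_runs_per_month(3_000, 36, 1.90, 0.20))  # 50
```

A more expensive GPU or a shorter amortization window pushes the break-even volume up proportionally.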

Operator guide + join flow: /docs#attestors.

07 Price floor: why not cheaper?

We get this question a lot. The honest answer: Ethereum L1 settlement is the floor. The verification contract on mainnet costs gas we don’t control. A proof batch that doesn’t land on L1 isn’t a Benchlist proof by definition.

Competitors who charge <$1 per “verified” test are either:

  • Not actually settling on a public blockchain (just a signed claim on a private server), or
  • Running on a testnet or proprietary rollup (free / near-zero gas but no real security guarantee), or
  • Using a shared batch that rarely lands on-chain (claim of “on-chain settlement” without actual mainnet cadence).

We prefer to be expensive and honest. For use cases that don’t need mainnet directness, the “Signed attestation” fallback above exists at $0.05 amortized.

08 Complex suites.

SWE-bench, τ-Bench, WebArena, and anything requiring sandboxed execution, browser automation, or multi-hour agent trajectories are outside the “standard” cost envelope. These are quoted up-front before any run starts.

Typical quotes:

  • SWE-bench Verified (500 tasks, Docker): $50 per run (cost ~$33)
  • τ-Bench (tool-calling trajectories): $20 per run (cost ~$16)
  • WebArena (browser tasks): $60 per run (cost ~$45)
  • Custom compliance benchmark (negotiated): starts at $2,999 setup + $499/mo

These are posted publicly the same way simple suites are. The $5/test default is for “green” rows on the matrix above.

Iteration discipline

We re-run this cost table on the first of every month with fresh numbers from the attestor fleet. If costs drop, prices drop. If costs rise, we flag it here before changing pricing. The audit trail is in /changelog.