Where every dollar goes,
broken out per test.

We charge $5 per attested test on standard benchmarks. This page shows the actual cost columns: inference, attestor compute, ZK proof generation, mainnet gas, hosting, margin. Per-benchmark where it varies. Iteration levers where we can squeeze it down.

01 What your $5 pays for.

You bring the inference bill.

Benchlist runs the benchmark against your model provider key (Anthropic, OpenAI, OpenRouter, etc.). The inference cost is billed directly by that provider to your account; we never touch it. Our $5 covers everything else: attestor execution, Merkle commit, ZK proof generation, Aligned Layer submission, mainnet gas, IPFS hosting, platform ops.

So the $5 breakdown is our operational stack only, not your inference spend. A “standard” test = ~1,000 problems, SP1 proof, Aligned Layer batch of 32.

Line item	Cost	Paid to	Notes
Attestor compute (CPU + I/O + scoring)	$0.40	Attestor operator	Amortized over 50 runs/mo per node
ZK proof generation (SP1)	$1.50	Attestor operator	SP1 prover, 100M cycles · RTX 5090-class
Ethereum L1 gas (Aligned batch)	$1.30	Ethereum validators	$42 batch / 32 runs · scales w/ base fee
IPFS + edge hosting	$0.10	Pinata / Vercel	Transcripts + dataset mirror, amortized
Platform margin	$1.70	Benchlist (Slopshop Inc.)	Team, support, runner + SDK maintenance
Total Benchlist fee	$5.00		Your inference bill is separate
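
As a quick sanity check, the line items can be summed in a few lines of Python. This is only a sketch that restates the table; the dictionary keys and variable names are illustrative and not part of any Benchlist API.

```python
from decimal import Decimal

# Line items copied from the cost table above (USD per standard test).
LINE_ITEMS = {
    "attestor_compute": Decimal("0.40"),
    "zk_proof_generation_sp1": Decimal("1.50"),
    "ethereum_l1_gas": Decimal("1.30"),
    "ipfs_edge_hosting": Decimal("0.10"),
    "platform_margin": Decimal("1.70"),
}

total_fee = sum(LINE_ITEMS.values(), Decimal("0"))

# Everything except margin is pass-through cost to operators, the chain, or hosts.
pass_through = total_fee - LINE_ITEMS["platform_margin"]
pass_through_share = float(pass_through / total_fee)

print(f"Total Benchlist fee: ${total_fee}")
print(f"Pass-through costs:  ${pass_through} ({pass_through_share:.0%} of the fee)")
```

Using Decimal keeps the cents exact, which matters when the point is that the rows add up to the quoted fee.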

Example: You run HumanEval against Claude Sonnet 4.5 with your own API key. Anthropic bills you ~$0.05 for the inference (their rate, their bill). We bill you $5 for the attested execution + proof + on-chain submission. Total out of pocket: $5.05.

02 Per-benchmark pricing tiers.

Since inference is billed by your own model provider, our fee varies only with attestor compute complexity and proof size. Three tiers:

Tier 1

Standard · $5

Any benchmark up to ~20k problems, simple scoring, no external harness. Covers 68 of our 82 suites. HumanEval, MBPP, MMLU, GPQA, GSM8K, IFEval, and FRAMES are all $5 regardless of dataset size.

Tier 2

Long-context · $10

Benchmarks with ≥32k-token contexts, which need more proof cycles to Merkle-commit large transcripts. NIAH, RULER, LongBench, ∞Bench, LongMemEval. Higher proof-gen cost, same Benchlist margin.

Tier 3

Agent · $25

Multi-step Docker or browser harness. SWE-bench family, τ-Bench, WebArena, OSWorld. Attestor runs Docker / Playwright / VMs; hosting cost is real, so is the fee.

We used to quote $15-$50 for complex suites to cover inference too; we no longer do. With your-own-key inference, complex suites are a flat $25 (Tier 3) from us, plus whatever your model provider charges separately.
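
The tier logic is simple enough to express as a lookup. The sketch below is illustrative only: the tier fees come from this page, but the dictionary names, the function name, and the small benchmark subset are assumptions, not a Benchlist SDK.

```python
# Flat Benchlist fee per tier in USD, as listed above.
TIER_FEES = {"standard": 5, "long_context": 10, "agent": 25}

# An illustrative subset of suites per tier (not the full catalogue).
BENCHMARK_TIERS = {
    "HumanEval": "standard",
    "MMLU": "standard",
    "RULER": "long_context",
    "LongMemEval": "long_context",
    "SWE-bench Verified": "agent",
    "WebArena": "agent",
}


def total_out_of_pocket(benchmark: str, inference_usd: float) -> float:
    """Benchlist tier fee plus whatever the model provider bills separately."""
    fee = TIER_FEES[BENCHMARK_TIERS[benchmark]]
    return round(fee + inference_usd, 2)


# HumanEval with roughly $0.05 of provider-billed inference.
print(total_out_of_pocket("HumanEval", 0.05))
```

The fee depends only on the tier, so the model you pick changes the second argument and nothing else.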

Illustrative matrix (your cost vs. ours)

The “Your inference” column is a rough estimate at Sonnet 4.5 rates; your actual bill depends on which model you pick. “Benchlist fee” is what we charge.

Benchmark	Problems	Your inference (~Sonnet)	Benchlist fee	Total out-of-pocket	Tier
HumanEval	164	~$0.05	$5	~$5.05	Standard
MBPP	974	~$0.30	$5	~$5.30	Standard
MMLU-Pro	12,032	~$2.40	$5	~$7.40	Standard
GSM8K	1,319	~$0.40	$5	~$5.40	Standard
GPQA (main set)	448	~$0.50	$5	~$5.50	Standard
IFEval	541	~$0.35	$5	~$5.35	Standard
FRAMES	824	~$3.60	$5	~$8.60	Standard
LongMemEval	500	~$1.60	$10	~$11.60	Long-ctx
NIAH (128k ctx)	20	~$0.80	$10	~$10.80	Long-ctx
RULER	2,600	~$4.00	$10	~$14.00	Long-ctx
τ-Bench	230 trajectories	~$12.00	$25	~$37.00	Agent
SWE-bench Lite	300	~$14.00	$25	~$39.00	Agent
SWE-bench Verified	500	~$28.00	$25	~$53.00	Agent
WebArena	812	~$38.00	$25	~$63.00	Agent

Your inference estimate is at Claude Sonnet 4.5 rates ($3 in / $15 out per 1M tokens). Opus or o1-class roughly 4×; GPT-4o-mini roughly 1/6×. You see the exact charge on your provider's dashboard after the run. Our fee is fixed per tier regardless of which model you pick.
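
To estimate your own inference bill before a run, a rough calculation from token counts is enough. The sketch below uses the Sonnet 4.5 rates quoted above; the per-problem token counts are hypothetical placeholders, and the function name is ours for illustration.

```python
# Claude Sonnet 4.5 rates quoted above, in USD per 1M tokens.
INPUT_RATE_PER_M = 3.00
OUTPUT_RATE_PER_M = 15.00


def estimate_inference_usd(problems, input_tokens_each, output_tokens_each,
                           multiplier=1.0):
    """Rough provider bill: tokens times rate, scaled for pricier or cheaper models."""
    input_cost = problems * input_tokens_each / 1_000_000 * INPUT_RATE_PER_M
    output_cost = problems * output_tokens_each / 1_000_000 * OUTPUT_RATE_PER_M
    return round((input_cost + output_cost) * multiplier, 2)


# Hypothetical suite: 1,000 problems, 500 input and 200 output tokens each.
sonnet = estimate_inference_usd(1_000, 500, 200)
opus_class = estimate_inference_usd(1_000, 500, 200, multiplier=4.0)
mini_class = estimate_inference_usd(1_000, 500, 200, multiplier=1 / 6)

print(sonnet, opus_class, mini_class)
```

Output tokens dominate at these rates, so suites with long generations cost far more than their problem count alone suggests.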

03 Per-proof-system cost.

Publishers can pick a proof system. The tradeoff is prove-time cost vs. on-chain verification cost:

Proof system	Prove time	Prove cost	Proof size	L1 verify gas	Best for
SP1 (default)	8-18 min	$1.50	~1 KB	~300k	Complex eval code, unmodified Python
Risc0	6-14 min	$1.30	~900 B	~280k	GPU-heavy batching
Halo2 (KZG)	25-60 min	$3.20	~750 B	~220k	Long-horizon claims
Groth16-BN254	2-5 min	$0.80	~200 B	~150k	Simple threshold/mean scoring
Plonk (kimchi)	10-30 min	$2.10	~400 B	~200k	Custom circuits
Signed attestation (fallback)	<1 s	$0.05	64 B	~60k	LLM-judged benchmarks (no ZK-friendly score fn)

Signed attestations carry no ZK guarantee but still get the attestor-stake and community-replay layers. We mark them “Attested” instead of “Verified ⛓” in the UI.

04 Iteration levers: how we bring the price down.

Three things move the needle, in order of leverage:

Lever 1 · biggest

Batch size

Gas amortizes per batch. Going from 32 to 128 runs per batch drops gas per run from $1.30 to $0.38. Requires more queued volume; comes online as publisher demand grows.

Lever 2

Prover hardware

SP1 + Risc0 have aggressive GPU paths. Moving from 4090-class to H100-class cuts prove time ~40% and per-proof cost ~25%. Capital-intensive but linear.

Lever 3

L2 settlement path

Aligned batches already compress to one L1 proof. Future: an L2 receipt path for dashboards that don’t need mainnet directness. Would drop the gas column to ~$0.10 at the cost of a longer trust path. Not yet live; we prefer L1 honesty.

We review these numbers internally every month and update this page when the stack shifts. No Ethereum-gas surprise billing: if the base fee triples, we eat it for in-flight runs and adjust new quotes.

04a Prove locally vs. remote.

Proof generation (SP1 or Risc0, both supported, picked per run via --system) is the single most capital-intensive line in the stack. Attestors have three viable paths; the cryptographic output is identical.

Path	Setup cost	Per-proof cost	Break-even	Best for
Local GPU (RTX 4090 / 5090 / A100)	$1,600 – $8,000 hardware	~$0.20 (power)	~600 proofs	Dedicated attestors, steady volume, founder-operated
Succinct Prover Network (remote)	$0	~$1.50	n/a	Third-party attestors without hardware; bursty load
Risc0 Bonsai (remote Risc0)	$0	~$1.80	n/a	Publishers preferring Risc0 proof system

The Benchlist reference attestor proves locally on a consumer RTX 5090; SP1 prove time is ≈ 5-12 minutes per standard benchmark, and the marginal cost is electricity only. Third-party attestors who don’t own a GPU set SP1_PROVER_URL + SP1_API_KEY to outsource proving; the runner auto-detects these and routes without code changes. The $1.50 SP1 line item in the main cost table assumes remote proving as a conservative upper bound; local-prove attestors keep that margin.
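
The auto-detection can be pictured as a small check on the environment. This is a minimal sketch of that idea, not the runner's actual code: select_prover is a hypothetical function, and only the two variable names (SP1_PROVER_URL, SP1_API_KEY) come from this page.

```python
import os


def select_prover(env=None):
    """Return ("remote", url) when both remote-prover variables are set, else ("local", None)."""
    env = os.environ if env is None else env
    url = env.get("SP1_PROVER_URL")
    api_key = env.get("SP1_API_KEY")
    # Require both values: a URL without a key (or the reverse) falls back to local.
    if url and api_key:
        return ("remote", url)
    return ("local", None)


# Hypothetical values, for illustration only.
print(select_prover({}))
print(select_prover({"SP1_PROVER_URL": "https://prover.example", "SP1_API_KEY": "k"}))
```

Falling back to local proving when the configuration is incomplete is one reasonable choice; a real runner might prefer to fail loudly instead.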

05 Batching economics.

Aligned Layer aggregates proofs into a single on-chain verification. The per-run gas cost is:

gas_per_run = (L1_verify_gas × gas_price + batcher_fee) / batch_size

At current mainnet pricing (~25 gwei base fee, ETH ≈ $3,600):

Batch of 8 runs: ~$4.20 per run
Batch of 32 runs: ~$1.30 per run
Batch of 128 runs: ~$0.38 per run
Batch of 512 runs: ~$0.12 per run
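
The per-run gas formula can be sketched in Python. Two things here are assumptions rather than facts from this page: the batcher fee is treated as a USD amount, and its value ($15) is a hypothetical figure chosen so that a batch of 32 totals roughly the $42 quoted in the cost table. The 300k verify gas is the SP1 figure from the proof-system table.

```python
def gas_per_run_usd(l1_verify_gas, gas_price_gwei, eth_usd,
                    batcher_fee_usd, batch_size):
    """(L1_verify_gas x gas_price + batcher_fee) / batch_size, expressed in USD."""
    verify_cost_eth = l1_verify_gas * gas_price_gwei / 1e9  # gwei -> ETH
    verify_cost_usd = verify_cost_eth * eth_usd
    return (verify_cost_usd + batcher_fee_usd) / batch_size


# ~300k gas at 25 gwei with ETH at $3,600 is about $27 of verification gas.
per_run = gas_per_run_usd(
    l1_verify_gas=300_000,
    gas_price_gwei=25,
    eth_usd=3_600,
    batcher_fee_usd=15.0,  # hypothetical, see the note above
    batch_size=32,
)
print(f"${per_run:.2f} per run")
```

With these inputs the result is about $1.31, in line with the ~$1.30 figure for a batch of 32. The other batch sizes depend on how the batcher fee scales with volume, which this sketch does not model.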

We default to batches of 32 during launch. The system automatically increases batch size as volume grows; users see their effective price drop accordingly (packs get cheaper per run, pay-as-you-go price stays $5 but margin improves).

06 Attestor economics.

Attestors earn a share of each run they process. Your model provider bills you directly, so inference is not part of this split. At $5/test, the split is approximately:

$1.90 → attestor (compute + proof generation)
$1.30 → Ethereum gas
$1.80 → Benchlist (platform + hosting + team)

At current pricing an attestor breaks even at roughly 50 runs/month per node, assuming a GPU amortized over 36 months. Once fleet demand pushes an attestor above 200 runs/month, the node becomes meaningfully profitable at these rates.
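
A rough way to reproduce a break-even figure like this, under stated assumptions: the $1.90 attestor share, the 36-month amortization, and the ~$0.20 per-proof power cost appear on this page, while the $3,000 hardware price is a hypothetical value inside the $1,600 – $8,000 range quoted for local GPUs. The function name is ours.

```python
import math


def break_even_runs_per_month(hardware_usd, amortization_months,
                              attestor_share_usd, power_per_run_usd):
    """Runs per month needed for net per-run income to cover amortized hardware."""
    monthly_hardware = hardware_usd / amortization_months
    net_per_run = attestor_share_usd - power_per_run_usd
    if net_per_run <= 0:
        raise ValueError("per-run share does not cover per-run power cost")
    return math.ceil(monthly_hardware / net_per_run)


runs = break_even_runs_per_month(
    hardware_usd=3_000,        # hypothetical GPU price
    amortization_months=36,
    attestor_share_usd=1.90,
    power_per_run_usd=0.20,
)
print(runs)
```

A pricier GPU moves the break-even up proportionally, so the ~50 runs/month figure should be read as tied to a mid-range card.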

Operator guide + join flow: /docs#attestors.

07 Price floor: why not cheaper?

We get this question a lot. The honest answer: Ethereum L1 settlement is the floor. The verification contract on mainnet costs gas we don’t control. A proof batch that doesn’t land on L1 isn’t a Benchlist proof by definition.

Competitors who charge <$1 per “verified” test are either:

Not actually settling on a public blockchain (just a signed claim on a private server), or
Running on a testnet or proprietary rollup (free / near-zero gas but no real security guarantee), or
Using a shared batch that rarely lands on-chain (claim of “on-chain settlement” without actual mainnet cadence).

We prefer to be expensive and honest. For use cases that don’t need mainnet directness, the “Signed attestation” fallback above exists at $0.05 amortized.

08 Complex suites.

SWE-bench, τ-Bench, WebArena, and anything requiring sandboxed execution, browser automation, or multi-hour agent trajectories are outside the “standard” cost envelope. These are quoted up-front before any run starts.

Typical quotes:

SWE-bench Verified (500 tasks, Docker): $50 per run (cost ~$33)
τ-Bench (tool-calling trajectories): $20 per run (cost ~$16)
WebArena (browser tasks): $60 per run (cost ~$45)
Custom compliance benchmark (negotiated): starts $2,999 setup + $499/mo

These are posted publicly the same way simple suites are. The $5/test default applies to the Standard-tier (“green”) rows in the matrix above.

Iteration discipline

We re-run this cost table on the first of every month with fresh numbers from the attestor fleet. If costs drop, prices drop. If costs rise, we flag it here before changing pricing. The audit trail is in /changelog.