Open-weight models, attested on your hardware.
Every score on this page came from a local Ollama daemon — Q4_K_M quants on a consumer GPU — running the same canonical sample sets as the cloud-API attestations. Each result is Ed25519-signed by the publisher's local attestor and replayable in your browser.
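Replaying an attestation in the browser boils down to one Ed25519 signature check over a canonically serialized run record. A minimal sketch of that check, using the `cryptography` package; the field names (`model`, `bench`, `score`) and canonical-JSON convention are assumptions, not the registry's published schema:

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def canonical_bytes(payload: dict) -> bytes:
    # Stable serialization so signer and verifier hash identical bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

def verify_run(pubkey: bytes, payload: dict, signature: bytes) -> bool:
    """True iff `signature` covers the canonical JSON form of `payload`."""
    try:
        Ed25519PublicKey.from_public_bytes(pubkey).verify(
            signature, canonical_bytes(payload)
        )
        return True
    except InvalidSignature:
        return False
```

Any edit to the record after signing, even flipping one digit of the score, makes verification fail, which is what lets a leaderboard pill vouch for the number behind it.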
The trust ladder
Benchlist accepts scores from four sources, each with a different trust level. Hover any pill on a leaderboard to see why.
Local model leaderboard
Average score across each model's set of attested benchmark runs. Smaller models are not scored on a curve — these are honest numbers.
| # | Model | Runs | Avg score | Perfect |
|---|---|---|---|---|
| Loading… | | | | |
Cross-source comparison by benchmark
For each benchmark we attested locally, see local-attested results next to cloud-attested results and (where applicable) self-reported vendor numbers. Bars are tinted by trust source.
Reproduce locally
Run the same canonical sample sets on your own GPU. Results use the same Ed25519 attestor scheme, are POSTed to /v1/store-run, and land in this same registry.
# Pull the runner
git clone https://github.com/benchlist/runner
cd runner
# Run a model you have via Ollama
ollama pull mistral:latest
BENCHLIST_KEY=bl_live_... python3 _local_runner.py \
  --models mistral-7b-q4km \
  --benches gsm8k,mmlu-pro,arc-challenge,piqa,bbh \
  --limit 3
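After a run completes, the runner's upload step is a plain authenticated POST. A minimal sketch of building that request with the standard library; only the /v1/store-run path and the bl_live_... key format come from this page, while the hostname (a placeholder .example domain) and the Bearer-token header are assumptions about the API:

```python
import json
import urllib.request

def build_store_run_request(
    api_key: str,
    record: dict,
    host: str = "https://api.benchlist.example",  # hypothetical host
) -> urllib.request.Request:
    # JSON body carrying the signed run record; schema is illustrative.
    return urllib.request.Request(
        f"{host}/v1/store-run",
        data=json.dumps(record).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # bl_live_... key from the env var
        },
        method="POST",
    )

# Sending it is one call (network round-trip omitted here):
# urllib.request.urlopen(build_store_run_request(key, signed_record))
```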