Get a signed attestation from your terminal.
Pick your runtime. Run the install line. Authenticate with one command. Post your first attested benchmark in under two minutes.
pip install
pip install benchlist-runner
Then benchlist login opens your browser, issues a free key, and stores it in ~/.benchlist/key. Standalone, zero deps beyond Python 3.10+.
npm install
npm install -g @benchlist/cli
Same surface as the Python CLI. Wraps fetch, ships TypeScript types.
curl-pipe install
curl -sSf https://benchlist.ai/install.sh | sh
Detects your platform and drops a single Python binary into $HOME/.local/bin. Read the script first if you don't trust pipe-to-shell; it's right here.
Just curl the API
curl -sS -X POST https://benchlist.ai/api/v1/probe \
-H "Content-Type: application/json" \
-d '{"benchmark":"gsm8k","model":"anthropic/claude-haiku-4-5","n":3}'
The probe endpoint is anonymous and rate-limited to one request per IP per hour. It takes the same shape as a paid run, with n capped at 3; the result lands at verify_url in the response.
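A minimal Python sketch of the same probe, using only the standard library. The endpoint and payload come from the curl example above; verify_url is the only response field the docs name, so anything else would be an assumption:

import json
import urllib.request

# Same anonymous probe as the curl example above (1 per IP per hour).
req = urllib.request.Request(
    "https://benchlist.ai/api/v1/probe",
    data=json.dumps({"benchmark": "gsm8k",
                     "model": "anthropic/claude-haiku-4-5",
                     "n": 3}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# The result lands at verify_url; open it to see the attested score.
print(body["verify_url"])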
After install, the four-line flow
benchlist login                                 # device-code OAuth, opens browser
benchlist demo                                  # synthesise a sample run, no inference
benchlist run gsm8k --model claude-sonnet-4-5   # real n=50 attestation, $5
benchlist verify run-abc123                     # replays the Ed25519 check locally
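What benchlist verify replays is an ordinary Ed25519 signature check. Here is a sketch of that check in Python with the cryptography package; the function name and the framing of the signed bytes are ours for illustration, not benchlist's actual run format:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def check_attestation(public_key: bytes, payload: bytes, signature: bytes) -> bool:
    # Returns True iff `signature` is a valid Ed25519 signature over `payload`.
    # Field names and byte framing are illustrative, not benchlist's run format.
    try:
        Ed25519PublicKey.from_public_bytes(public_key).verify(signature, payload)
        return True
    except InvalidSignature:
        return False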
For AI agents (MCP)
claude mcp add benchlist python -m benchlist_runner.mcp
Exposes list_benchmarks, get_latest_signed_score, list_runs, verify_run, probe, run_demo as MCP tools. Works with Claude Code, Cursor, Continue, Cline, any MCP host. Full schema at /llms.txt.
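For a host-free smoke test, here is a sketch of driving the same server from Python with the MCP SDK (pip install mcp). The tool names come from the list above; the argument shape for get_latest_signed_score is an assumption, so consult /llms.txt for the real schema:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server the same way `claude mcp add` registers it.
    params = StdioServerParameters(command="python",
                                   args=["-m", "benchlist_runner.mcp"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # should include the six tools above
            # Argument shape is a guess; see /llms.txt for the real schema.
            result = await session.call_tool("get_latest_signed_score",
                                             {"benchmark": "gsm8k"})
            print(result.content)

asyncio.run(main())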
GitHub Action
- uses: benchlist/runner/.github/actions/benchlist-attest@v1
  with:
    benchmark: swe-bench-verified
    model: claude-sonnet-4-5-20250929
    service: anthropic-claude
    n: 50
  env:
    BENCHLIST_KEY: ${{ secrets.BENCHLIST_KEY }}
Outputs: run-id, score, verify-url, signed. Drop into any release pipeline to auto-attest each tagged version.
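To consume those outputs in later steps, give the action step an id (say, id: attest) and reference them; this follow-up step is our illustration, not part of the action:

- name: Surface the attestation
  run: |
    echo "score: ${{ steps.attest.outputs.score }}"
    echo "verify at: ${{ steps.attest.outputs.verify-url }}"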