NIAH · Needle in a Haystack: methodology, history, and how to verify a published score

Q: How do I verify a NIAH · Needle in a Haystack score?

Run the benchmark with benchlist run niah, then replay the Ed25519 signature at /verify/.

History

The task is to retrieve a seeded fact (the needle) from contexts ranging from 4k to 128k tokens. Samples come from a deterministic generator.
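To make "seeded" and "deterministic" concrete, here is a minimal sketch of how such a generator could work. It is not the Benchlist generator: the filler sentences, the passcode wording, the make_sample name, and sizing by sentence count rather than by tokens are all assumptions made for illustration.

```python
import random

# Hypothetical filler text; a real generator would size the context in tokens.
FILLER = [
    "The committee met on Tuesday to review the quarterly figures.",
    "Rain fell steadily over the harbour for most of the afternoon.",
    "Several engineers proposed a simpler design for the bridge.",
    "The library extended its opening hours during the exam period.",
]


def make_sample(seed: int, n_filler: int, depth: float) -> dict:
    """Build one haystack with a single needle, reproducibly from `seed`."""
    rng = random.Random(seed)  # local RNG: the same seed yields the same sample
    secret = rng.randint(100000, 999999)
    needle = f"The secret passcode is {secret}."
    sentences = [rng.choice(FILLER) for _ in range(n_filler)]
    position = int(depth * n_filler)  # depth 0.0 = start, 1.0 = end
    sentences.insert(position, needle)
    return {
        "context": " ".join(sentences),
        "question": "What is the secret passcode?",
        "answer": str(secret),
    }
```

Because every random choice flows from one seeded random.Random instance, two parties who agree on the seed, length, and depth can regenerate the identical sample and compare results.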

This benchmark is indexed in the Benchlist registry. We're working on a deeper guide; request priority coverage if you'd like this article expanded.

How NIAH · Needle in a Haystack is graded

The canonical run uses temperature 0.0 and max_tokens 128. The sample set has 20 problems, graded by accuracy. License: MIT.
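Accuracy here is the fraction of problems answered correctly. The sketch below shows one way to compute it. The matching rule (case-insensitive substring match on the expected value) is an assumption, since this page does not state how a completion is judged correct, and the grade and accuracy names are hypothetical.

```python
def grade(expected: str, completion: str) -> bool:
    """Assumed rule: correct if the completion contains the expected value."""
    return expected.strip().lower() in completion.strip().lower()


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (expected, completion) pairs graded correct."""
    if not pairs:
        raise ValueError("no graded problems")
    return sum(grade(expected, completion) for expected, completion in pairs) / len(pairs)
```

With 20 problems, each problem moves the score by 5 percentage points, so small differences between two reported scores can come down to a single sample.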

Common pitfalls when reporting NIAH · Needle in a Haystack

The same number can mean very different things depending on how it was produced. The biggest failure modes specific to this benchmark:

See /methodology for general benchmark-reporting pitfalls.

Live Benchlist leaderboard

Top attested scores from the Benchlist registry, hydrated client-side from /api/runs.json. Self-reported numbers are de-prioritised: attested results from a real signed transcript always rank above vendor-disclosed ones.
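The ordering rule can be sketched as a two-part sort key. The record fields used here (model, score, attested) are hypothetical; the actual shape of /api/runs.json is not documented on this page.

```python
def rank_runs(runs: list[dict]) -> list[dict]:
    """Order runs so attested results come first, then by descending score."""
    # `not attested` is False (sorts first) for attested runs.
    return sorted(runs, key=lambda r: (not r["attested"], -r["score"]))


runs = [
    {"model": "a", "score": 0.99, "attested": False},  # vendor-disclosed
    {"model": "b", "score": 0.90, "attested": True},
    {"model": "c", "score": 0.95, "attested": True},
]
ranked = rank_runs(runs)
```

Under this rule the vendor-disclosed 0.99 still ranks below both attested runs, which matches the behaviour described above.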

How to ship a NIAH · Needle in a Haystack score that nobody can challenge

Run NIAH · Needle in a Haystack on Benchlist

Benchlist runs the canonical NIAH · Needle in a Haystack sample set, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay, and you can opt into an Aligned Layer ZK anchor on Ethereum L1.
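A Merkle commitment reduces all transcripts to one hash, so a signature over that hash covers every transcript at once. The sketch below uses a common construction (SHA-256, prefixed leaves and interior nodes, last node duplicated on odd levels). The tree layout Benchlist actually uses is not specified here, so treat every detail of this one as an assumption.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[bytes]) -> str:
    """Return the hex Merkle root over the transcripts, in order."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so a leaf
    # can never be mistaken for an interior node.
    level = [_h(b"\x00" + t) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing a single byte in any transcript changes the root, so an attestor only needs to sign the root for the signature to stop matching if a transcript is altered afterwards.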

Hosted runner. POST a job and we email the verify URL when it's done:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "niah",
    "runs": 1,
    "limit": 50,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'
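The same request can be built from Python with only the standard library. This sketch mirrors the curl call above field for field. The build_run_request and submit names are hypothetical, and the response format is not documented on this page, so the body is returned as raw text.

```python
import json
import urllib.request


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "service": "anthropic-claude",
        "model": "claude-sonnet-4.5",
        "benchmark": "niah",
        "runs": 1,
        "limit": 50,
        "proof_system": "signed",
        "inference_api_key": "managed",
    }
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To send the job, call submit(build_run_request(os.environ["BENCHLIST_KEY"])) after importing os. Keeping construction separate from sending makes the payload easy to inspect before anything goes over the network.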

Self-hosted: install benchlist-runner via pip, point it at your inference key, and get a signed run.json:

pip install benchlist-runner
benchlist run niah --service anthropic-claude --model claude-sonnet-4.5 --limit 50
benchlist publish run.json

FAQ

What is NIAH · Needle in a Haystack?

NIAH · Needle in a Haystack is a long-context retrieval benchmark indexed by Benchlist. The model must retrieve a seeded fact from contexts of 4k to 128k tokens, and grading is deterministic.
