SWE-bench Lite — methodology, scoring, and how to verify a published score

What is SWE-bench Lite?

SWE-bench Lite measures a model's ability to resolve real GitHub issues: given a repository snapshot and an issue description, the model must produce a patch, which is scored deterministically by running the repository's test suite. The benchmark contains 300 problems, scored using accuracy (the percentage of problems resolved). Released under the CC0-1.0 license, it has become one of the canonical evaluations cited in model cards, vendor announcements, and procurement decisions.

The benchmark is run end-to-end as a fixed pipeline: load the dataset, prompt the model under a published decoding configuration (temperature 0.0, max tokens 8192), capture every response, and grade against the canonical rubric. The output is a single score plus the per-problem transcripts that produced it.
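
The pipeline above can be sketched in a few lines. This is a minimal illustration, not Benchlist's runner: `Problem`, `run_benchmark`, and the `generate` and `grade` callables are hypothetical names, and the model and grader are stubbed so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Decoding configuration quoted in the text above.
DECODING = {"temperature": 0.0, "max_tokens": 8192}


@dataclass(frozen=True)
class Problem:
    problem_id: str
    prompt: str


def run_benchmark(
    problems: list[Problem],
    generate: Callable[[str, dict], str],
    grade: Callable[[Problem, str], bool],
) -> dict:
    """Prompt the model on every problem, keep each transcript, and score."""
    transcripts = []
    for problem in problems:
        response = generate(problem.prompt, DECODING)
        transcripts.append(
            {
                "problem_id": problem.problem_id,
                "response": response,
                "passed": grade(problem, response),
            }
        )
    passed = sum(t["passed"] for t in transcripts)
    score = 100.0 * passed / len(transcripts) if transcripts else 0.0
    return {"score": score, "decoding": DECODING, "transcripts": transcripts}


# Stub model and grader (hypothetical), so the sketch runs without an API key.
demo_problems = [Problem("p1", "fix bug A"), Problem("p2", "fix bug B")]
demo_result = run_benchmark(
    demo_problems,
    generate=lambda prompt, cfg: "patch for " + prompt,
    grade=lambda problem, response: problem.problem_id == "p1",
)
print(demo_result["score"])
```

The returned dictionary carries both the single score and the per-problem transcripts, which is what makes the result re-gradable later.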

Why SWE-bench Lite matters

Code benchmarks are how labs justify pricing and how integrators decide which model to wire into a coding agent. A two-point gap on a widely cited benchmark such as HumanEval or SWE-bench Lite can shift millions of dollars in API spend.

For developers wiring a model into a product, SWE-bench Lite is the closest thing to a clean signal of capability — but only if the score was produced honestly. The gap between two cards reading "SWE-bench Lite: 92.3" and "SWE-bench Lite: 91.8" can determine which API gets shipped, but neither number is meaningful without methodology, transcripts, and a way to replay the result.
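
To see why a half-point gap is not meaningful on its own, one rough approach is to treat each of the 300 problems as an independent pass/fail trial and compute the binomial standard error of each score. The helper name `score_standard_error` is hypothetical, and the independence assumption is a simplification.

```python
import math


def score_standard_error(score_pct: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


se_a = score_standard_error(92.3, 300)
se_b = score_standard_error(91.8, 300)
# Standard error of the difference between two independent runs.
se_gap = math.sqrt(se_a**2 + se_b**2)
gap = 92.3 - 91.8
print(f"gap = {gap:.1f} points, standard error of the gap = {se_gap:.2f} points")
```

Under these assumptions the standard error of the gap is roughly 2.2 points, so a 0.5-point difference sits well inside sampling noise.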

How SWE-bench Lite is scored

The grading procedure for SWE-bench Lite is deterministic. Each model response is checked against the canonical rubric. For SWE-bench Lite that means applying the model's patch and running the repository's test suite; math benchmarks instead parse the final answer, and multiple-choice reasoning benchmarks match the answer letter. The pipeline is reproducible: given the same dataset, decoding config, and model checkpoint, you should get the same score (modulo non-determinism in the inference layer itself).
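
For SWE-bench-style tasks, an instance counts as resolved only if the tests the reference fix is meant to repair now pass (the dataset's FAIL_TO_PASS list) and the tests that already passed still pass (PASS_TO_PASS). A minimal sketch of that rule follows; the function names and the dict-of-outcomes input shape are assumptions for illustration, not the upstream harness API.

```python
def is_resolved(
    outcomes: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True if every required test passed after the model's patch was applied.

    `outcomes` maps test name -> passed; a missing test counts as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(outcomes.get(test, False) for test in required)


def resolved_rate(results: list[bool]) -> float:
    """Score as the percentage of instances resolved."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Two hypothetical instances: the second patch breaks an existing test.
instance_results = [
    is_resolved({"test_fix": True, "test_old": True}, ["test_fix"], ["test_old"]),
    is_resolved({"test_fix": True, "test_old": False}, ["test_fix"], ["test_old"]),
]
print(resolved_rate(instance_results))
```

Counting a missing test as a failure is a deliberate choice in this sketch: a patch that prevents a test from running should not be scored as a pass.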

The decoding configuration matters more than most people realize. SWE-bench Lite typically runs at temperature 0.0 — that's the canonical setting. Running at temperature 0.7 or 1.0 changes both the expected score and its variance. Anyone reporting SWE-bench Lite numbers should disclose temperature, max tokens, and any system prompt verbatim.
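
One way to make that disclosure checkable is to record the decoding settings as data and compare them against the canonical values quoted above. The `DecodingConfig` class and `deviations` helper are hypothetical names for this sketch, and the empty canonical system prompt is an assumption, since the text does not specify one.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    temperature: float
    max_tokens: int
    system_prompt: str  # disclosed verbatim; empty string means none


# Values quoted in the text: temperature 0.0, max tokens 8192.
# Assumption: no system prompt in the canonical setup.
CANONICAL = DecodingConfig(temperature=0.0, max_tokens=8192, system_prompt="")


def deviations(reported: DecodingConfig, canonical: DecodingConfig = CANONICAL) -> dict:
    """Return {field: (reported, canonical)} for every setting that differs."""
    rep, can = asdict(reported), asdict(canonical)
    return {key: (rep[key], can[key]) for key in rep if rep[key] != can[key]}


vendor_run = DecodingConfig(temperature=0.7, max_tokens=8192, system_prompt="")
print(deviations(vendor_run))
```

An empty result means the run used the canonical settings; anything else is a deviation that should appear next to the published score.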

Common pitfalls in SWE-bench Lite reporting

The same score can mean very different things depending on how it was produced. Here are the failure modes that show up most often when comparing SWE-bench Lite numbers across labs and vendors:

Training-data contamination: many benchmarks predate the model under test, and verbatim solutions can sit in the training mix.
Test harness drift: small changes to the runner (timeouts, sandbox, Python version) shift scores by 1-3 percentage points.
Pass@k inflation: quoting pass@10 against another lab's pass@1 hides 5-15 points of headroom.
Sample-set selection: running on 20 problems instead of the full 300 changes both the score and its variance.
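
The pass@k point is easy to check numerically. The sketch below uses the unbiased pass@k estimator that is widely used for code benchmarks; whether a given vendor computes pass@k this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # at least one correct sample. comb(n - c, k) is 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved in 2 of 10 samples looks very different at k=1 and k=10.
print(f"pass@1  = {pass_at_k(10, 2, 1):.2f}")
print(f"pass@10 = {pass_at_k(10, 2, 10):.2f}")
```

The same ten samples yield 0.20 at k=1 and 1.00 at k=10, which is why the two metrics cannot be compared directly.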

None of these are theoretical — they're documented patterns across vendor announcements over the last three years. The cure is methodology disclosure plus replay capability: every claim should ship with the exact runner version, the random seed, the system prompt, the decoding config, and a Merkle root over the transcripts.
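
A Merkle root over the transcripts can be sketched as follows. The text does not specify Benchlist's tree construction, so the choices here (SHA-256, prefix bytes to separate leaves from interior nodes, duplicating the last node on odd levels) are assumptions made for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transcripts: list[str]) -> str:
    """Hex Merkle root over transcripts, in the order given."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Leaf and interior hashes use different prefixes so a leaf can never
    # be confused with an interior node.
    level = [_h(b"\x00" + t.encode("utf-8")) for t in transcripts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


root = merkle_root(["transcript 1", "transcript 2", "transcript 3"])
tampered = merkle_root(["transcript 1", "transcript 2 (edited)", "transcript 3"])
print(root != tampered)  # True: any edit to any transcript changes the root
```

Publishing the root alongside the score commits the publisher to one specific set of transcripts without having to publish all of them up front.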

Reading a published SWE-bench Lite score critically

When a vendor announcement says "Model X scores 87.4 on SWE-bench Lite," ask:

Was the score on the full 300-problem set, or a sub-sample?
Was it pass@1, pass@10, or self-consistency?
What was the temperature, max-token cap, and system prompt?
What runner version was used? Did the lab patch the upstream evaluator?
Are the transcripts available so anyone can re-grade them?
Is there a cryptographic receipt — a signature, Merkle root, or on-chain anchor — proving the transcripts are the ones that produced the score?

If the answer to any of these is "we don't disclose," treat the number as marketing copy.
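
The checklist above can be encoded as a simple completeness check on a published claim. The field names and the `missing_disclosures` helper are hypothetical; they only mirror the six questions in the list.

```python
REQUIRED_DISCLOSURES = (
    "problem_set",      # full 300-problem set or a sub-sample
    "metric",           # pass@1, pass@10, or self-consistency
    "decoding_config",  # temperature, max-token cap, system prompt
    "runner_version",   # including any patches to the upstream evaluator
    "transcripts",      # available for re-grading
    "receipt",          # signature, Merkle root, or on-chain anchor
)


def missing_disclosures(claim: dict) -> list[str]:
    """Return the checklist items a published claim leaves unanswered."""
    return [key for key in REQUIRED_DISCLOSURES if not claim.get(key)]


claim = {
    "problem_set": "full 300",
    "metric": "pass@1",
    "decoding_config": {"temperature": 0.0, "max_tokens": 8192},
}
gaps = missing_disclosures(claim)
print("marketing copy" if gaps else "verifiable", gaps)
```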

Ship a SWE-bench Lite score nobody can challenge

Benchlist runs SWE-bench Lite (and 49 other benchmarks) inside a sandboxed runner, captures every transcript, builds a Merkle commitment, and signs the result with an Ed25519 attestor key. The score lands at a public verify URL anyone can replay in a browser — and you can opt into an Aligned Layer ZK anchor on Ethereum L1 for a buyer who needs a trustless receipt.

How to run SWE-bench Lite on Benchlist

The simplest path is the hosted runner — POST a job and we email the verify URL when it completes:

curl -X POST https://benchlist.ai/api/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-lite",
    "runs": 1,
    "limit": 20,
    "proof_system": "signed",
    "inference_api_key": "managed"
  }'

The response includes a run_id and a verify_url. Within a couple of minutes the worker publishes the result, the email lands in your inbox, and the verify page renders the full transcript tree and checks the Ed25519 signature live in the browser.
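
The same request can be built from Python using only the standard library. This sketch reuses the endpoint, headers, and payload from the curl example; it constructs the request without sending it, and the `build_run_request` name and the fallback key are placeholders.

```python
import json
import os
import urllib.request

payload = {
    "service": "anthropic-claude",
    "model": "claude-sonnet-4.5",
    "benchmark": "swe-bench-lite",
    "runs": 1,
    "limit": 20,
    "proof_system": "signed",
    "inference_api_key": "managed",
}


def build_run_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        "https://benchlist.ai/api/v1/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(os.environ.get("BENCHLIST_KEY", "test-key"))

# To submit for real, uncomment the lines below. The response field names
# (run_id, verify_url) are taken from the text above.
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["run_id"], body["verify_url"])
```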

For self-hosted runs, install benchlist-runner via pip, point it at your inference key, and let it produce a signed run.json you can submit through the same API. The runner is open source, the schema is documented at /api/v1, and every step of the pipeline is reproducible offline.

SWE-bench Lite on the Benchlist registry

Every signed SWE-bench Lite run posted to Benchlist is permanently indexed at /benchmarks/swe-bench-lite. The page ranks services and models by score, links to transcripts, and surfaces dispute history. Verified-on-chain runs (those with an Aligned Layer batch anchor) get a distinct chip; signed-only runs are clearly marked.

Self-reported scores from vendor announcements that don't ship transcripts get a "Self-reported" badge so buyers can see the trust gap at a glance. If a vendor wants to upgrade a listing to Attested, it posts a signed run via /api/v1/run and the registry replaces the self-reported number automatically.

FAQ

What does the SWE-bench Lite benchmark measure?

The SWE-bench Lite benchmark measures a model's ability to resolve real GitHub issues by producing a patch that is scored deterministically against the repository's test suite. It uses accuracy (the percentage of problems resolved) as its primary metric across 300 problems.

How is SWE-bench Lite scored?

Each problem in SWE-bench Lite is graded by a deterministic scorer. The final score is reported as a percentage of problems passed. The dataset license is CC0-1.0.

What is the contamination risk for SWE-bench Lite?

Contamination risk for SWE-bench Lite is rated medium. Medium-risk benchmarks may have partial leakage but are still informative when used alongside contamination-aware splits.

How much does running SWE-bench Lite cost in API calls?

A single full run of SWE-bench Lite costs roughly $25 in inference fees on a frontier model. Smaller, cheaper models reduce this by a factor of 5 to 20.
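
The arithmetic behind those figures, as a quick sketch. The only inputs are the numbers quoted above; the per-problem figure assumes cost is spread evenly across the 300 problems, which is a simplification.

```python
FULL_RUN_COST_USD = 25.0  # figure quoted in the text, frontier model
NUM_PROBLEMS = 300

per_problem = FULL_RUN_COST_USD / NUM_PROBLEMS
# A 5-20x reduction for smaller models gives this range for a full run.
small_model_low = FULL_RUN_COST_USD / 20
small_model_high = FULL_RUN_COST_USD / 5

print(f"per problem: ${per_problem:.3f}")
print(f"smaller models, full run: ${small_model_low:.2f} to ${small_model_high:.2f}")
```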

How do I verify a published SWE-bench Lite score is real?

Use Benchlist's signed-attestation system. Run the benchmark via benchlist run swe-bench-lite or POST /api/v1/run. The result includes a Merkle root over every transcript, an Ed25519 signature from the attestor, and an optional Aligned Layer ZK anchor. Anyone can verify the signature in the browser.

What are the canonical decoding parameters for SWE-bench Lite?

Per the catalog, SWE-bench Lite runs at temperature 0.0 with a max-tokens cap of 8192. Deviating from these without disclosure makes scores incomparable.