swe-bench agent-harness code-generation benchmark llm-architecture

Harness design outperforms model upgrades on SWE-Bench

A well-engineered adapter layer can deliver a 54.3-point Pass@1 gain on the same model, an effect comparable to or larger than swapping LLMs entirely.

Summary

Most teams chase larger models while treating harness architecture as fixed plumbing. Optimizing patch extraction, workspace contracts, and diff adapters is cheaper and faster than model scaling, and it directly controls agent reliability on code tasks.
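To make "patch extraction" concrete, here is a minimal sketch of one possible strategy: pull a unified diff out of a model response, first from a fenced block and then from raw text. The names `extract_patch` and `_looks_like_diff` are hypothetical and do not come from the brief or from any SWE-Bench harness; a production adapter would likely need more fallbacks than this.

```python
import re
from typing import Optional

# Hypothetical sketch: these names are illustrative, not from any real harness.
FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
FENCE_RE = re.compile(FENCE + r"(?:diff|patch)?\n(.*?)" + FENCE, re.DOTALL)


def _looks_like_diff(text: str) -> bool:
    """Cheap structural check: file headers plus at least one hunk header."""
    has_headers = re.search(r"^--- .+\n\+\+\+ .+$", text, re.MULTILINE) is not None
    has_hunk = re.search(
        r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", text, re.MULTILINE
    ) is not None
    return has_headers and has_hunk


def extract_patch(response: str) -> Optional[str]:
    """Return the first unified diff found in a model response, or None."""
    # Strategy 1: a fenced block whose body looks like a unified diff.
    for match in FENCE_RE.finditer(response):
        body = match.group(1)
        if _looks_like_diff(body):
            return body.strip("\n") + "\n"

    # Strategy 2: raw diff text mixed into prose. Keep everything from the
    # first header line onward (trailing prose is not trimmed in this sketch).
    lines = response.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("diff --git ") or line.startswith("--- "):
            candidate = "\n".join(lines[i:])
            if _looks_like_diff(candidate):
                return candidate.strip("\n") + "\n"

    # Nothing usable: the caller decides whether to retry or fail the task.
    return None
```

Returning `None` instead of a best guess is a deliberate choice in this sketch: it lets the harness count extraction failures separately from patches that apply but fail the tests.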

Why it matters

Implementation verdict

Replace your baseline harness with a modular, cost-aware adapter before buying a bigger LLM. This requires systematic testing of workspace contracts and patch-extraction strategies. It is worth implementing now: the gains are large and the work is localized to your agent layer, not model training.
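One way to keep that testing systematic is to score every adapter variant on the same tasks with the same model and report the difference in percentage points. The sketch below assumes per-task results are available as booleans; `pass_at_1` and `gain_pp` are hypothetical helper names, and the toy result list is invented for illustration. The final line uses the two Pass@1 figures quoted in the Sources list of this brief.

```python
from typing import List


def pass_at_1(results: List[bool]) -> float:
    """Pass@1 as a percentage: share of tasks resolved on the first attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


def gain_pp(baseline: float, candidate: float) -> float:
    """Difference between two Pass@1 scores, in percentage points."""
    return round(candidate - baseline, 1)


# Toy data (invented): one task out of four resolved.
print(pass_at_1([True, False, False, False]))  # 25.0

# Figures quoted in the Sources list: minimal direct-diff vs. full adapter.
print(gain_pp(19.1, 73.4))  # 54.3
```

Reporting the gain in percentage points rather than as a relative increase keeps harness comparisons on the same scale as the model-versus-harness figures in the Sources list.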

Sources

1. A well-engineered adapter lifts Pass@1 by over 50 percentage points while keeping the same model.
2. A minimal direct-diff adapter scores 19.1% Pass@1, but the full adapter reaches 73.4%, a 54.3-point improvement generated solely by harness tweaks.
3. Model choice adds 29.4 pp whereas harness choice adds 27.4 pp.
4. Teams should prioritize a modular, cost-aware adapter layer before investing in larger LLMs.

Dev Signal
