DecisionBench measures router fidelity across agentic delegation
A new benchmark suite separates delegation routing quality (fidelity-at-1 of 7.5% to 29.5%) from end-task quality, and finds that the delivery channel for model descriptions (on-demand tool vs. preloaded description) matters more for model selection than the description content.
May 27, 2026
Summary
DecisionBench scores delegation routing separately from end-task quality. Across its four awareness conditions, mean end-task quality is statistically indistinguishable, yet routing fidelity-at-1 ranges from 7.5% to 29.5%, and a counterfactual ceiling places perfect delegation 15 to 31 percentage points above measured performance on every suite.
Why it matters
If you're building multi-model orchestration systems, you need to measure routing decisions independently of task outcomes: quality-only evals hide whether your router is actually picking the right model. DecisionBench gives you a substrate for testing learned routers, adaptive profiles, and delegation strategies against 23k task instances with normalized metrics.
Implementation verdict
This replaces ad-hoc delegation benchmarking with a standardized reference harness covering GAIA, tau-bench, and BFCL. Adopting it requires instrumenting your agentic workflow with a call_model interface and, optionally, peer profiles. Worth adopting now if you're evaluating orchestration methods: the released substrate and 220 per-condition run archives let you baseline immediately without reproducing the authors' sweep.
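As a rough illustration of what "instrumenting with a call_model interface" could look like, here is a minimal sketch. The brief names `call_model` but gives no signature, so the signature, the `DelegationRecord` and `DelegationLog` types, the `make_call_model` helper, and the stub backends are all hypothetical and are not DecisionBench's actual API. The idea is only that every delegation passes through one function that records which model was chosen before forwarding the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DelegationRecord:
    """One routing decision (hypothetical schema, not from DecisionBench)."""
    task_id: str
    chosen_model: str
    prompt: str


@dataclass
class DelegationLog:
    records: List[DelegationRecord] = field(default_factory=list)


def make_call_model(
    backends: Dict[str, Callable[[str], str]],
    log: DelegationLog,
    task_id: str,
) -> Callable[[str, str], str]:
    """Build a call_model(model, prompt) function that logs each delegation."""

    def call_model(model: str, prompt: str) -> str:
        if model not in backends:
            raise KeyError(f"unknown model: {model}")
        # Record the routing decision before delegating, so it can be
        # scored later independently of the end-task outcome.
        log.records.append(DelegationRecord(task_id, model, prompt))
        return backends[model](prompt)

    return call_model


# Stub backends standing in for real model clients (assumption).
backends = {
    "small": lambda p: f"small:{p}",
    "large": lambda p: f"large:{p}",
}
log = DelegationLog()
call_model = make_call_model(backends, log, task_id="t1")
reply = call_model("large", "plan the trip")
```

Keeping the log separate from the model reply is what makes it possible to evaluate the routing choice on its own, which is the separation the brief argues for.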
Sources
1. Mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal.
2. Routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content.
3. A counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods.
4. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
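To make the two headline metrics concrete, here is a small sketch under stated assumptions. The brief does not define fidelity-at-1 or the counterfactual ceiling precisely, so this assumes fidelity-at-1 is the share of tasks where the routed model is a top-scoring model for that task (ties count as hits), and that the ceiling is the mean quality an oracle router would reach by always picking the best model. The function names and the toy quality table are illustrative and are not taken from DecisionBench.

```python
from typing import Dict

# quality[task][model] = end-task score if that model had handled the task
Quality = Dict[str, Dict[str, float]]


def fidelity_at_1(chosen: Dict[str, str], quality: Quality) -> float:
    """Share of tasks where the chosen model ties the best available score."""
    hits = sum(
        1
        for task, model in chosen.items()
        if quality[task][model] == max(quality[task].values())
    )
    return hits / len(chosen)


def mean_quality(chosen: Dict[str, str], quality: Quality) -> float:
    """Mean end-task score actually obtained by the router's choices."""
    return sum(quality[t][m] for t, m in chosen.items()) / len(chosen)


def counterfactual_ceiling(quality: Quality) -> float:
    """Mean score of an oracle that always picks the best model per task."""
    return sum(max(scores.values()) for scores in quality.values()) / len(quality)


# Toy data (made up for illustration): two models, four tasks.
quality = {
    "t1": {"a": 1.0, "b": 0.0},
    "t2": {"a": 0.0, "b": 1.0},
    "t3": {"a": 1.0, "b": 0.0},
    "t4": {"a": 0.0, "b": 1.0},
}
chosen = {"t1": "a", "t2": "a", "t3": "b", "t4": "b"}

fidelity = fidelity_at_1(chosen, quality)
measured = mean_quality(chosen, quality)
headroom = counterfactual_ceiling(quality) - measured
```

In this toy case the router is right on half the tasks, and the gap between the oracle ceiling and the measured mean is the unrealized headroom. With real data, two routers can land on similar mean quality while differing in fidelity, which is why the brief treats the two numbers separately.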
Dev Signal