multi-model-routing agentic-workflows delegation-benchmarking orchestration

DecisionBench measures router fidelity across agentic delegation

New benchmark suite isolates delegation routing quality (7.5%–29.5% fidelity-at-1) from end-task quality, revealing that delivery channel beats description content for model selection.

May 27, 2026

Summary

If you're building multi-model orchestration systems, you need to measure routing decisions independently of task outcomes: quality-only evals hide whether your router is actually picking the right model. DecisionBench gives you the substrate to test learned routers, adaptive profiles, and delegation strategies against 23k task instances with normalized metrics.
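To make the distinction concrete, here is a minimal sketch of scoring routing separately from outcomes. It assumes fidelity-at-1 means "the share of delegation decisions where the router's first choice matches an oracle-best model for the task"; that definition, the Decision record, and the sample data are illustrative assumptions, not DecisionBench's actual schema or metric code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decision:
    """One delegation decision (hypothetical record layout)."""
    task_id: str
    chosen_model: str  # model the router delegated to
    best_model: str    # oracle-best model for the task (assumed annotation)
    quality: float     # end-task score in [0, 1]


def fidelity_at_1(decisions: List[Decision]) -> float:
    """Fraction of decisions whose chosen model equals the oracle-best model."""
    if not decisions:
        raise ValueError("need at least one decision")
    hits = sum(d.chosen_model == d.best_model for d in decisions)
    return hits / len(decisions)


def mean_quality(decisions: List[Decision]) -> float:
    """Average end-task score, ignoring which model was picked."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(d.quality for d in decisions) / len(decisions)


# Two made-up conditions with identical mean quality but different routing.
condition_a = [
    Decision("t1", "small", "large", 0.5),
    Decision("t2", "small", "large", 0.5),
    Decision("t3", "small", "small", 0.5),
    Decision("t4", "small", "coder", 0.5),
]
condition_b = [
    Decision("t1", "large", "large", 0.5),
    Decision("t2", "large", "large", 0.5),
    Decision("t3", "small", "small", 0.5),
    Decision("t4", "small", "coder", 0.5),
]

for name, runs in (("A", condition_a), ("B", condition_b)):
    print(name, mean_quality(runs), fidelity_at_1(runs))
```

In this toy data both conditions report the same mean quality, so a quality-only eval cannot tell them apart, while the routing metric separates them.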

Why it matters

Implementation verdict

DecisionBench replaces ad-hoc delegation benchmarking with a standardized reference harness covering GAIA, tau-bench, and BFCL. Adopting it requires instrumenting your agentic workflow with a call_model interface and, optionally, peer profiles. It is worth adopting now if you're evaluating orchestration methods: the released substrate and 220 run archives let you baseline immediately without reproducing the authors' sweep.
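As a rough idea of what that instrumentation could look like, the sketch below wraps every delegation behind a single call_model method and logs each call for later scoring. Only the name call_model and the notion of peer profiles come from the brief; the class names, the method signature, the profile format, and the stub backends are hypothetical and should not be read as the released harness's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DelegationRecord:
    """One logged delegation (hypothetical layout)."""
    task_id: str
    model: str
    prompt: str
    response: str


@dataclass
class InstrumentedCaller:
    """Routes every delegation through one call site so it can be logged."""
    backends: Dict[str, Callable[[str], str]]      # model name -> callable
    peer_profiles: Optional[Dict[str, str]] = None  # model name -> description
    records: List[DelegationRecord] = field(default_factory=list)

    def call_model(self, task_id: str, model: str, prompt: str) -> str:
        """Delegate to a named model and record the decision."""
        if model not in self.backends:
            raise KeyError(f"unknown model: {model}")
        response = self.backends[model](prompt)
        self.records.append(DelegationRecord(task_id, model, prompt, response))
        return response

    def describe_peers(self) -> str:
        """Render the optional peer profiles as text, one model per line."""
        if not self.peer_profiles:
            return ""
        return "\n".join(
            f"{name}: {desc}" for name, desc in sorted(self.peer_profiles.items())
        )


# Stub backends stand in for real model clients.
caller = InstrumentedCaller(
    backends={
        "small": lambda p: f"small-answer({p})",
        "coder": lambda p: f"coder-answer({p})",
    },
    peer_profiles={"small": "fast, general", "coder": "strong at code"},
)
caller.call_model("t1", "coder", "write a parser")
print(caller.describe_peers())
print([(r.task_id, r.model) for r in caller.records])
```

Keeping the profile text behind its own method means the same descriptions could either be preloaded into the orchestrator's prompt or exposed as an on-demand tool, which is the delivery-channel contrast the cited findings highlight.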

Sources

1. mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal
2. routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content
3. a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods
4. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives
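The counterfactual ceiling in source 3 can be illustrated with a small sketch. It assumes the ceiling is the mean, over tasks, of the best score any available model achieved, and that headroom is the gap between that ceiling and the score of the model actually chosen. The function name, data layout, and numbers are invented for illustration; the paper's own procedure may differ.

```python
from typing import Dict, Tuple


def delegation_headroom(
    scores: Dict[str, Dict[str, float]],  # task_id -> {model: end-task score}
    chosen: Dict[str, str],               # task_id -> model the router picked
) -> Tuple[float, float, float]:
    """Return (measured, ceiling, headroom in percentage points)."""
    if not scores:
        raise ValueError("need at least one task")
    n = len(scores)
    # Measured: score of the model the router actually delegated to.
    measured = sum(scores[t][chosen[t]] for t in scores) / n
    # Ceiling: score a perfect delegator would get by picking the best model.
    ceiling = sum(max(per_model.values()) for per_model in scores.values()) / n
    return measured, ceiling, (ceiling - measured) * 100


# Made-up per-model scores for four tasks.
scores = {
    "t1": {"a": 1.0, "b": 0.0},
    "t2": {"a": 0.5, "b": 1.0},
    "t3": {"a": 0.5, "b": 0.5},
    "t4": {"a": 1.0, "b": 0.5},
}
chosen = {"t1": "b", "t2": "b", "t3": "a", "t4": "b"}

measured, ceiling, headroom_pp = delegation_headroom(scores, chosen)
print(measured, ceiling, headroom_pp)
```

Computing the ceiling this way needs every model's score on every task, which is the kind of data a shared set of per-condition run archives makes available without rerunning the sweep.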

Dev Signal
