Anchor formalizes ERP agent benchmarking with constraint optimization
Anchor generates task harnesses from constraint specs, producing verifiable ground-truth solutions and state-based rewards that eliminate artifact drift in agent evaluation.
May 28, 2026
Summary
The instruction, environment, oracle, and verifier of an agent evaluation task frequently diverge from one another, which makes tasks unsolvable or reward-hackable. Anchor's pipeline produces all four components jointly from a single spec, so you can generate controllable-difficulty benchmarks with known optimal solutions for production ERP workflows.
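To make the "single spec" idea concrete, here is a minimal sketch, not Anchor's actual API. The `ProcurementSpec` class, its fields, and every function name below are hypothetical, and the brute-force oracle stands in for whatever solver a real pipeline would use. The point is only that the instruction, the oracle, and the state-based verifier all read from one spec object, so they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical spec: order exactly `demand` units at minimum total cost."""
    demand: int
    unit_cost: dict   # supplier name -> cost per unit
    capacity: dict    # supplier name -> maximum units available


def render_instruction(spec):
    """Natural-language task text derived from the spec."""
    parts = ", ".join(
        f"{s} (cost {spec.unit_cost[s]}, max {spec.capacity[s]})"
        for s in sorted(spec.unit_cost)
    )
    return f"Order exactly {spec.demand} units at minimum total cost. Suppliers: {parts}."


def is_feasible(spec, orders):
    """Check the explicit constraints: known suppliers, capacity, exact demand."""
    return (
        set(orders) <= set(spec.unit_cost)
        and all(0 <= q <= spec.capacity[s] for s, q in orders.items())
        and sum(orders.values()) == spec.demand
    )


def total_cost(spec, orders):
    return sum(spec.unit_cost[s] * q for s, q in orders.items())


def solve_oracle(spec):
    """Ground-truth optimum by exhaustive search (toy sizes only)."""
    suppliers = sorted(spec.unit_cost)
    best = None
    for qty in product(*(range(spec.capacity[s] + 1) for s in suppliers)):
        orders = dict(zip(suppliers, qty))
        if is_feasible(spec, orders) and (
            best is None or total_cost(spec, orders) < total_cost(spec, best)
        ):
            best = orders
    return best


def verify(spec, end_state):
    """State-based reward: inspects only the final orders, not how the agent got there."""
    feasible = is_feasible(spec, end_state)
    optimal = feasible and total_cost(spec, end_state) == total_cost(spec, solve_oracle(spec))
    return {"feasible": feasible, "optimal": optimal}


spec = ProcurementSpec(demand=10, unit_cost={"A": 3, "B": 5}, capacity={"A": 6, "B": 10})
print(render_instruction(spec))
print(solve_oracle(spec))                 # {'A': 6, 'B': 4}
print(verify(spec, {"A": 0, "B": 10}))    # {'feasible': True, 'optimal': False}
```

Because the verifier scores only the end state, any harness or tool-calling style that arrives at the same final orders earns the same reward. Difficulty can be varied by changing the spec (more suppliers, tighter capacities) without touching the other artifacts by hand.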
Implementation verdict
Anchor replaces manual task-harness construction with parametric generation, but it requires formalizing domain workflows as constraint optimization programs. On ERP-Bench (300 tasks covering procurement and manufacturing), frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%: useful for calibrating agent capability, but not production-ready. Worth evaluating if you own ERP agent evaluation; the task generator and dataset are released.
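The two headline numbers measure different things, and a small sketch may help keep them apart. The trial record format and the `summarize` function below are assumptions for illustration, not ERP-Bench's actual schema. If both rates are computed over the same trials and every optimal trial also satisfies the constraints, then roughly 17.4 / 26.1 ≈ 0.67 of the constraint-satisfying trials are also optimal.

```python
def summarize(trials):
    """Aggregate per-trial verifier results into the two reported rates.

    `trials` is a list of dicts with boolean keys "feasible" and "optimal"
    (a hypothetical record format).
    """
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    feasible = sum(1 for t in trials if t["feasible"])
    # An optimal solution must also be feasible, so count it only in that case.
    optimal = sum(1 for t in trials if t["feasible"] and t["optimal"])
    return {
        "constraint_rate": feasible / n,
        "optimal_rate": optimal / n,
        "optimal_given_feasible": optimal / feasible if feasible else 0.0,
    }


example = [
    {"feasible": True, "optimal": True},
    {"feasible": True, "optimal": False},
    {"feasible": False, "optimal": False},
    {"feasible": False, "optimal": False},
]
print(summarize(example))
```

Reporting the conditional rate alongside the two raw rates shows whether an agent mostly fails at satisfying constraints or at optimizing once it has.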
Sources
- 1. artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires
- 2. frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials
- 3. harness-agnostic environments whose rewards depend solely on end-state business correctness
Dev Signal