agent-evaluation benchmark constraint-optimization erp-systems task-generation

Anchor formalizes ERP agent benchmarking with constraint optimization

Anchor generates task harnesses from constraint specs, producing verifiable ground-truth solutions and state-based rewards that eliminate artifact drift in agent evaluation.

May 28, 2026

Summary

In agent evaluation environments, the instruction, environment, oracle, and verifier frequently diverge, which makes tasks unsolvable or reward-hackable. Anchor's pipeline jointly produces all four components from a single spec, so you can generate benchmarks of controllable difficulty with known optimal solutions for production ERP workflows.
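To make the single-spec idea concrete, here is a minimal sketch in Python. It is not Anchor's actual pipeline or API: the ProcurementSpec class, its fields, and every function name are hypothetical, and the brute-force oracle only stands in for a real constraint solver. The point is that the instruction, the oracle, and the verifier are all derived from the same object, so they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ProcurementSpec:
    """Hypothetical single source of truth for one toy procurement task."""
    demand: int       # total units that must be ordered
    capacity: dict    # supplier name -> maximum units available
    unit_price: dict  # supplier name -> price per unit


def instruction(spec):
    """Render the natural-language task text from the spec."""
    offers = ", ".join(
        f"{s} (up to {spec.capacity[s]} units at {spec.unit_price[s]} each)"
        for s in sorted(spec.capacity)
    )
    return (
        f"Order exactly {spec.demand} units at minimum total cost. "
        f"Suppliers: {offers}."
    )


def cost(spec, order):
    """Total cost of an order, given as supplier -> quantity."""
    return sum(qty * spec.unit_price[s] for s, qty in order.items())


def is_feasible(spec, order):
    """Check the explicit constraints against the end state (the placed order)."""
    return (
        set(order) <= set(spec.capacity)
        and all(0 <= qty <= spec.capacity[s] for s, qty in order.items())
        and sum(order.values()) == spec.demand
    )


def oracle(spec):
    """Brute-force the cheapest feasible order; only viable at toy sizes."""
    suppliers = sorted(spec.capacity)
    ranges = [range(spec.capacity[s] + 1) for s in suppliers]
    candidates = (dict(zip(suppliers, qtys)) for qtys in product(*ranges))
    feasible = [o for o in candidates if is_feasible(spec, o)]
    return min(feasible, key=lambda o: cost(spec, o))


def verify(spec, order):
    """State-based check returning (constraints_met, optimal)."""
    if not is_feasible(spec, order):
        return (False, False)
    return (True, cost(spec, order) == cost(spec, oracle(spec)))


spec = ProcurementSpec(
    demand=10,
    capacity={"A": 6, "B": 8},
    unit_price={"A": 3, "B": 5},
)
print(instruction(spec))
print(oracle(spec))                    # cheapest feasible order
print(verify(spec, {"A": 2, "B": 8}))  # feasible but not cheapest
```

Because verify inspects only the final order, it does not matter which tools or intermediate steps an agent used to place it. In this sketch, difficulty could be varied by changing the number of suppliers or tightening capacities in the spec, and the oracle and verifier would follow automatically.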

Why it matters

Implementation verdict

Replaces manual task harness construction with parametric generation, but requires formalizing domain workflows as constraint optimization programs. ERP-Bench (300 tasks, procurement/manufacturing) shows frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%: useful for calibrating agent capability, but not production-ready. Worth evaluating if you own ERP agent evaluation; the task generator and dataset are released.
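The two headline rates measure different things, and a short Python sketch shows how they could be computed from per-trial verifier outcomes. The summarize function and the trial data below are illustrative assumptions, not ERP-Bench code or results. The sketch also assumes that a fully optimal solution must satisfy the constraints, so the optimal rate can never exceed the constraint rate.

```python
def summarize(trials):
    """Aggregate per-trial outcomes into two rates.

    trials: list of (constraints_met, optimal) boolean pairs, one per trial.
    Assumption: a trial counts as optimal only if its constraints are also met.
    """
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    met = sum(1 for constraints_met, _ in trials if constraints_met)
    best = sum(1 for constraints_met, optimal in trials if constraints_met and optimal)
    return {"constraint_rate": met / n, "optimal_rate": best / n}


# Made-up outcomes for four trials, for illustration only.
toy_trials = [(True, True), (True, False), (False, False), (False, False)]
rates = summarize(toy_trials)
print(rates)
```

The gap between the two rates is the share of trials where an agent produced a valid business state that was not the best available one.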

Sources

1. artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires
2. frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials
3. harness-agnostic environments whose rewards depend solely on end-state business correctness

Dev Signal
