Anchor formalizes ERP agent benchmarking with constraint optimization

Anchor generates task harnesses from constraint specs, producing verifiable ground-truth solutions and state-based rewards that eliminate artifact drift in agent evaluation.

May 28, 2026

Summary

In agent evaluation, the instruction, environment, oracle, and verifier frequently disagree with one another, which makes tasks unsolvable or reward-hackable. Anchor's pipeline produces all four components jointly from a single spec, letting you generate benchmarks with controllable difficulty and known optimal solutions for production ERP workflows.
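To make the single-spec idea concrete, here is a minimal Python sketch in which one toy procurement spec drives the instruction text, the oracle, and a state-based verifier. It is not Anchor's actual API or spec format: `ProcurementSpec`, `Supplier`, `instruction`, `oracle`, `reward`, and the supplier data are all hypothetical names and values chosen for illustration, and the brute-force oracle only works at toy sizes.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Supplier:
    name: str
    unit_price: int
    capacity: int


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical single source of truth for one task (not Anchor's format)."""
    demand: int
    suppliers: tuple[Supplier, ...]


def instruction(spec: ProcurementSpec) -> str:
    """Render the natural-language task from the spec."""
    offers = "; ".join(
        f"{s.name} sells up to {s.capacity} units at {s.unit_price} each"
        for s in spec.suppliers
    )
    return f"Order exactly {spec.demand} units at minimum total cost. {offers}."


def is_feasible(spec: ProcurementSpec, orders: dict[str, int]) -> bool:
    """Check the explicit constraints: known suppliers, capacity, exact demand."""
    known = {s.name: s for s in spec.suppliers}
    if any(name not in known for name in orders):
        return False
    within_capacity = all(
        0 <= qty <= known[name].capacity for name, qty in orders.items()
    )
    return within_capacity and sum(orders.values()) == spec.demand


def cost(spec: ProcurementSpec, orders: dict[str, int]) -> int:
    prices = {s.name: s.unit_price for s in spec.suppliers}
    return sum(prices[name] * qty for name, qty in orders.items())


def oracle(spec: ProcurementSpec) -> dict[str, int]:
    """Brute-force the cheapest feasible order (raises ValueError if none exists)."""
    names = [s.name for s in spec.suppliers]
    ranges = [range(s.capacity + 1) for s in spec.suppliers]
    candidates = (dict(zip(names, qtys)) for qtys in product(*ranges))
    feasible = [c for c in candidates if is_feasible(spec, c)]
    return min(feasible, key=lambda c: cost(spec, c))


def reward(spec: ProcurementSpec, final_orders: dict[str, int]) -> dict[str, bool]:
    """Score only the end state; how the agent got there is ignored."""
    feasible = is_feasible(spec, final_orders)
    optimal = feasible and cost(spec, final_orders) == cost(spec, oracle(spec))
    return {"feasible": feasible, "optimal": optimal}


# Toy data, invented for this example.
SPEC = ProcurementSpec(
    demand=10,
    suppliers=(Supplier("A", 5, 6), Supplier("B", 7, 10), Supplier("C", 6, 3)),
)
```

Because the instruction, oracle, and verifier all read the same `SPEC`, changing a price or capacity updates every artifact at once, which is the property the brief credits with eliminating drift. Calling `reward(SPEC, {"A": 6, "B": 4})` reports a feasible but non-optimal end state, while `reward(SPEC, oracle(SPEC))` reports both flags as true.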

Implementation verdict

Replaces manual task harness construction with parametric generation, but requires formalizing domain workflows as constraint optimization programs. On ERP-Bench (300 tasks, procurement/manufacturing), frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%. The benchmark is useful for calibrating agent capability, though the results fall well short of production readiness. Worth evaluating if you own ERP agent evaluation; the task generator and dataset are released.
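The two headline rates measure different things, and a short Python sketch shows how they could be aggregated from per-trial outcomes. The `summarize` helper, the `feasible` and `optimal` flag names, and the ten toy trials are all invented for illustration; they are not ERP-Bench data or its scoring code. The sketch assumes an optimal trial must also satisfy the constraints.

```python
def summarize(trials: list[dict[str, bool]]) -> dict[str, float]:
    """Aggregate per-trial flags into constraint-satisfaction and optimality rates."""
    n = len(trials)
    feasible = sum(t["feasible"] for t in trials)
    # Count a trial as optimal only if it also satisfies the constraints.
    optimal = sum(t["feasible"] and t["optimal"] for t in trials)
    return {
        "feasible_rate": feasible / n,
        "optimal_rate": optimal / n,
        # Share of constraint-satisfying trials that were also optimal.
        "optimal_given_feasible": optimal / feasible if feasible else 0.0,
    }


# Toy outcomes, invented for this example: 10 trials, 4 feasible, 2 of those optimal.
TOY_TRIALS = (
    [{"feasible": True, "optimal": True}] * 2
    + [{"feasible": True, "optimal": False}] * 2
    + [{"feasible": False, "optimal": False}] * 6
)

STATS = summarize(TOY_TRIALS)
```

Under the same assumption, the reported figures imply that roughly two thirds of constraint-satisfying trials (17.4 / 26.1) also reached the optimum.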

Sources

  1. artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires
  2. frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials
  3. harness-agnostic environments whose rewards depend solely on end-state business correctness

Dev Signal
