agent-monitoring failure-detection llm-agents production-reliability benchmarking

LLM judges fail detecting false agent success

Lightweight TF-IDF detectors outperform LLM judges by 4–8x at catching agents that falsely claim task completion, with 3,300x lower latency.

Summary

Agent silent failures—tasks reported done but actually incomplete—corrupt production monitoring. Relying on LLM judges to catch this costs you latency and misses failures; domain-calibrated statistical detectors are the practical alternative.

Why it matters

Implementation verdict

Replace LLM-based agent completion verification with task-specific TF-IDF triage. Requires baseline labeling on your domain (tau2-bench: AUROC 0.83; AppWorld: 0.95 achieved). Worth deploying now as monitoring layer—no latency penalty, proven higher recall.

Sources

1.False success is common but varies by setting: 45--48% of failures in single-control tau2-bench domains, 3% in dual-control telecom, and 75.8% among AppWorld self-assessing coding-agent trajectories
2.no configuration across 5 judges, 5 prompt strategies, and full task specifications exceeds AUROC 0.65 on tau2-bench
3.Lightweight TF-IDF detectors achieve task-disjoint AUROC 0.83 on tau2-bench and 0.95 on AppWorld, recovering 4--8x more false successes than the best judge at the same flag rate with 3,300x lower latency

Dev Signal

Get briefs like this in your inbox — free, every weekday.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.

Read the full issue →All briefs