Supervise agents like services, not scripts
Three patterns—process supervision, state persistence, timeout bounds—took production agent uptime from 71% to 99.4% without changing agent code.
May 19, 2026
Summary
Agent reliability in production depends entirely on operational infrastructure, not model capability. Crashes are common; the win is recovering in under 30 seconds instead of losing a night of work.
Implementation verdict
The approach replaces naive while-loop agents with supervisord or systemd process management, a SQLite checkpoint after each tool call, and a signal-based timeout around every tool. Wiring it in takes 2–3 hours and is worthwhile immediately if you run agents on infrastructure you control. The author notes that operational overhead becomes significant by month three, at which point managed hosting ($99/mo+) becomes a viable alternative.
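A minimal sketch of the checkpoint-per-tool-call idea, using Python's standard `sqlite3` module. The table layout and the names `open_store`, `save_checkpoint`, `load_checkpoints`, and `run_agent` are assumptions for illustration, not the author's code. It also assumes tool results are JSON-serializable and that the plan is replayed in the same order after a restart.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the checkpoint database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " run_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " tool TEXT NOT NULL,"
        " result TEXT NOT NULL,"
        " PRIMARY KEY (run_id, step))"
    )
    return conn


def save_checkpoint(conn, run_id, step, tool, result):
    """Persist one tool result before the agent moves to the next step."""
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (run_id, step, tool, json.dumps(result)),
        )


def load_checkpoints(conn, run_id):
    """Return completed steps in order so a restarted agent can skip them."""
    rows = conn.execute(
        "SELECT step, tool, result FROM checkpoints"
        " WHERE run_id = ? ORDER BY step",
        (run_id,),
    ).fetchall()
    return [(step, tool, json.loads(result)) for step, tool, result in rows]


def run_agent(conn, run_id, plan, tools):
    """Run each (tool_name, arg) in plan, reusing results saved earlier."""
    done = {step: result for step, _, result in load_checkpoints(conn, run_id)}
    results = []
    for step, (tool_name, arg) in enumerate(plan):
        if step in done:  # finished before the crash; do not pay for it again
            results.append(done[step])
            continue
        result = tools[tool_name](arg)
        save_checkpoint(conn, run_id, step, tool_name, result)
        results.append(result)
    return results
```

With a file path instead of `:memory:`, a supervisor (supervisord or systemd) can restart the process and `run_agent` picks up at the first step without a checkpoint, so completed tool calls are not repeated.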
Sources
- 1. my "agent uptime" went from 71% to 99.4% in a week
- 2. average time-to-recovery on a crash dropped from "next morning when I noticed" to under 30 seconds
- 3. token spend on retries dropped by about 40%
- 4. Checkpoint after every tool call
- 5. Wrap every tool the agent can call in a timeout
- 6. If the same tool times out three times in a row, mark it broken for ten minutes
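The last two quoted rules (a timeout on every tool, and ten minutes out of service after three consecutive timeouts) could be sketched as follows. The names `call_with_timeout`, `GuardedTool`, `ToolTimeout`, and `ToolBroken` are hypothetical, as is the 30-second default. The timeout relies on `SIGALRM`, so this version assumes Unix and calls made from the main thread.

```python
import signal
import time


class ToolTimeout(Exception):
    """Raised when a tool call exceeds its time budget."""


class ToolBroken(Exception):
    """Raised when a tool is skipped because it is marked broken."""


def call_with_timeout(fn, seconds, *args):
    """Run fn(*args); raise ToolTimeout after `seconds` (Unix, main thread)."""
    def on_alarm(signum, frame):
        name = getattr(fn, "__name__", "tool")
        raise ToolTimeout(f"{name} exceeded {seconds}s")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return fn(*args)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


class GuardedTool:
    """Wrap a tool with a timeout and a consecutive-timeout breaker."""

    def __init__(self, fn, timeout=30.0, max_timeouts=3, cooldown=600.0,
                 runner=call_with_timeout, clock=time.monotonic):
        self.fn = fn
        self.timeout = timeout
        self.max_timeouts = max_timeouts  # three in a row, per the rule above
        self.cooldown = cooldown          # ten minutes, in seconds
        self.runner = runner              # injectable so tests can skip signals
        self.clock = clock
        self.streak = 0
        self.broken_until = 0.0

    def __call__(self, *args):
        if self.clock() < self.broken_until:
            raise ToolBroken("tool is marked broken; retry after cooldown")
        try:
            result = self.runner(self.fn, self.timeout, *args)
        except ToolTimeout:
            self.streak += 1
            if self.streak >= self.max_timeouts:
                self.broken_until = self.clock() + self.cooldown
                self.streak = 0
            raise
        # Other exceptions propagate unchanged and leave the streak as it was.
        self.streak = 0  # a successful call ends the streak
        return result
```

Failing fast with `ToolBroken` lets the agent pick another tool or give up on the step instead of spending tokens retrying a call that keeps hanging, which is plausibly where the reported drop in retry spend comes from.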
Dev Signal