Supervise agents like services, not scripts

Three patterns—process supervision, state persistence, timeout bounds—took production agent uptime from 71% to 99.4% without changing agent code.

May 19, 2026

Summary

Agent reliability in production depends entirely on operational infrastructure, not model capability. Crashes are common; the win is recovering in under 30 seconds instead of losing a night of work.

Why it matters

Agent reliability in production depends entirely on operational infrastructure, not model capability. Crashes are common; the win is recovering in under 30 seconds instead of losing a night of work.

Implementation verdict

Replaces naive while-loop agents with supervisord/systemd management, SQLite checkpoints per tool call, and signal-based timeouts on every tool. Requires 2–3 hours to wire in; worthwhile immediately if running agents on any infrastructure you control. Author notes operational overhead becomes significant by month three, making managed hosting ($99/mo+) viable alternative.

Sources

  1. 1.my "agent uptime" went from 71% to 99.4% in a week
  2. 2.average time-to-recovery on a crash dropped from "next morning when I noticed" to under 30 seconds
  3. 3.token spend on retries dropped by about 40%
  4. 4.Checkpoint after every tool call
  5. 5.Wrap every tool the agent can call in a timeout
  6. 6.If the same tool times out three times in a row, mark it broken for ten minutes

Dev Signal

Get briefs like this in your inbox — free, 3x a week.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.