Monitoring AI Agents in Production: What to Watch and When to Panic
The health monitor checks every 60 seconds — service heartbeats, DB integrity, failure rates. Here's what to monitor and what the alert thresholds should be.
You can't tail -f your way out of this
Autonomous agents fail differently than web services. A web server goes down and returns 500s. An agent fails silently — it runs, reports success, and produces garbage. Or it enters a retry loop that looks healthy from the outside. Or one agent's failure cascades through task handoffs to three others.
Monitoring agents in production requires watching different signals than traditional infrastructure.
The health monitor
Klawty's health monitor runs every 60 seconds. It checks six categories:
Service heartbeats — Is each agent process running? Last heartbeat timestamp within 2x the expected cycle time. If Leila's cycle is 15 minutes and her last heartbeat was 35 minutes ago, something is wrong.
Database integrity — SQLite WAL file size, journal mode, table counts. A WAL file over 100MB indicates a checkpoint issue. A missing table indicates corruption. Both are critical alerts.
Disk space — Log files, session logs, and database files grow continuously. The monitor alerts at 80% disk usage and again at 90%. Agents generate roughly 2GB/month of logs if left unchecked.
Agent staleness — How long since each agent completed a task? An agent that hasn't completed anything in 3x its cycle time might be stuck in a retry loop, blocked by a circuit breaker, or running a task that's taking too long.
Failure rates — Per-agent failure percentage over the last hour. This is the most important metric.
10% failure rate → warning (yellow)
15% failure rate → critical (red)
25% failure rate → circuit breaker trips automatically
Task queue depth — How many tasks are stuck in in_progress for longer than 30 minutes? A growing queue of stale tasks means the executor is bottlenecked or failing silently.
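In code, the staleness and failure-rate checks reduce to a couple of comparisons. A sketch, with function names that are illustrative rather than Klawty's actual API:

```javascript
// An agent is stale when its last heartbeat is older than 2x its cycle time.
function isStale(lastHeartbeatMs, cycleMs, nowMs = Date.now()) {
  return nowMs - lastHeartbeatMs > 2 * cycleMs;
}

// Map an hourly failure rate onto the thresholds listed above:
// 10% warning, 15% critical, 25% circuit breaker trip.
function failureSeverity(failed, total) {
  if (total === 0) return 'info';
  const rate = failed / total;
  if (rate >= 0.25) return 'trip';
  if (rate >= 0.15) return 'critical';
  if (rate >= 0.10) return 'warning';
  return 'info';
}
```

Leila's example from above works out directly: a 15-minute cycle with a heartbeat 35 minutes old is past the 30-minute cutoff, so `isStale` returns true.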
Alert levels
Not everything deserves a 3 AM notification. The monitor uses three levels:
Info — Logged only. Agent completed a discovery cycle. Circuit breaker reset. Backup completed. You review these in the morning.
Warning — Posted to Discord #war-room. Agent failure rate above 10%. Disk at 80%. WAL file growing. Someone should look at this during business hours.
Critical — Telegram alert to the operator's phone. Service down for more than 2 cycles. Database integrity issue. Failure rate above 15%. Circuit breaker tripped with no recovery after 2 hours. This needs attention now.
// Alert routing
if (level === 'critical') {
  await sendTelegram(operatorChatId, formatAlert(alert));
  await postToDiscord(WAR_ROOM, formatAlert(alert));
} else if (level === 'warning') {
  await postToDiscord(WAR_ROOM, formatAlert(alert));
}
// Info: logged to file only
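formatAlert isn't shown here; a minimal version, assuming alerts are plain objects with level, source, and message fields, might look like:

```javascript
// Hypothetical formatAlert; the real implementation isn't shown in this post.
// Prefixes the level and an ISO timestamp so alerts sort and grep cleanly.
function formatAlert({ level, source, message }) {
  const stamp = new Date().toISOString();
  return `[${level.toUpperCase()}] ${stamp} ${source}: ${message}`;
}
```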
The scorecard
Beyond real-time alerts, the health monitor maintains a per-agent scorecard: tasks completed today, failure rate (7-day rolling), average task duration, LLM cost today, circuit breaker trips this week.
-- Daily scorecard: completions, failures, and average task duration per agent
SELECT
  agent,
  COUNT(CASE WHEN status = 'done' THEN 1 END) AS completed,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed,
  ROUND(AVG(julianday(completed_at) - julianday(started_at)) * 86400, 1) AS avg_seconds
FROM tasks
WHERE created_at > datetime('now', '-24 hours')
GROUP BY agent;
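The 7-day rolling failure rate on the scorecard comes from the same counts over a longer window. A sketch of the rate calculation over rows shaped like the query's output (helper names are mine, not Klawty's):

```javascript
// Failure rate for one scorecard row: { agent, completed, failed, avg_seconds }.
function failureRate({ completed, failed }) {
  const total = completed + failed;
  return total === 0 ? 0 : failed / total;
}

// Agents sorted worst-first, for the top of the scorecard.
function worstAgents(rows) {
  return [...rows]
    .sort((a, b) => failureRate(b) - failureRate(a))
    .map((r) => r.agent);
}
```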
When to panic
Three scenarios that mean something is genuinely broken:
1. Circuit breaker tripped + no recovery in 8 hours — The LLM provider is down or your API key is invalid. Check the provider status page, then check your key balance.
2. Multiple agents failing on the same task type — The tool is broken, not the agents. Read the error in the task metadata and fix the tool definition.
3. Database WAL over 500MB — SQLite checkpointing has stopped. The database will eventually become unreadable. Run PRAGMA wal_checkpoint(TRUNCATE); immediately.
Everything else — individual task failures, transient API errors, one agent's elevated failure rate — the system handles automatically through retries, circuit breakers, and model fallback. That's the point of building the monitoring layer: you know exactly when your attention is required, and the rest of the time, the agents handle it themselves.