Circuit Breakers for AI Agents: Preventing Cascade Failures
How the circuit breaker pattern prevents a failing LLM provider from burning your AI agent budget — with SQLite-backed state and exponential backoff.
When agents fail in loops
At 3 AM on a Tuesday, our marketing agent tried to post a weekly SEO report. The LLM provider was down. The agent retried. Failed. Retried. Failed. Retried 47 times in 12 minutes before we caught it. Cost: $8.40 in failed API calls that returned errors after partial token consumption.
This is the cascade failure problem. One service goes down, and every agent that depends on it hammers the failing endpoint, burning budget and generating noise.
The circuit breaker pattern
The concept comes from electrical engineering. When current exceeds safe limits, the breaker trips and cuts the circuit. In software, Michael Nygard described the pattern in Release It!, and Martin Fowler popularized it for microservices. We adapted it for AI agents.
Klawty's circuit breaker has three states:
- Closed: Normal operation. Failures are counted.
- Open: Tripped. All requests short-circuit immediately with a cached error. No API calls made.
- Half-open: After the backoff period, one test request is allowed through. If it succeeds, the breaker closes. If it fails, it opens again with a longer backoff.
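The three states can be sketched as a small in-memory state machine. This is a minimal illustration, not Klawty's actual implementation: the function names, the field names, and the choice to leave the trip count untouched on success are all assumptions made for the example.

```javascript
// Sketch only: field names and the threshold constant are illustrative assumptions.
const FAILURE_THRESHOLD = 5;

function createBreaker() {
  return { state: 'closed', failureCount: 0, tripCount: 0, backoffUntil: 0 };
}

// Returns true if a real API call may go out at time `now` (ms since epoch).
function allowRequest(breaker, now) {
  if (breaker.state === 'closed') return true;
  if (breaker.state === 'half_open') return false; // the single probe is already in flight
  if (now < breaker.backoffUntil) return false;    // open: short-circuit locally
  breaker.state = 'half_open';                     // backoff elapsed: allow one probe
  return true;
}

function recordSuccess(breaker) {
  breaker.state = 'closed';
  breaker.failureCount = 0; // any success breaks the run of consecutive failures
}

// `backoffMsForTrip` maps the trip number (1, 2, 3, ...) to a backoff in milliseconds.
function recordFailure(breaker, now, backoffMsForTrip) {
  breaker.failureCount += 1;
  const probeFailed = breaker.state === 'half_open';
  if (probeFailed || breaker.failureCount >= FAILURE_THRESHOLD) {
    breaker.state = 'open';
    breaker.tripCount += 1;
    breaker.backoffUntil = now + backoffMsForTrip(breaker.tripCount);
  }
}
```

The caller checks `allowRequest` before every API call and reports the outcome with `recordSuccess` or `recordFailure`. A failed probe in the half-open state re-opens the breaker immediately, without waiting for another run of failures.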
The numbers
Five consecutive failures trip the breaker: five in a row, not five in total. A success in between resets the count, so intermittent errors don't trip it. The backoff doubles with each trip, up to a cap of 8 hours:
Trip 1: 1 hour
Trip 2: 2 hours
Trip 3: 4 hours
Trip 4: 8 hours (max)
Every day at midnight CET, all circuit breakers reset. This prevents a transient outage from permanently disabling an agent.
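The schedule above reduces to a one-line formula: double the base backoff for each trip and clamp at the maximum. Here is a minimal sketch; the function and constant names are hypothetical, and holding trips beyond the fourth at 8 hours is an assumption based on the "(max)" label.

```javascript
// Hypothetical helper names; values taken from the schedule above.
const BASE_BACKOFF_HOURS = 1;
const MAX_BACKOFF_HOURS = 8;

// Backoff in hours for the Nth trip (1-based): 1, 2, 4, 8, then capped at 8.
function backoffHours(tripCount) {
  if (!Number.isInteger(tripCount) || tripCount < 1) {
    throw new RangeError('tripCount must be a positive integer');
  }
  return Math.min(BASE_BACKOFF_HOURS * 2 ** (tripCount - 1), MAX_BACKOFF_HOURS);
}

// ISO-8601 UTC timestamp suitable for a TEXT column such as backoff_until.
function backoffUntil(tripCount, now = new Date()) {
  const ms = backoffHours(tripCount) * 60 * 60 * 1000;
  return new Date(now.getTime() + ms).toISOString();
}

console.log([1, 2, 3, 4, 5].map(backoffHours)); // [ 1, 2, 4, 8, 8 ]
```

Storing the deadline as an ISO-8601 UTC string keeps the comparison unambiguous when more than one process reads it.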
Single source of truth
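The critical design decision: the circuit breaker state lives in SQLite, not in memory. The agent_circuit_breaker table is shared between the task executor and the orchestrator, so both processes read the same state and never disagree about whether an agent's breaker is open. No split-brain. SQLite also serializes writes, which keeps concurrent updates from racing as long as each update runs as a single statement or transaction.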
CREATE TABLE agent_circuit_breaker (
  agent           TEXT PRIMARY KEY,
  state           TEXT DEFAULT 'closed',  -- closed | open | half_open
  failure_count   INTEGER DEFAULT 0,
  last_failure_at TEXT,
  trip_count      INTEGER DEFAULT 0,
  backoff_until   TEXT,
  updated_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
When the executor picks up a task for agent zara, it checks the breaker first:
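const breaker = taskDb.cbGetBreaker('zara');
// Optional chaining guards against a missing row, in case cbGetBreaker returns nothing for an agent that has never failed
if (breaker?.state === 'open' && new Date() < new Date(breaker.backoff_until)) {
  // Skip this agent entirely: no API call, no cost
  return { skipped: true, reason: 'circuit_breaker_open' };
}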
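The check above covers the open state only. A slightly fuller sketch of the executor's decision might also handle a missing row, an elapsed backoff, and a probe that is already in flight. The function name `decideExecution`, the `action` values, and the `probe_in_flight` reason are hypothetical; only the column names and `circuit_breaker_open` come from the snippets above.

```javascript
// Hypothetical pure helper: takes a row shaped like agent_circuit_breaker
// and returns what the executor should do. It does not touch the database.
function decideExecution(row, now = new Date()) {
  // No row yet, or a closed breaker: run the task normally.
  if (!row || row.state === 'closed') return { action: 'run' };

  // Assumption: a persisted half_open state means another process owns the probe.
  if (row.state === 'half_open') return { action: 'skip', reason: 'probe_in_flight' };

  // Open: compare against the stored deadline (expects an ISO-8601 string).
  const until = Date.parse(row.backoff_until);
  if (Number.isNaN(until) || now.getTime() >= until) {
    // Backoff elapsed (or no usable deadline): let one test request through.
    return { action: 'probe' };
  }
  return { action: 'skip', reason: 'circuit_breaker_open' };
}
```

When the result is `probe`, the caller would need to persist the `half_open` state before making the request, so that a second process reading the table does not send a probe of its own.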
Why this matters for cost
Without a circuit breaker, a 4-hour LLM outage with 8 agents on 15-minute cycles generates 128 failed API calls (8 agents × 4 cycles per hour × 4 hours). At an average of $0.06 per failed call (partial token consumption), that's $7.68 wasted.
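With the circuit breaker, the first agent to hit 5 failures trips the breaker. Total failed calls: 5. Total cost: $0.30. The remaining 123 calls are short-circuited locally, with zero API spend. The only extra cost is the occasional half-open test request, which adds one failed call each time it runs while the provider is still down.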
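The arithmetic behind these figures is easy to reproduce. The sketch below uses the article's own inputs and its count of five failed calls with the breaker in place; half-open probes are not modeled, and the function and parameter names are made up for the example. Costs are computed in cents to avoid floating-point rounding surprises.

```javascript
// Hypothetical helper: compares failed-call spend with and without a breaker.
function outageCost({ outageHours, agents, cycleMinutes, centsPerFailedCall, failedCallsWithBreaker }) {
  const cyclesPerAgent = Math.floor((outageHours * 60) / cycleMinutes);
  const callsWithout = agents * cyclesPerAgent;
  const dollars = (calls) => ((calls * centsPerFailedCall) / 100).toFixed(2);
  return {
    callsWithout,
    costWithout: dollars(callsWithout),
    costWith: dollars(failedCallsWithBreaker),
    shortCircuited: callsWithout - failedCallsWithBreaker,
  };
}

const result = outageCost({
  outageHours: 4,
  agents: 8,
  cycleMinutes: 15,
  centsPerFailedCall: 6,
  failedCallsWithBreaker: 5, // the article's figure; probes not included
});
console.log(result);
// { callsWithout: 128, costWithout: '7.68', costWith: '0.30', shortCircuited: 123 }
```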
The circuit breaker also sends a Telegram alert when it trips, so you know something is wrong without watching logs. When the half-open test succeeds, you get a recovery notification. The system heals itself, and you sleep through the night.