production data

Production benchmarks.

Real numbers from a real deployment. 8 agents, 1,000+ tasks per month, running 24/7 on a Mac mini in Luxembourg. No synthetic tests, no cherry-picked results — just what the system actually costs and how it actually performs.

cost efficiency

91% cheaper than single-model

5-tier LLM routing sends each task to the cheapest model that can handle it. Health checks don't need GPT-4. Strategic decisions don't run on Qwen.

cost-overview.sh

| Metric | Value |
| --- | --- |
| Monthly tasks completed | 1,018+ |
| Monthly LLM cost | ~€37 ($40.33) |
| Cost per task | ~€0.036 |
| Daily average | ~€1.23 |
| Coordinated agents | 8 |
| LLM providers used | 5 |

Data period: 30-day rolling average, production deployment
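
The per-task and daily figures follow directly from the monthly totals. A minimal sketch of that arithmetic, assuming the rounded €37 and 1,018 values from the table; the helper names `roundTo`, `costPerTask`, and `dailyAverage` are illustrative and not part of Klawty:

```javascript
// Round a number to a fixed count of decimal places.
function roundTo(value, decimals) {
  const factor = 10 ** decimals;
  return Math.round(value * factor) / factor;
}

// Monthly LLM spend divided by tasks completed in the same window.
function costPerTask(monthlyCost, taskCount) {
  return roundTo(monthlyCost / taskCount, 3);
}

// Monthly LLM spend spread over the 30-day rolling window.
function dailyAverage(monthlyCost, days = 30) {
  return roundTo(monthlyCost / days, 2);
}

console.log(costPerTask(37, 1018)); // 0.036 (EUR per task)
console.log(dailyAverage(37));      // 1.23 (EUR per day)
```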

llm-routing --tiers

| Tier | Model | Cost/M tokens | % of tasks | Use case |
| --- | --- | --- | --- | --- |
| Nano | Qwen 3.5 Flash | $0.07 | ~35% | Health checks, status, simple routing |
| Workhorse | DeepSeek V3.2 | $0.27 | ~30% | Email triage, CRM updates, routine tasks |
| Capable | Gemini 3 Flash | $1.25 | ~20% | Content drafting, analysis, SEO |
| Power | Kimi K2.5 | $5.00 | ~12% | Complex reasoning, proposals, multi-step |
| Premium | Claude Sonnet 4.6 | $15.00 | ~3% | Strategic decisions, BI reports, edge cases |

Models are swappable via openclaw.json; these are the current defaults.
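
"Cheapest model that can handle it" can be sketched as a lookup over tiers sorted by price. The tier names, models, and prices below come from the table; the numeric `level` field, the task-type keys, and the `routeTask` function are assumptions made for illustration and do not reflect Klawty's actual routing code or the openclaw.json schema:

```javascript
// Tiers from the table above, ordered cheapest first (USD per million tokens).
// `level` is a hypothetical capability rank added for this sketch.
const TIERS = [
  { name: "Nano", model: "Qwen 3.5 Flash", costPerMTokens: 0.07, level: 1 },
  { name: "Workhorse", model: "DeepSeek V3.2", costPerMTokens: 0.27, level: 2 },
  { name: "Capable", model: "Gemini 3 Flash", costPerMTokens: 1.25, level: 3 },
  { name: "Power", model: "Kimi K2.5", costPerMTokens: 5.0, level: 4 },
  { name: "Premium", model: "Claude Sonnet 4.6", costPerMTokens: 15.0, level: 5 },
];

// Hypothetical task types mapped to the minimum capability level they need.
const REQUIRED_LEVEL = new Map([
  ["health-check", 1],
  ["email-triage", 2],
  ["content-draft", 3],
  ["complex-proposal", 4],
  ["strategic-decision", 5],
]);

// Return the cheapest available tier whose level meets the task's requirement.
// Unknown task types fall back to the highest requirement to stay on the safe side.
function routeTask(taskType, unavailable = new Set()) {
  const required = REQUIRED_LEVEL.get(taskType) ?? 5;
  const tier = TIERS.find(
    (t) => t.level >= required && !unavailable.has(t.name)
  );
  if (!tier) {
    throw new Error(`No available tier for task type "${taskType}"`);
  }
  return tier;
}

console.log(routeTask("health-check").model);
console.log(routeTask("email-triage", new Set(["Workhorse"])).model);
```

Because the list is sorted by price, the first match is also the cheapest one, and marking a tier unavailable escalates the task to the next tier up.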

cost-compare --naive-vs-smart

| Scenario | GPT-4 only | Klawty 5-tier | Savings |
| --- | --- | --- | --- |
| Health check | $0.15 | $0.005 | 97% |
| Email triage | $0.50 | $0.07 | 86% |
| Content draft | $1.50 | $0.45 | 70% |
| Complex proposal | $5.00 | $5.00 | 0% |
| Monthly (1,000 tasks) | ~$450 | ~$40 | 91% |

GPT-4 estimates based on typical token usage per task type
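
The savings column is the relative difference between the two cost columns. A short sketch that recomputes it from the table's figures; `savingsPercent` is an illustrative helper, not a Klawty command:

```javascript
// Percentage saved when moving from `baseline` cost to `routed` cost.
function savingsPercent(baseline, routed) {
  if (baseline <= 0) {
    throw new RangeError("baseline must be positive");
  }
  return Math.round(((baseline - routed) / baseline) * 100);
}

// [scenario, GPT-4 only (USD), 5-tier routing (USD)] from the table above.
const scenarios = [
  ["Health check", 0.15, 0.005],
  ["Email triage", 0.5, 0.07],
  ["Content draft", 1.5, 0.45],
  ["Complex proposal", 5.0, 5.0],
  ["Monthly (1,000 tasks)", 450, 40],
];

for (const [name, baseline, routed] of scenarios) {
  console.log(`${name}: ${savingsPercent(baseline, routed)}%`);
}
```

The complex proposal row shows no savings because both columns list the same $5.00 cost for that task.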

The insight

Most agent tasks are simple. Routing roughly 65% of work (the Nano and Workhorse tiers) to models priced under $0.30 per million tokens cuts the estimated monthly bill from about $450 to about $40, without sacrificing quality on the tasks that actually need expensive models.

system reliability

Built to stay up

Circuit breakers, exponential backoff, 4-layer deduplication, and 60-second health monitoring. The system recovers from failures automatically.

system-health --metrics

| Metric | Value | Note |
| --- | --- | --- |
| Uptime | 24/7 | Mac mini, Luxembourg |
| Health check interval | 60s | Per-service heartbeat |
| Circuit breaker threshold | 5 failures | Consecutive, per agent |
| Recovery backoff | 1h → 2h → 4h → 8h | Exponential |
| Agent cycle time | 15–30 min | Per agent think cycle |
| Semantic memory vectors | 250+ | Qdrant, 3072-dim Gemini embeddings |
| SQLite databases | 4 | Tasks, CRM, tracker, Qdrant |
| Dedup layers | 4 | Task, channel, proposal, discovery |
| Context thread savings | 40–60% | vs flat memory injection |

Circuit breaker

Per-agent failure tracking with exponential backoff. Five consecutive failures trigger isolation. Auto-resets on recovery.
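
A minimal sketch of a per-agent breaker with the stated threshold (5 consecutive failures) and backoff schedule (1h → 2h → 4h → 8h). The class name, the choice to re-trip on the first failure after an isolation period ends, and the cap at 8 hours are assumptions; the source does not describe those details:

```javascript
const HOUR_MS = 60 * 60 * 1000;

class AgentCircuitBreaker {
  constructor({ threshold = 5, backoffHours = [1, 2, 4, 8] } = {}) {
    this.threshold = threshold;
    this.backoffMs = backoffHours.map((h) => h * HOUR_MS);
    this.consecutiveFailures = 0;
    this.trips = 0; // isolation periods since the last success
    this.openUntil = 0; // epoch ms until which the agent is isolated
  }

  // True when the agent is not currently isolated.
  canRun(now = Date.now()) {
    return now >= this.openUntil;
  }

  // Any success fully resets the breaker.
  recordSuccess() {
    this.consecutiveFailures = 0;
    this.trips = 0;
    this.openUntil = 0;
  }

  // Count a failure; at or past the threshold, isolate the agent for the
  // next backoff step (assumed to stay at the last step once reached).
  recordFailure(now = Date.now()) {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures < this.threshold) return;
    const step = Math.min(this.trips, this.backoffMs.length - 1);
    this.openUntil = now + this.backoffMs[step];
    this.trips += 1;
  }
}

const breaker = new AgentCircuitBreaker();
for (let i = 0; i < 5; i++) breaker.recordFailure();
console.log(breaker.canRun()); // false: isolated for the first backoff step
```

Passing `now` explicitly keeps the class testable without waiting for real time to elapse.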

4-layer dedup

Task-level, channel-level, proposal-level, and discovery-level deduplication. Zero spam, zero duplicate work.
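
One way to picture layered deduplication: each layer keeps its own set of fingerprints, and an item is dropped as soon as any layer has already seen its key. This is an in-memory sketch only; the `LayeredDeduper` class and the fingerprint strings are hypothetical, and the source does not say how each layer derives or stores its keys:

```javascript
// The four layers named above, checked in this order.
const DEDUP_LAYERS = ["task", "channel", "proposal", "discovery"];

class LayeredDeduper {
  constructor(layers = DEDUP_LAYERS) {
    this.seen = new Map(layers.map((layer) => [layer, new Set()]));
  }

  // `keys` maps a layer name to a fingerprint string; layers that do not
  // apply to the item can be omitted. Returns the first layer that flags a
  // duplicate, or null if the item is new (its keys are then remembered).
  check(keys) {
    for (const [layer, seenKeys] of this.seen) {
      const key = keys[layer];
      if (key !== undefined && seenKeys.has(key)) return layer;
    }
    for (const [layer, seenKeys] of this.seen) {
      const key = keys[layer];
      if (key !== undefined) seenKeys.add(key);
    }
    return null;
  }
}

const deduper = new LayeredDeduper();
console.log(deduper.check({ task: "t1", channel: "ops:disk-alert" })); // null
console.log(deduper.check({ task: "t2", channel: "ops:disk-alert" })); // "channel"
```

Returning the layer name rather than a boolean makes it easy to log which layer caught the duplicate.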

Health monitor

60-second heartbeat checks on every service. Telegram alerts on failure. Auto-restart via LaunchAgent.
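
The heartbeat check reduces to comparing each service's last heartbeat against the 60-second interval. A sketch under stated assumptions: the function names, the grace period of two missed intervals, and the `onStale` callback are illustrative, and the Telegram alert and LaunchAgent restart are left to the caller:

```javascript
const HEARTBEAT_INTERVAL_MS = 60_000;

// `lastHeartbeats` is a Map of service name -> epoch ms of the last heartbeat.
// A service counts as stale once it has been silent for more than
// `graceIntervals` heartbeat intervals (an assumed tolerance).
function findStaleServices(lastHeartbeats, now = Date.now(), graceIntervals = 2) {
  const stale = [];
  for (const [service, lastSeen] of lastHeartbeats) {
    if (now - lastSeen > graceIntervals * HEARTBEAT_INTERVAL_MS) {
      stale.push(service);
    }
  }
  return stale;
}

// Run the check every interval; `onStale` is where an alert or restart
// would be triggered. Returns the timer so the caller can clearInterval it.
function startMonitor(lastHeartbeats, onStale) {
  return setInterval(() => {
    const stale = findStaleServices(lastHeartbeats);
    if (stale.length > 0) onStale(stale);
  }, HEARTBEAT_INTERVAL_MS);
}

const beats = new Map([["tasks-db", Date.now()]]);
console.log(findStaleServices(beats)); // []
```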

Proposal lifecycle

9-state machine with Sentinel validation, 15-minute rollback windows, and human override via Discord reactions.
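
The source does not list the nine states, so the following sketch uses a smaller, invented set purely to show the shape: a transition table plus a 15-minute rollback window. The state names, `transition`, and the `executedAt` field are all hypothetical:

```javascript
const ROLLBACK_WINDOW_MS = 15 * 60 * 1000;

// Hypothetical states and allowed transitions (not Klawty's actual nine).
const TRANSITIONS = new Map([
  ["proposed", ["validating"]],
  ["validating", ["approved", "rejected"]],
  ["approved", ["executed", "rejected"]], // rejection here models a human override
  ["executed", ["rolled_back", "finalized"]],
  ["rejected", []],
  ["rolled_back", []],
  ["finalized", []],
]);

// Return a new proposal object in `nextState`, or throw if the move is not
// allowed or the rollback window has already closed.
function transition(proposal, nextState, now = Date.now()) {
  const allowed = TRANSITIONS.get(proposal.state) ?? [];
  if (!allowed.includes(nextState)) {
    throw new Error(`Illegal transition: ${proposal.state} -> ${nextState}`);
  }
  if (
    nextState === "rolled_back" &&
    now - proposal.executedAt > ROLLBACK_WINDOW_MS
  ) {
    throw new Error("Rollback window has closed");
  }
  const executedAt = nextState === "executed" ? now : proposal.executedAt;
  return { ...proposal, state: nextState, executedAt };
}

let proposal = { id: "p-1", state: "proposed" };
proposal = transition(proposal, "validating");
proposal = transition(proposal, "approved");
console.log(proposal.state); // "approved"
```

Keeping the transitions in a table means an illegal move fails loudly instead of silently changing state.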

Memory distillation

Daily extraction of insights into semantic vectors. 250+ Qdrant embeddings for long-term agent memory.

Daily backups

02:00 CET automated snapshots of all databases, agent identities, and memory files.

architecture

What's running

A single Mac mini running 8 coordinated agents, each with specialized tools, skills, and domain knowledge. No Kubernetes. No cloud functions. Just Node.js and SQLite.

Agents: 8, coordinated
Tools: 120+, across all agents
Skills: 27, domain-specific
Channels: 20+ (Discord, email, Telegram...)
Proposals/month: 200+, with Sentinel validation
Codebase: 25K+ lines of JavaScript

stack.txt
# Runtime
Node.js 22 · SQLite (WAL mode) · Qdrant Cloud
macOS · LaunchAgents · pm2 fallback
# LLM Providers
Kimi · Gemini · Claude · GPT-4 · OpenRouter
5-tier routing · 3 fallback chains · daily cost caps
# Channels
Discord (8 bots) · Gmail · Telegram · Webhooks
# Security
Sandbox · Policy engine · Prompt injection defense
chmod 600 credentials · Tool-level risk tiers · Sentinel watchdog
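
Fallback chains and daily cost caps, as listed in the stack, can be combined in a few lines. This is a sketch under assumptions, not Klawty's implementation: `DailyCostCap`, `callWithFallback`, the UTC day boundary, and the caller-supplied `callModel` function are all hypothetical, and no real provider SDK is used:

```javascript
// Tracks spend per UTC day and refuses calls that would exceed the cap.
class DailyCostCap {
  constructor(capUsd) {
    this.capUsd = capUsd;
    this.day = null;
    this.spentUsd = 0;
  }

  // Reset the running total when the UTC date changes.
  rollOver(now) {
    const day = new Date(now).toISOString().slice(0, 10);
    if (day !== this.day) {
      this.day = day;
      this.spentUsd = 0;
    }
  }

  canSpend(amountUsd, now = Date.now()) {
    this.rollOver(now);
    return this.spentUsd + amountUsd <= this.capUsd;
  }

  record(amountUsd, now = Date.now()) {
    this.rollOver(now);
    this.spentUsd += amountUsd;
  }
}

// Try each model in `chain` in order until one succeeds.
// `callModel(model)` is a caller-supplied async function.
async function callWithFallback(chain, callModel, cap, estimatedCostUsd, now = Date.now()) {
  if (!cap.canSpend(estimatedCostUsd, now)) {
    throw new Error("Daily cost cap reached");
  }
  let lastError;
  for (const model of chain) {
    try {
      const result = await callModel(model);
      cap.record(estimatedCostUsd, now);
      return { model, result };
    } catch (err) {
      lastError = err; // fall through to the next model in the chain
    }
  }
  throw new Error(`All models in chain failed: ${lastError?.message}`);
}
```

A caller would pass an ordered list such as a primary model followed by its backups; only a successful call is charged against the cap in this sketch.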

Disclaimer

All data comes from a real production deployment running since late 2025. Your results will vary based on agent count, task complexity, and model selection. Cost figures reflect API pricing at time of measurement and may change as providers update their rates. Klawty is model-agnostic: any tier can be swapped to any provider via configuration.

Run your own benchmarks

Klawty is open source. Deploy it, measure it, and see what your agents actually cost.