# Production benchmarks

Real numbers from a real deployment: 8 agents, 1,000+ tasks per month, running 24/7 on a Mac mini in Luxembourg. No synthetic tests, no cherry-picked results — just what the system actually costs and how it actually performs.
## 91% cheaper than single-model

5-tier LLM routing sends each task to the cheapest model that can handle it. Health checks don't need GPT-4. Strategic decisions don't run on Qwen.

| Metric | Value |
|---|---|
| Monthly tasks completed | 1,018+ |
| Monthly LLM cost | ~€37 ($40.33) |
| Cost per task | ~€0.036 |
| Daily average | ~€1.23 |
| Coordinated agents | 8 |
| LLM providers used | 5 |

**The 5 tiers**

| Tier | Model | Cost/M tokens | % of tasks | Use case |
|---|---|---|---|---|
| Nano | Qwen 3.5 Flash | $0.07 | ~35% | Health checks, status, simple routing |
| Workhorse | DeepSeek V3.2 | $0.27 | ~30% | Email triage, CRM updates, routine tasks |
| Capable | Gemini 3 Flash | $1.25 | ~20% | Content drafting, analysis, SEO |
| Power | Kimi K2.5 | $5.00 | ~12% | Complex reasoning, proposals, multi-step |
| Premium | Claude Sonnet 4.6 | $15.00 | ~3% | Strategic decisions, BI reports, edge cases |
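The routing itself can be as simple as a cheapest-first lookup. The sketch below uses the tier names and prices from the table above; the single numeric complexity score and its thresholds are illustrative assumptions, not Klawty's actual scoring logic.

```javascript
// Tiers ordered cheapest-first, so the first match is the cheapest fit.
// Names and prices come from the tier table; maxComplexity is invented.
const TIERS = [
  { name: "nano",      model: "Qwen 3.5 Flash",   costPerMTok: 0.07,  maxComplexity: 1 },
  { name: "workhorse", model: "DeepSeek V3.2",    costPerMTok: 0.27,  maxComplexity: 2 },
  { name: "capable",   model: "Gemini 3 Flash",   costPerMTok: 1.25,  maxComplexity: 3 },
  { name: "power",     model: "Kimi K2.5",        costPerMTok: 5.00,  maxComplexity: 4 },
  { name: "premium",   model: "Claude Sonnet 4.6", costPerMTok: 15.00, maxComplexity: 5 },
];

// Pick the cheapest tier whose capability covers the task's complexity.
function routeTask(complexity) {
  const tier = TIERS.find((t) => complexity <= t.maxComplexity);
  return tier ?? TIERS[TIERS.length - 1]; // clamp anything above 5 to premium
}
```

A health check scored at complexity 1 lands on the nano tier; only tasks the cheaper tiers can't cover ever reach the premium model.
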

**Cost comparison**

| Scenario | GPT-4 only | Klawty 5-tier | Savings |
|---|---|---|---|
| Health check | $0.15 | $0.005 | 97% |
| Email triage | $0.50 | $0.07 | 86% |
| Content draft | $1.50 | $0.45 | 70% |
| Complex proposal | $5.00 | $5.00 | 0% |
| Monthly (1,000 tasks) | ~$450 | ~$40 | 91% |
### The insight
Most agent tasks are simple. Routing 65% of work to sub-$0.30/M-token models drops your monthly bill from hundreds to tens of euros — without sacrificing quality on the tasks that actually need expensive models.
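As a sanity check on the arithmetic, here is the blended token price implied by the tier shares and prices in the table above, compared against the premium tier's $15/M rate as a rough stand-in for a single-model deployment (actual GPT-4 pricing differs, so treat this as an approximation):

```javascript
// Blended $/M-token implied by the tier-mix table above.
// Shares and prices are copied from that table.
const mix = [
  [0.35, 0.07],   // Nano
  [0.30, 0.27],   // Workhorse
  [0.20, 1.25],   // Capable
  [0.12, 5.00],   // Power
  [0.03, 15.00],  // Premium
];
const blended = mix.reduce((sum, [share, price]) => sum + share * price, 0);
const savings = 1 - blended / 15.0;
// blended ≈ $1.41/M tokens; savings ≈ 0.91, in line with the 91% headline.
```
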
## Built to stay up
Circuit breakers, exponential backoff, 4-layer deduplication, and 60-second health monitoring. The system recovers from failures automatically.

| Metric | Value | Note |
|---|---|---|
| Uptime | 24/7 | Mac mini, Luxembourg |
| Health check interval | 60s | Per-service heartbeat |
| Circuit breaker threshold | 5 failures | Consecutive, per agent |
| Recovery backoff | 1h → 2h → 4h → 8h | Exponential |
| Agent cycle time | 15–30 min | Per agent think cycle |
| Semantic memory vectors | 250+ | Qdrant, 3072-dim Gemini embeddings |
| SQLite databases | 4 | Tasks, CRM, tracker, Qdrant |
| Dedup layers | 4 | Task, channel, proposal, discovery |
| Context thread savings | 40–60% | vs flat memory injection |
### Circuit breaker

Per-agent failure tracking with exponential backoff. 5 consecutive failures trigger isolation. Auto-resets on recovery.
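A minimal sketch of that behavior, using the 5-failure threshold and the 1h → 2h → 4h → 8h backoff schedule from the table above; the class shape and method names are assumptions, not Klawty's actual code.

```javascript
const HOUR = 60 * 60 * 1000;

class CircuitBreaker {
  constructor(threshold = 5) {
    this.threshold = threshold;
    this.failures = 0;      // consecutive failures since the last success
    this.trips = 0;         // how many times the breaker has opened
    this.isolatedUntil = 0; // epoch ms; 0 means the agent is not isolated
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      // Backoff doubles each trip: 1h -> 2h -> 4h -> 8h (capped).
      const backoff = Math.min(HOUR * 2 ** this.trips, 8 * HOUR);
      this.trips += 1;
      this.failures = 0;
      this.isolatedUntil = now + backoff;
    }
  }
  recordSuccess() {
    // Any success auto-resets the breaker.
    this.failures = 0;
    this.trips = 0;
    this.isolatedUntil = 0;
  }
  isIsolated(now = Date.now()) {
    return now < this.isolatedUntil;
  }
}
```
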
### 4-layer dedup

Task-level, channel-level, proposal-level, and discovery-level deduplication. Zero spam, zero duplicate work.
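One way to sketch the four layers: derive a key per layer and reject an item if any layer has seen its key before. Only the four layer names come from this doc; the key-derivation rules below are hypothetical.

```javascript
// One seen-set per dedup layer.
const seen = {
  task: new Set(),
  channel: new Set(),
  proposal: new Set(),
  discovery: new Set(),
};

// Hypothetical per-layer keys for a work item.
function dedupKeys(item) {
  return {
    task: item.taskId,
    channel: `${item.channel}:${item.taskId}`,
    proposal: item.proposalHash,
    discovery: item.sourceUrl,
  };
}

function shouldProcess(item) {
  const keys = dedupKeys(item);
  // Reject if ANY layer has already seen this item.
  for (const layer of Object.keys(seen)) {
    if (keys[layer] != null && seen[layer].has(keys[layer])) return false;
  }
  // Otherwise record the item in every layer and let it through.
  for (const layer of Object.keys(seen)) {
    if (keys[layer] != null) seen[layer].add(keys[layer]);
  }
  return true;
}
```
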
### Health monitor

60-second heartbeat checks on every service. Telegram alerts on failure. Auto-restart via LaunchAgent.
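A sketch of the heartbeat loop; `checkService` and `alert` are hypothetical stand-ins for the real probe and the Telegram transport, and only the 60-second interval comes from the table above.

```javascript
// Probe every service once; alert on each failure and return the failures.
async function runChecks(services, checkService, alert) {
  const failed = [];
  for (const svc of services) {
    let healthy = false;
    try {
      healthy = await checkService(svc); // probe one service
    } catch {
      healthy = false; // a throwing probe counts as unhealthy
    }
    if (!healthy) {
      failed.push(svc);
      await alert(`${svc} failed health check`); // e.g. a Telegram message
    }
  }
  return failed;
}

// Sweep every service on a fixed 60-second heartbeat.
function startHealthMonitor(services, checkService, alert, intervalMs = 60_000) {
  return setInterval(() => runChecks(services, checkService, alert), intervalMs);
}
```
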
### Proposal lifecycle

9-state machine with Sentinel validation, 15-minute rollback windows, and human override via Discord reactions.
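The doc specifies 9 states, Sentinel validation, a 15-minute rollback window, and Discord-reaction override, but not the state names. The sketch below invents placeholder names purely to show the shape of such a machine.

```javascript
// Hypothetical 9-state transition table (state names are placeholders).
const TRANSITIONS = {
  draft:            ["validating"],
  validating:       ["validated", "rejected"],  // Sentinel validation
  validated:        ["pending_approval"],
  pending_approval: ["approved", "rejected"],   // human override via reactions
  approved:         ["executing"],
  executing:        ["executed", "rejected"],
  executed:         ["rolled_back"],            // only within the rollback window
  rolled_back:      [],
  rejected:         [],
};

const ROLLBACK_WINDOW_MS = 15 * 60 * 1000; // 15 minutes

function transition(proposal, next, now = Date.now()) {
  if (!TRANSITIONS[proposal.state]?.includes(next)) {
    throw new Error(`illegal transition ${proposal.state} -> ${next}`);
  }
  if (next === "rolled_back" && now - proposal.executedAt > ROLLBACK_WINDOW_MS) {
    throw new Error("rollback window (15 min) has closed");
  }
  if (next === "executed") proposal.executedAt = now; // start the window
  proposal.state = next;
  return proposal;
}
```
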
### Memory distillation

Daily extraction of insights into semantic vectors. 250+ Qdrant embeddings for long-term agent memory.
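The distillation step can be sketched as a pure function: map each day's insights to points with an id, a vector, and a payload, the shape a vector-store upsert expects. The `embed` function stands in for the 3072-dimension Gemini embedding call, and the id and field conventions here are assumptions.

```javascript
// Turn the day's extracted insights into vector-store points.
// embed(text) would return a 3072-float array in the real system.
function distill(insights, embed, day) {
  return insights.map((text, i) => ({
    id: `${day}-${i}`,          // hypothetical id scheme: date + index
    vector: embed(text),        // semantic embedding of the insight
    payload: { text, day },     // keep the raw text for retrieval
  }));
}
```
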
### Daily backups

02:00 CET automated snapshots of all databases, agent identities, and memory files.
## What's running

A single Mac mini running 8 coordinated agents, each with specialized tools, skills, and domain knowledge. No Kubernetes. No cloud functions. Just Node.js and SQLite.
## Disclaimer

All data from a real production deployment running since late 2025. Your results will vary based on agent count, task complexity, and model selection. Cost figures reflect API pricing at time of measurement and may change as providers update their rates. Klawty is model-agnostic — swap any tier to any provider via configuration.
## Run your own benchmarks

Klawty is open source. Deploy it, measure it, and see what your agents actually cost.