# Production benchmarks

Real numbers from a real deployment: 8 agents, 1,000+ tasks per month, running 24/7 on a Mac mini in Luxembourg. No synthetic tests, no cherry-picked results — just what the system actually costs and how it actually performs.
## 91% cheaper than single-model

5-tier LLM routing sends each task to the cheapest model that can handle it. Health checks don't need GPT-4. Strategic decisions don't run on Qwen.

| Metric | Value |
|---|---|
| Monthly tasks completed | 1,018+ |
| Monthly LLM cost | ~€37 ($40.33) |
| Cost per task | ~€0.036 |
| Daily average | ~€1.23 |
| Coordinated agents | 8 |
| LLM providers used | 5 |

**The 5 tiers**

| Tier | Model | Cost/M tokens | % of tasks | Use case |
|---|---|---|---|---|
| Nano | Qwen 3.5 Flash | $0.07 | ~35% | Health checks, status, simple routing |
| Workhorse | DeepSeek V3.2 | $0.27 | ~30% | Email triage, CRM updates, routine tasks |
| Capable | Gemini 3 Flash | $1.25 | ~20% | Content drafting, analysis, SEO |
| Power | Kimi K2.5 | $5.00 | ~12% | Complex reasoning, proposals, multi-step |
| Premium | Claude Sonnet 4.6 | $15.00 | ~3% | Strategic decisions, BI reports, edge cases |
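The routing itself can be as simple as a cheapest-first lookup. The sketch below uses the tier names and prices from the table above; the single numeric complexity score and its thresholds are illustrative assumptions, not Klawty's actual scoring logic.

```javascript
// Tiers ordered cheapest-first, so the first match is the cheapest fit.
// Names and prices come from the tier table; maxComplexity is invented.
const TIERS = [
  { name: "nano",      model: "Qwen 3.5 Flash",   costPerMTok: 0.07,  maxComplexity: 1 },
  { name: "workhorse", model: "DeepSeek V3.2",    costPerMTok: 0.27,  maxComplexity: 2 },
  { name: "capable",   model: "Gemini 3 Flash",   costPerMTok: 1.25,  maxComplexity: 3 },
  { name: "power",     model: "Kimi K2.5",        costPerMTok: 5.00,  maxComplexity: 4 },
  { name: "premium",   model: "Claude Sonnet 4.6", costPerMTok: 15.00, maxComplexity: 5 },
];

// Pick the cheapest tier whose capability covers the task's complexity.
function routeTask(complexity) {
  const tier = TIERS.find((t) => complexity <= t.maxComplexity);
  return tier ?? TIERS[TIERS.length - 1]; // clamp anything above 5 to premium
}
```

A health check scored at complexity 1 lands on the nano tier; only tasks the cheaper tiers can't cover ever reach the premium model.
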

**Cost comparison**

| Scenario | GPT-4 only | Klawty 5-tier | Savings |
|---|---|---|---|
| Health check | $0.15 | $0.005 | 97% |
| Email triage | $0.50 | $0.07 | 86% |
| Content draft | $1.50 | $0.45 | 70% |
| Complex proposal | $5.00 | $5.00 | 0% |
| Monthly (1,000 tasks) | ~$450 | ~$40 | 91% |
### The insight
Most agent tasks are simple. Routing 65% of work to sub-$0.30/M-token models drops your monthly bill from hundreds to tens of euros — without sacrificing quality on the tasks that actually need expensive models.
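As a sanity check on the arithmetic, here is the blended token price implied by the tier shares and prices in the table above, compared against the premium tier's $15/M rate as a rough stand-in for a single-model deployment (actual GPT-4 pricing differs, so treat this as an approximation):

```javascript
// Blended $/M-token implied by the tier-mix table above.
// Shares and prices are copied from that table.
const mix = [
  [0.35, 0.07],   // Nano
  [0.30, 0.27],   // Workhorse
  [0.20, 1.25],   // Capable
  [0.12, 5.00],   // Power
  [0.03, 15.00],  // Premium
];
const blended = mix.reduce((sum, [share, price]) => sum + share * price, 0);
const savings = 1 - blended / 15.0;
// blended ≈ $1.41/M tokens; savings ≈ 0.91, in line with the 91% headline.
```
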
## Built to stay up
Circuit breakers, exponential backoff, 4-layer deduplication, and 60-second health monitoring. The system recovers from failures automatically.

| Metric | Value | Note |
|---|---|---|
| Uptime | 24/7 | Mac mini, Luxembourg |
| Health check interval | 60s | Per-service heartbeat |
| Circuit breaker threshold | 5 failures | Consecutive, per agent |
| Recovery backoff | 1h → 2h → 4h → 8h | Exponential |
| Agent cycle time | 15–30 min | Per agent think cycle |
| Semantic memory vectors | 250+ | Qdrant, 3072-dim Gemini embeddings |
| SQLite databases | 4 | Tasks, CRM, tracker, Qdrant |
| Dedup layers | 4 | Task, channel, proposal, discovery |
| Context thread savings | 40–60% | vs flat memory injection |
### Circuit breaker

Per-agent failure tracking with exponential backoff. 5 consecutive failures trigger isolation. Auto-resets on recovery.
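A minimal sketch of that behavior, using the 5-failure threshold and the 1h → 2h → 4h → 8h backoff schedule from the table above; the class shape and method names are assumptions, not Klawty's actual code.

```javascript
const HOUR = 60 * 60 * 1000;

class CircuitBreaker {
  constructor(threshold = 5) {
    this.threshold = threshold;
    this.failures = 0;      // consecutive failures since the last success
    this.trips = 0;         // how many times the breaker has opened
    this.isolatedUntil = 0; // epoch ms; 0 means the agent is not isolated
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      // Backoff doubles each trip: 1h -> 2h -> 4h -> 8h (capped).
      const backoff = Math.min(HOUR * 2 ** this.trips, 8 * HOUR);
      this.trips += 1;
      this.failures = 0;
      this.isolatedUntil = now + backoff;
    }
  }
  recordSuccess() {
    // Any success auto-resets the breaker.
    this.failures = 0;
    this.trips = 0;
    this.isolatedUntil = 0;
  }
  isIsolated(now = Date.now()) {
    return now < this.isolatedUntil;
  }
}
```
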
### 4-layer dedup

Task-level, channel-level, proposal-level, and discovery-level deduplication. Zero spam, zero duplicate work.
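One way to sketch the four layers: derive a key per layer and reject an item if any layer has seen its key before. Only the four layer names come from this doc; the key-derivation rules below are hypothetical.

```javascript
// One seen-set per dedup layer.
const seen = {
  task: new Set(),
  channel: new Set(),
  proposal: new Set(),
  discovery: new Set(),
};

// Hypothetical per-layer keys for a work item.
function dedupKeys(item) {
  return {
    task: item.taskId,
    channel: `${item.channel}:${item.taskId}`,
    proposal: item.proposalHash,
    discovery: item.sourceUrl,
  };
}

function shouldProcess(item) {
  const keys = dedupKeys(item);
  // Reject if ANY layer has already seen this item.
  for (const layer of Object.keys(seen)) {
    if (keys[layer] != null && seen[layer].has(keys[layer])) return false;
  }
  // Otherwise record the item in every layer and let it through.
  for (const layer of Object.keys(seen)) {
    if (keys[layer] != null) seen[layer].add(keys[layer]);
  }
  return true;
}
```
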
### Health monitor

60-second heartbeat checks on every service. Telegram alerts on failure. Auto-restart via LaunchAgent.
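A sketch of the heartbeat loop; `checkService` and `alert` are hypothetical stand-ins for the real probe and the Telegram transport, and only the 60-second interval comes from the table above.

```javascript
// Probe every service once; alert on each failure and return the failures.
async function runChecks(services, checkService, alert) {
  const failed = [];
  for (const svc of services) {
    let healthy = false;
    try {
      healthy = await checkService(svc); // probe one service
    } catch {
      healthy = false; // a throwing probe counts as unhealthy
    }
    if (!healthy) {
      failed.push(svc);
      await alert(`${svc} failed health check`); // e.g. a Telegram message
    }
  }
  return failed;
}

// Sweep every service on a fixed 60-second heartbeat.
function startHealthMonitor(services, checkService, alert, intervalMs = 60_000) {
  return setInterval(() => runChecks(services, checkService, alert), intervalMs);
}
```
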
### Proposal lifecycle

9-state machine with Sentinel validation, 15-minute rollback windows, and human override via Discord reactions.
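The doc specifies 9 states, Sentinel validation, a 15-minute rollback window, and Discord-reaction override, but not the state names. The sketch below invents placeholder names purely to show the shape of such a machine.

```javascript
// Hypothetical 9-state transition table (state names are placeholders).
const TRANSITIONS = {
  draft:            ["validating"],
  validating:       ["validated", "rejected"],  // Sentinel validation
  validated:        ["pending_approval"],
  pending_approval: ["approved", "rejected"],   // human override via reactions
  approved:         ["executing"],
  executing:        ["executed", "rejected"],
  executed:         ["rolled_back"],            // only within the rollback window
  rolled_back:      [],
  rejected:         [],
};

const ROLLBACK_WINDOW_MS = 15 * 60 * 1000; // 15 minutes

function transition(proposal, next, now = Date.now()) {
  if (!TRANSITIONS[proposal.state]?.includes(next)) {
    throw new Error(`illegal transition ${proposal.state} -> ${next}`);
  }
  if (next === "rolled_back" && now - proposal.executedAt > ROLLBACK_WINDOW_MS) {
    throw new Error("rollback window (15 min) has closed");
  }
  if (next === "executed") proposal.executedAt = now; // start the window
  proposal.state = next;
  return proposal;
}
```
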
### Memory distillation

Daily extraction of insights into semantic vectors. 250+ Qdrant embeddings for long-term agent memory.
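The distillation step can be sketched as a pure function: map each day's insights to points with an id, a vector, and a payload, the shape a vector-store upsert expects. The `embed` function stands in for the 3072-dimension Gemini embedding call, and the id and field conventions here are assumptions.

```javascript
// Turn the day's extracted insights into vector-store points.
// embed(text) would return a 3072-float array in the real system.
function distill(insights, embed, day) {
  return insights.map((text, i) => ({
    id: `${day}-${i}`,          // hypothetical id scheme: date + index
    vector: embed(text),        // semantic embedding of the insight
    payload: { text, day },     // keep the raw text for retrieval
  }));
}
```
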
### Daily backups

02:00 CET automated snapshots of all databases, agent identities, and memory files.
## What's running

A single Mac mini running 8 coordinated agents, each with specialized tools, skills, and domain knowledge. No Kubernetes. No cloud functions. Just Node.js and SQLite.
## Disclaimer

All data from a real production deployment running since late 2025. Your results will vary based on agent count, task complexity, and model selection. Cost figures reflect API pricing at time of measurement and may change as providers update their rates. Klawty is model-agnostic — swap any tier to any provider via configuration.
## Run your own benchmarks

Klawty is open source. Deploy it, measure it, and see what your agents actually cost.