6-Tier Memory: How AI Agents Remember What Matters
How Klawty's 6-tier memory system prevents token waste and gives AI agents the right context at the right time — from SOUL.md identity to Qdrant vectors.
The problem with flat memory
Most agent frameworks dump everything into the prompt. Every conversation turn, every fact, every past decision — flattened into one massive context window. This is expensive and counterproductive. A 128K context window costs real money when you're running 8 agents on 15-minute cycles. And larger prompts don't just cost more: LLMs degrade when fed irrelevant context. More tokens do not mean better decisions.
We measured this in production. Flat memory injection produced prompts averaging 14,000 tokens. After switching to tiered memory with context threads, the same agents averaged 5,800 tokens — a 59% reduction — with better task completion rates.
The 6 tiers
Klawty organizes agent memory into six tiers, each with different persistence, search characteristics, and token costs:
Tier 1: SOUL.md — The agent's identity. Name, role, behavioral rules, communication style. Loaded on every cycle. Never changes during runtime. Think of it as the agent's DNA. Typically 200-400 tokens.
Tier 2: Context threads — The key innovation. Instead of injecting full conversation history, agents maintain structured threads tied to active tasks. Each thread contains only what's relevant to that specific task.
```json
{
  "threadId": "task-847",
  "agent": "leila",
  "entries": [
    { "role": "observation", "content": "Client replied to invoice #2847 — requesting 30-day terms", "ts": "2026-01-02T09:15:00Z" },
    { "role": "decision", "content": "Flagged for Falco — payment terms change requires finance approval", "ts": "2026-01-02T09:15:02Z" },
    { "role": "handoff", "content": "Created task for Falco: review payment terms for client Meridian", "ts": "2026-01-02T09:15:03Z" }
  ]
}
```
When Falco picks up this task, the thread travels with it. No re-explaining. No lost context across agent handoffs.
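A minimal sketch of how a thread might travel with a task across a handoff. The `ContextThread` class and its method names are hypothetical illustrations, not Klawty's actual API; only the entry shape (`role`, `content`, `ts`) comes from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class ContextThread:
    """Per-task thread: structured entries instead of raw chat history."""
    thread_id: str
    agent: str
    entries: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str, ts: str) -> None:
        self.entries.append({"role": role, "content": content, "ts": ts})

    def hand_off(self, to_agent: str) -> "ContextThread":
        # The receiving agent gets the same entries: no re-explaining.
        return ContextThread(self.thread_id, to_agent, list(self.entries))

thread = ContextThread("task-847", "leila")
thread.add("observation", "Client replied to invoice #2847 -- requesting 30-day terms",
           "2026-01-02T09:15:00Z")
thread.add("handoff", "Created task for Falco: review payment terms",
           "2026-01-02T09:15:03Z")
falco_thread = thread.hand_off("falco")  # Falco sees the full task context
```

The key design choice is that the thread is scoped to the task, not the conversation, so the handoff carries exactly the entries Falco needs and nothing else.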
Tier 3: Handoff notes — Structured messages between agents. When Leila hands a task to Falco, the note includes what was tried, what failed, and what the client expects. These are ephemeral — consumed and discarded after the receiving agent processes them.
Tier 4: MEMORY.md — Permanent agent knowledge. Lessons learned, client preferences, recurring patterns. Capped at 100 lines to prevent bloat. Updated by the memory distiller, not by agents directly.
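The 100-line cap implies an eviction policy when the distiller appends a new lesson. A sketch, assuming oldest-first eviction (the article only states the cap, not how lines are chosen for removal):

```python
MAX_LINES = 100  # cap from the article; eviction policy below is an assumption

def append_lesson(memory_text: str, lesson: str, max_lines: int = MAX_LINES) -> str:
    """Append a distilled lesson to MEMORY.md, dropping the oldest
    lines once the cap is exceeded."""
    lines = [l for l in memory_text.splitlines() if l.strip()]
    lines.append(f"- {lesson}")
    return "\n".join(lines[-max_lines:])  # keep only the newest max_lines
```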
Tier 5: Qdrant vectors — Semantic search over historical data. When an agent needs "what happened last time this client complained about delivery," it queries Qdrant with embeddings. Results are injected only when relevant, scored by similarity threshold.
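The gating step reduces to a filter over scored results. In production this would use Qdrant's `score_threshold` search parameter; the plain-Python sketch below shows the logic, with the 0.78 cutoff taken from the article and the data shapes assumed:

```python
def gate_results(scored_hits: list[tuple[float, str]],
                 threshold: float = 0.78) -> list[str]:
    """Inject only hits whose similarity score clears the threshold,
    best match first."""
    return [text for score, text in sorted(scored_hits, reverse=True)
            if score >= threshold]

hits = [(0.91, "Client escalated delivery delay in Nov"),
        (0.64, "Unrelated pricing question")]
gate_results(hits)  # only the 0.91 hit is injected into the prompt
```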
Tier 6: Session logs — Raw JSONL files per day. Full audit trail. Never loaded into prompts directly — only queried by the memory distiller during its nightly run to extract insights for Tier 4 and Tier 5.
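Writing a session log is just one JSON object per line. A minimal sketch; the article only says "raw JSONL files per day," so the per-agent file naming here is an assumption:

```python
import json
from datetime import date
from pathlib import Path

def log_path(log_dir: Path, agent: str, day: date) -> Path:
    """One raw JSONL file per agent per day (naming is assumed)."""
    return log_dir / f"{agent}-{day.isoformat()}.jsonl"

def append_event(path: Path, event: dict) -> None:
    """Append one JSON object per line. Never loaded into prompts;
    only the nightly distiller reads these files."""
    with path.open("a") as f:
        f.write(json.dumps(event) + "\n")
```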
Why tiers matter
Each tier has a different cost profile. SOUL.md is ~300 tokens every cycle — unavoidable and worth it. Context threads add 200-800 tokens, but only for the active task. Qdrant results might add 500 tokens, but only when the similarity score exceeds 0.78.
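The per-cycle arithmetic above is worth making explicit. Using the article's figures (thread cost taken as the midpoint of its 200-800 range):

```python
# Per-cycle token costs from the tier profile above.
soul = 300    # Tier 1: loaded every cycle, unavoidable
thread = 500  # Tier 2: midpoint of the 200-800 range
qdrant = 500  # Tier 5: injected only when similarity >= 0.78

routine_cycle = soul + thread           # Qdrant gated off: 800 tokens
inquiry_cycle = soul + thread + qdrant  # full semantic search: 1,300 tokens
```

Either way, a cycle stays far below the 14,000-token flat-memory prompts measured earlier.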
The expensive tiers (5 and 6) are gated. An agent doing a routine health check never touches Qdrant. An agent handling a new client inquiry gets full semantic search. The memory system adapts to the task, not the other way around.
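Tier gating can be sketched as a routing table keyed by task type. The task-type names and the table itself are hypothetical; only the two examples (health check skips Qdrant, client inquiry gets full search) come from the article:

```python
# Hypothetical routing table: which memory tiers each task type loads.
TIER_GATES: dict[str, set[str]] = {
    "health_check":   {"soul", "thread"},                       # Tiers 1-2 only
    "client_inquiry": {"soul", "thread", "memory", "qdrant"},   # full search
}

def tiers_for(task_type: str) -> set[str]:
    """Unknown task types default to the cheap profile."""
    return TIER_GATES.get(task_type, {"soul", "thread"})
```

Defaulting unknown task types to the cheap profile keeps cost failures fail-safe: a new task type can never silently trigger expensive tiers.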
The result: 8 agents running 24/7, averaging $1.20/day in total LLM spend. Memory architecture is the single biggest lever for cost control in autonomous agent systems.