Build a coding agent with memory

A practical architecture for a coding agent that gets smarter every session: the recall → act → verify → write-back loop, what to store, the code for each surface, and what the research actually shows.

A coding agent without memory re-learns your stack on every run. Give it a brain and each session starts where the last left off: it recalls prior decisions, builds on what worked, and avoids repeating fixes that failed.

This guide is a practical starting pattern, not a proven-optimal one. See "What the research shows" below for what is established and what remains open.

The loop

The simplest effective shape is recall → act → verify → write-back:

User task
Recall relevant memory        GET /v1/brain/context
Read repo / files / tests
Plan and edit code
Run tests / lints             (verify)
Write durable lessons back    POST /v1/brain/ingest
Next session starts smarter

The verify step matters: write back only what actually held up (tests passed, the decision shipped). Memory that records unverified guesses degrades fast; see "why recall returns weakEvidence".
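The loop above can be sketched in a few lines. This is a minimal illustration, not the Unison SDK: `InMemoryBrain`, `run_session`, and the keyword-overlap recall are hypothetical stand-ins chosen so the example runs without a network call. The point it shows is the gate between verify and write-back.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InMemoryBrain:
    """Hypothetical stand-in for a memory service (not the Unison SDK)."""
    lessons: List[str] = field(default_factory=list)

    def recall(self, query: str) -> List[str]:
        # Naive keyword overlap; a real service would do ranked retrieval.
        words = set(query.lower().split())
        return [l for l in self.lessons if words & set(l.lower().split())]

    def write_back(self, lesson: str) -> None:
        self.lessons.append(lesson)


def run_session(
    task: str,
    brain: InMemoryBrain,
    act: Callable[[str, List[str]], str],
    verify: Callable[[], bool],
) -> bool:
    memory = brain.recall(task)      # 1. recall
    lesson = act(task, memory)       # 2. act; returns the lesson it would record
    ok = verify()                    # 3. verify (for example, run the test suite)
    if ok:
        brain.write_back(lesson)     # 4. write back only what held up
    return ok
```

In a real agent, `verify` would wrap the project's test or lint command, and `act` would be the model-driven edit step. A failed verification leaves memory untouched, so the next session does not inherit an unproven fix.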

A single recall at the start of the session is the easy version. The stronger pattern, and the one Unison's own agent runtime uses, is on-demand mid-turn retrieval: instead of front-loading one recall, the agent queries the brain as it works (when it hits an unknown, or before a decision) via the MCP server tools or repeated context calls. Pair it with automatic write-back at the end of each turn so capture isn't something the model has to remember to do. This is why the Claude Code, Cursor, and Codex integrations hook recall and capture into the session lifecycle for you.
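One way to picture on-demand recall with automatic capture is a turn object that exposes a recall function while the agent works and writes notes back when the turn ends cleanly. This is a sketch under assumptions: the `Turn` class and its `recall` and `capture` callables are hypothetical and stand in for whatever the MCP tools or SDK calls provide in your setup.

```python
from typing import Callable, List


class Turn:
    """One agent turn: recall on demand, capture automatically on exit."""

    def __init__(
        self,
        recall: Callable[[str], List[str]],
        capture: Callable[[List[str]], None],
    ) -> None:
        self._recall = recall
        self._capture = capture
        self.queries: List[str] = []   # every mid-turn lookup, in order
        self.notes: List[str] = []     # lessons to persist at the end

    def recall(self, query: str) -> List[str]:
        self.queries.append(query)
        return self._recall(query)

    def note(self, text: str) -> None:
        self.notes.append(text)

    def __enter__(self) -> "Turn":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Capture only if the turn finished without an exception.
        if exc_type is None and self.notes:
            self._capture(self.notes)
        return False  # never swallow errors


# Usage with a toy store; the dict and list are placeholders for a real backend.
store = {"deploy": ["deploys go through the staging cluster first"]}
captured: List[str] = []
with Turn(lambda q: store.get(q, []), captured.extend) as turn:
    hits = turn.recall("deploy")  # pulled when the agent reaches the deploy step
    turn.note("added staging smoke test to the deploy script")
```

Because capture happens in `__exit__`, the model never has to decide to save anything, and a turn that raises leaves memory unchanged.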

What to store

Memory should hold more than chat embeddings. Store the durable, reusable knowledge a teammate would carry between tasks:

decisions            — what we chose and why (so it isn't re-litigated)
architecture notes   — how the system fits together
prior bugs & fixes   — what broke and what actually fixed it
coding conventions   — patterns, naming, lint rules
test & build commands
deployment quirks
people & ownership   — who owns what
repo-specific workflows

In Unison these become documents, entities, and bitemporal facts, so when a fact changes (a decision reversed, an owner switched) the new version supersedes the old one instead of the stale version being returned with confidence.
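To make supersession concrete, here is a small sketch of a fact store that tracks two time axes: when a fact was true (`valid_from` / `valid_to`) and when the store learned it (`recorded_at`). The `Fact` and `FactStore` names and fields are assumptions for illustration and are not Unison's schema. The sketch also assumes facts arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime                  # when it became true in the world
    recorded_at: datetime                 # when the store learned it
    valid_to: Optional[datetime] = None   # None means still current


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject: str, predicate: str, value: str,
                    valid_from: datetime, recorded_at: datetime) -> None:
        # Close the open fact instead of deleting it, so history is kept.
        for f in self.facts:
            if (f.subject, f.predicate) == (subject, predicate) and f.valid_to is None:
                f.valid_to = valid_from
        self.facts.append(Fact(subject, predicate, value, valid_from, recorded_at))

    def value_at(self, subject: str, predicate: str, when: datetime) -> Optional[str]:
        # Return the value whose validity interval contains `when`.
        for f in self.facts:
            if ((f.subject, f.predicate) == (subject, predicate)
                    and f.valid_from <= when
                    and (f.valid_to is None or when < f.valid_to)):
                return f.value
        return None
```

Here `recorded_at` is stored but not queried; a fuller version could use it to answer "what did we believe on a given date". The design choice worth noting is that an ownership change closes the old interval rather than overwriting it, so a query for today returns the new owner while a query for last spring still returns the old one.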

The code

Recall before, write back after. Using the SDKs:

// TypeScript — @unisonlabs/sdk
import { BrainClient } from "@unisonlabs/sdk";
const brain = new BrainClient({ token: process.env.UNISON_TOKEN });

// 1. recall
const ctx = await brain.context({ q: taskDescription });
const memory = ctx.weakEvidence ? "" : ctx.contextMd;   // inject into the system prompt

// ... agent reads repo, edits, runs tests ...

// 2. write back what held up
await brain.ingest({
  items: [{ type: "conversation", sourceRef: runId, turns }],
});

# Python — unisonlabs SDK for recall; REST for conversation ingest
from unisonlabs import UnisonBrain
client = UnisonBrain()                       # reads UNISON_TOKEN
ctx = client.context(task_description)
memory = "" if ctx.weak_evidence else ctx.context_md

# write-back (the Python SDK has no ingest helper yet, so call the endpoint directly)
import os
import httpx
resp = httpx.post(
    "https://brain.unisonlabs.ai/v1/brain/ingest",
    headers={"Authorization": f"Bearer {os.environ['UNISON_TOKEN']}"},
    json={"items": [{"type": "conversation", "turns": turns, "sourceRef": run_id, "visibility": "workspace"}]},
)
resp.raise_for_status()  # surface a failed write instead of silently dropping it
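The snippets above drop recalled memory when the evidence is weak and otherwise inject it into the system prompt. A minimal sketch of that injection step follows. The `weak_evidence` and `context_md` names mirror the Python snippet; the `build_system_prompt` helper and the section heading it adds are hypothetical.

```python
def build_system_prompt(base_prompt: str, context_md: str, weak_evidence: bool) -> str:
    """Append recalled memory to the prompt only when it is trustworthy."""
    # Weak or empty recall is skipped so the agent is not anchored on noise.
    if weak_evidence or not context_md.strip():
        return base_prompt
    return f"{base_prompt}\n\n## Recalled project memory\n{context_md.strip()}"
```

Keeping this in one function makes the fallback explicit: with weak evidence the agent runs on the base prompt alone and relies on reading the repo.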

Already in an agent harness? Skip the plumbing: the Claude Code, Cursor, Codex, and MCP integrations wire recall and capture into the session for you, and the LangGraph and Vercel AI SDK integrations drop memory into a node or middleware.

What the research shows

Be honest about what's established and what isn't:

  • Passive recall is largely solved; decision-driving memory is not. Agents near-saturate older recall benchmarks (LoCoMo) yet drop to 40–60% on MemoryArena, which tests memory that must change what the agent does — exactly the coding-agent case. (MemoryArena, MemoryAgentBench, ICLR 2026)
  • On-demand / agentic retrieval beats front-loaded RAG. Letting the agent pull memory mid-task outperforms a single pre-generation recall on long-horizon evals. (LongMemEval-V2, survey: arXiv:2603.07670)
  • A dedicated memory layer beats stuffing context. SOTA memory systems reach competitive accuracy at ~7K tokens/query versus 25K+ for full-context prompting, roughly 3–4× fewer tokens per query. (AI memory benchmarks 2026, BEAM at 1M/10M scale)
  • Temporal correctness and write-back discipline are where systems fail. Stale facts returned "with confidence" are the common production failure; bitemporal supersession + verify-before-write mitigate it. (Zep/Graphiti, arXiv:2501.13956, State of AI Agent Memory 2026)
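The token figures in the third bullet can be checked with quick arithmetic. The 7K and 25K values come from the text above and are treated here as point values, although the source gives them as approximate ("~7K", "25K+"). The query volume is an assumed example number.

```python
def token_ratio(full_context_tokens: int, memory_tokens: int) -> float:
    """How many times more tokens full-context prompting uses per query."""
    return full_context_tokens / memory_tokens


ratio = token_ratio(25_000, 7_000)               # about 3.57
queries = 1_000                                  # assumed volume, for illustration
tokens_saved = (25_000 - 7_000) * queries        # 18,000,000 input tokens
```

Since 25K is a lower bound for full-context prompting, the actual ratio on a large repository may be higher than this estimate.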

What this means for you: treat the loop above as a strong default, not gospel. The honest frontier is decision-driving recall and on-demand retrieval. We publish unison-evals — an open harness scoring every memory system on the same datasets, including a decision-driving test — so you can measure these tradeoffs on your own task instead of taking anyone's word for it.
