Benchmarks

unison-evals is a public, reproducible benchmark harness with four datasets (LongMemEval, LOCOMO, MemoryAgentBench, Context-Bench), black-box adapters, and the same metrics for every system.

Memory vendors publishing their own unreproducible numbers is how this category earned its benchmark disputes. Unison's answer is structural: unison-evals is a public harness where every system, Unison included, is a black box behind the same adapter contract, scored on the same datasets with the same metrics.

What ships (v0.1)

4 datasets: LongMemEval, LOCOMO, MemoryAgentBench, Context-Bench
2 tracks: agent-oracle (retrieval + reasoning over a corpus) and agent-e2e (full pipeline)
Adapters: about 80 lines of code to add any system; point the adapter at an API or CLI and run
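To make the black-box idea concrete, here is a minimal sketch of what an adapter and a scoring loop could look like. It is an illustration only: the names `MemoryAdapter`, `ingest`, `answer`, `Example`, `KeywordMemory`, `exact_match`, and `evaluate` are hypothetical and are not taken from the unison-evals repo, and the substring metric stands in for whatever judge the harness actually configures. The real adapter contract is defined in the repo.

```python
from dataclasses import dataclass
from typing import Protocol


class MemoryAdapter(Protocol):
    """Hypothetical contract: the harness only ever sees these two calls."""

    def ingest(self, session_id: str, text: str) -> None: ...
    def answer(self, session_id: str, question: str) -> str: ...


@dataclass
class Example:
    """One hypothetical benchmark item: context to store, then a question."""
    session_id: str
    context: list[str]
    question: str
    expected: str


class KeywordMemory:
    """Toy system under test: stores lines, returns the one sharing most words."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def ingest(self, session_id: str, text: str) -> None:
        self._store.setdefault(session_id, []).append(text)

    def answer(self, session_id: str, question: str) -> str:
        words = set(question.lower().split())
        lines = self._store.get(session_id, [])
        # Pick the stored line with the largest word overlap; "" if nothing stored.
        return max(lines, key=lambda line: len(words & set(line.lower().split())), default="")


def exact_match(prediction: str, expected: str) -> float:
    """Stand-in metric (assumption): 1.0 if the expected string appears in the prediction."""
    return float(expected.lower() in prediction.lower())


def evaluate(adapter: MemoryAdapter, examples: list[Example]) -> float:
    """Run every example through the adapter and average the same metric for all."""
    scores = []
    for ex in examples:
        for turn in ex.context:
            adapter.ingest(ex.session_id, turn)
        scores.append(exact_match(adapter.answer(ex.session_id, ex.question), ex.expected))
    return sum(scores) / len(scores) if scores else 0.0
```

Because `evaluate` depends only on the two adapter methods, swapping `KeywordMemory` for a class that wraps an HTTP API or a CLI call would leave the datasets and the metric untouched, which is the property the harness is built around.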

Reproduce instead of trust

git clone https://github.com/unison-labs-ai/Unison-evals
# see README + METHODOLOGY.md for dataset setup and the exact judge configuration

Current results, per-dataset methodology, judge prompts, and variance rules live in the repo, because numbers belong next to the code that produced them. To benchmark your own memory system against the same bar, implement the adapter contract and open a PR. A hosted leaderboard is planned next.
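As a generic illustration of why variance rules matter, the sketch below summarizes repeated runs of one system as a mean and a sample standard deviation rather than a single number. This is an assumption about the general idea only: `summarize_runs` is a hypothetical helper, and the actual rules the harness applies are the ones documented in the repo.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize per-run scores for one system on one dataset (hypothetical helper)."""
    if not scores:
        raise ValueError("need at least one run")
    # Sample standard deviation is undefined for a single run, so report 0.0 there.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"runs": len(scores), "mean": mean(scores), "stdev": spread}
```

Reporting the spread alongside the mean makes it easier to tell whether a gap between two systems is larger than the run-to-run noise of the judge.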
