Benchmarks
unison-evals is a public, reproducible benchmark harness: 4 datasets (LongMemEval, LOCOMO, MemoryAgentBench, Context-Bench), black-box adapters, same metrics for every system.
Memory vendors publishing their own unreproducible numbers is how this category earned its benchmark disputes. Unison's answer is structural: unison-evals is a public harness where every system - Unison included - is a black box behind the same adapter contract, scored on the same datasets with the same metrics.
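As a rough illustration of "same datasets, same metrics, every system a black box", here is a minimal sketch. It is not the unison-evals API: the `System` type, `exact_match`, `score_all`, and the toy data are all hypothetical names invented for this example, and the real contract and metrics are defined in the repo.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not the unison-evals API.
# A system is a black box: it gets a context and a question, and returns an answer.
System = Callable[[List[str], str], str]
Example = Tuple[List[str], str, str]  # (context, question, gold answer)


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 when the normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def score_all(systems: Dict[str, System], dataset: List[Example]) -> Dict[str, float]:
    """Run every system over the same examples and average the same metric."""
    results: Dict[str, float] = {}
    for name, system in systems.items():
        scores = [exact_match(system(ctx, q), gold) for ctx, q, gold in dataset]
        results[name] = sum(scores) / len(scores)
    return results


# Toy dataset and toy systems, invented for this sketch.
dataset: List[Example] = [
    (["Ada lives in Paris."], "Where does Ada live?", "Paris"),
    (["Bo owns a cat."], "What pet does Bo own?", "cat"),
]


def last_word_system(context: List[str], question: str) -> str:
    # Answers with the last word of the most recent context line.
    return context[-1].rstrip(".").split()[-1]


def always_paris(context: List[str], question: str) -> str:
    # A deliberately weak baseline that ignores its input.
    return "Paris"


scores = score_all(
    {"last_word": last_word_system, "always_paris": always_paris}, dataset
)
```

The point of the sketch is that `score_all` never looks inside a system: swapping in a different one changes only the dictionary entry, not the dataset or the metric.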
What ships (v0.1)
- 4 datasets: LongMemEval, LOCOMO, MemoryAgentBench, Context-Bench
- 2 tracks: agent-oracle (retrieval + reasoning over a corpus) and agent-e2e (full pipeline)
- Adapters: about 80 lines of code to add any system - point the adapter at an API or CLI and run
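To give a sense of what a CLI-backed adapter might look like, here is a small sketch. The method names `ingest` and `answer`, the `MemoryAdapter` protocol, and the `CliAdapter` class are assumptions made for illustration; the actual adapter contract is the one documented in the repo. The "CLI" used below is a stand-in one-liner run through the current Python interpreter.

```python
import subprocess
import sys
from typing import List, Protocol


class MemoryAdapter(Protocol):
    """Hypothetical adapter contract; the real one is defined in the repo."""

    def ingest(self, session: List[str]) -> None: ...

    def answer(self, question: str) -> str: ...


class CliAdapter:
    """Wraps a command-line memory system as a black box (illustrative only)."""

    def __init__(self, command: List[str]) -> None:
        self.command = command
        self.history: List[str] = []

    def ingest(self, session: List[str]) -> None:
        # Buffer the conversation; a real adapter might forward it to the system here.
        self.history.extend(session)

    def answer(self, question: str) -> str:
        # Send history plus the question on stdin and read the answer from stdout.
        payload = "\n".join(self.history + [question])
        proc = subprocess.run(
            self.command, input=payload, capture_output=True, text=True, check=True
        )
        return proc.stdout.strip()


# Stand-in "memory system": a one-liner that echoes the first line it was given.
fake_cli = [sys.executable, "-c", "import sys; print(sys.stdin.read().splitlines()[0])"]

adapter: MemoryAdapter = CliAdapter(fake_cli)
adapter.ingest(["The launch code is 42."])
reply = adapter.answer("What is the launch code?")
```

An API-backed adapter would have the same two methods and replace the `subprocess.run` call with an HTTP request, so the harness code calling `ingest` and `answer` would not need to change.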
Reproduce instead of trust
git clone https://github.com/unison-labs-ai/Unison-evals
# see README + METHODOLOGY.md for dataset setup and the exact judge configuration
Current results, per-dataset methodology, judge prompts, and variance rules live in the repo - numbers belong next to the code that produced them. To benchmark your own memory system against the same bar, implement the adapter contract and open a PR; the hosted leaderboard is next.