Why I Built MemoryStress.
Every AI memory system claims high recall. None have been tested at 1,000 sessions. So I built the benchmark that does.
The first benchmark that measures what happens when memory systems age.
OMEGA scores 95.4% on LongMemEval. That number measures recall across 40 static sessions. But here's the question nobody is asking: what happens at session 500? At session 1,000? When your memory store has ingested ten months of daily conversations and must still find a fact mentioned once, six months ago?
No existing benchmark answers this. So I built one.
MemoryStress is a longitudinal memory benchmark - 583 facts embedded naturally across 1,000 GPT-4o-generated conversation sessions spanning 10 simulated months. It tests retention under accumulation pressure, contradiction chains, cross-agent handoffs, and the slow entropy that destroys memory systems over time.
The Gap in Memory Benchmarks
Every memory system on the market publishes LongMemEval scores. Mastra claims 94.87%. OMEGA claims 95.4%. But LongMemEval tests recall from ~40 clean sessions with no accumulation pressure, no eviction, and no multi-agent complexity. It's a great test of retrieval quality. It tells you nothing about what happens when memory ages.
| Benchmark | Sessions | What It Misses |
|---|---|---|
| LongMemEval | ~40 | No accumulation pressure |
| MemoryAgentBench | Short | No degradation curves |
| BEAM | Synthetic | No realistic noise |
| MemoryStress | 1,000 | First longitudinal benchmark |
The architectural question MemoryStress exposes is this: systems that compress all memories into a fixed-size context window (Mastra's Observational Memory, MemGPT's context packing) hit a ceiling when the data exceeds that window. At session 200, maybe session 300, they're forced to evict or summarize - and old facts start disappearing.
Persistent architectures like OMEGA's don't have this problem. SQLite doesn't run out of context. The question is whether retrieval degrades as the store grows. MemoryStress measures exactly that.
How MemoryStress Works
The benchmark runs in three phases, each designed to add more pressure to the memory system:
Phase 1: Foundation
Sessions 1–100. Clean, low noise. Core facts are established. This is the baseline — if you can't recall facts from here, your system has a fundamental problem.
Phase 2: Growth
Sessions 101–500. Volume increases. Some contradictions appear. Topics multiply. This phase simulates a few months of real usage where the memory store grows significantly.
Phase 3: Stress
Sessions 501–1,000. Dense, high-entropy, multi-topic sessions. Facts compete for retrieval space. Contradictions chain. This is where compression-based systems would cliff.
At phase boundaries, the benchmark asks 300 questions across 7 types: fact recall, temporal ordering, preference recall, contradiction resolution, single-mention recall, cross-agent recall, and cold start recall. Each question is graded by GPT-4o using type-aware prompts.
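As a rough sketch of how that evaluation schedule hangs together (the phase boundaries and question types come straight from the description above; the names PHASES, QUESTION_TYPES, and evaluate_phase are my own illustration, not the harness's actual API):

```python
# Illustrative sketch only -- not the harness's real code.

PHASES = {
    "foundation": range(1, 101),     # Phase 1: sessions 1-100, clean and low noise
    "growth":     range(101, 501),   # Phase 2: sessions 101-500, volume + contradictions
    "stress":     range(501, 1001),  # Phase 3: sessions 501-1,000, dense and high-entropy
}

QUESTION_TYPES = [
    "fact_recall", "temporal_ordering", "preference_recall",
    "contradiction_resolution", "single_mention_recall",
    "cross_agent_recall", "cold_start_recall",
]

def evaluate_phase(adapter, phase, questions, answer_fn, grade_fn):
    """Ask the benchmark questions at a phase boundary and grade each answer.

    `answer_fn` wraps the GPT-4o answering step and `grade_fn` wraps the
    GPT-4o grader with its type-aware prompt; both are passed in so this
    sketch stays self-contained.
    """
    results = []
    for q in questions:
        notes = adapter.query(q["text"])          # memory system under test
        answer = answer_fn(q["text"], notes)
        results.append({"phase": phase, "type": q["type"],
                        "correct": grade_fn(q, answer)})
    return results
```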
The Degradation Curve
This is the key metric. Not the absolute score - the shape of the curve as sessions accumulate.
The Phase 2 peak at 42.4% is the important signal. OMEGA's persistent architecture means more data actually helps retrieval - a richer embedding space produces better semantic matches. The Phase 3 dip is noise dilution, not data loss. The memories are still there; they're just harder to find in a larger store.
A compression-based system would show a different shape entirely: flat or rising through Phase 1 as the context window fills, then a steep cliff at the point where eviction begins. Early facts don't gradually get harder to find - they're gone.
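For concreteness, the curve itself is nothing more exotic than per-phase accuracy computed from the graded results. A minimal sketch, assuming records shaped like the ones produced by the phase-evaluation sketch earlier (that shape is my assumption, not the harness's output format):

```python
from collections import defaultdict

def degradation_curve(graded):
    """Per-phase accuracy from graded records, e.g. {"phase": "growth", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in graded:
        totals[record["phase"]] += 1
        hits[record["phase"]] += bool(record["correct"])
    return {phase: hits[phase] / totals[phase] for phase in totals}
```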
Is 32.7% Good?
Yes - for what this benchmark tests. MemoryStress asks questions about facts buried in noisy conversations from hundreds of sessions ago, including single-mention facts, contradicted facts, and cross-agent facts. A null adapter (always answers “I don't know”) scores 0%. A raw context-window approach would hit its token ceiling around session 200 and fail everything after that.
For reference, OMEGA scores 95.4% on LongMemEval - which tests recall from ~40 clean sessions. MemoryStress is 25× the session volume with adversarial conditions. The absolute number will go up as I optimize, but the benchmark is calibrated to be hard enough that it reveals real architectural differences.
Per-Type Breakdown
Seven question types expose different failure modes. The spread between best (41.2%) and worst (21.4%) tells you exactly where the retrieval pipeline succeeds and struggles:
| Question Type | Score |
|---|---|
| Temporal ordering | 41.2% |
| Fact recall | 37.5% |
| Cold start recall | 37.5% |
| Preference recall | 37.1% |
| Cross-agent recall | 31.2% |
| Single-mention recall | 27.7% |
| Contradiction resolution | 21.4% |
Contradiction resolution (21.4%) is the hardest category. The LLM retrieves both old and new versions of a fact, and despite strong prompting to prefer the most recent, sometimes picks the wrong one. This is a fundamental retrieval+reasoning problem that every memory system must solve.
The Optimization Journey
I iterated through five configurations, each testing a different retrieval strategy. The baseline scored 27.3%; combining the five techniques below brought the final score to 32.7%:
Contradiction-Aware RAG Prompt
Explicitly instructing the LLM: "when multiple notes discuss the same topic, ALWAYS use the MOST RECENT note." Notes are sorted chronologically (oldest→newest) so recency is structurally communicated. +5 correct answers.
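A minimal sketch of what that prompt assembly can look like. The capitalized instruction is the one quoted above; the surrounding function and the note shape ({"date", "text"}) are illustrative, not the actual prompt template:

```python
def build_rag_prompt(question: str, notes: list[dict]) -> str:
    """Contradiction-aware RAG prompt over retrieved memory notes."""
    ordered = sorted(notes, key=lambda n: n["date"])   # oldest -> newest
    context = "\n".join(f"[{n['date']}] {n['text']}" for n in ordered)
    return (
        "Answer using the notes below. When multiple notes discuss the "
        "same topic, ALWAYS use the MOST RECENT note.\n\n"
        f"Notes (oldest first):\n{context}\n\nQuestion: {question}"
    )
```

Sorting oldest-to-newest means the most recent information sits last in the context, so recency is communicated structurally as well as by instruction.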
Query Augmentation
Using gpt-4.1-mini to generate 3 alternative search queries per question, then merging results from all retrieval passes. This gives single-mention facts more chances of matching, since the original question's wording may not overlap with the stored conversation. +4 correct.
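A sketch of the idea using the OpenAI client directly. The model name comes from the text; the adapter's query() keyword arguments and the merge-by-id step are assumptions on my part:

```python
from openai import OpenAI

client = OpenAI()

def augmented_retrieve(adapter, question: str, k: int = 10) -> list[dict]:
    """Retrieve with the original question plus LLM-generated paraphrases."""
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": ("Rewrite the following question as 3 alternative "
                        f"search queries, one per line:\n{question}"),
        }],
    )
    alternatives = [line.strip() for line in
                    resp.choices[0].message.content.splitlines() if line.strip()]

    merged: dict[str, dict] = {}
    for query in [question, *alternatives]:
        for note in adapter.query(query, top_k=k):   # adapter signature assumed
            merged.setdefault(note["id"], note)      # dedupe by note id
    return list(merged.values())
```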
Recency Boosting
A 1.0→1.8× multiplicative boost to retrieval relevance scores based on note date. Adapted from OMEGA's LongMemEval optimizations. +3 correct.
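A sketch of the boost, assuming each note carries a date and a base relevance score. The 1.0x–1.8x range is from the text; the linear interpolation across the store's date span is my assumption about how dates map to multipliers:

```python
from datetime import date

def apply_recency_boost(notes: list[dict], oldest: date, newest: date,
                        lo: float = 1.0, hi: float = 1.8) -> list[dict]:
    """Scale each note's relevance score by a recency multiplier (1.0x -> 1.8x)."""
    span = max((newest - oldest).days, 1)
    for note in notes:
        age_frac = (note["date"] - oldest).days / span   # 0.0 = oldest, 1.0 = newest
        note["score"] *= lo + (hi - lo) * age_frac
    return sorted(notes, key=lambda n: n["score"], reverse=True)
```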
Fact Extraction at Ingest
Extracting discrete facts from session conversations and storing them as separate memory entries creates additional semantic hooks. The benefit is diffuse rather than targeted. +3 correct overall.
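A sketch of the ingest step, again using GPT-4o to pull out discrete facts. The extraction prompt and the store() call shape are assumptions; only the overall idea (raw session plus extracted facts stored as separate entries) is from the description above:

```python
from openai import OpenAI

client = OpenAI()

def ingest_session(adapter, session_id: str, transcript: str) -> None:
    """Store the raw session plus extracted facts as separate memory entries."""
    adapter.store(session_id, transcript)                 # the conversation itself

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("List the discrete, durable facts stated in this "
                        f"conversation, one per line:\n{transcript}"),
        }],
    )
    for i, fact in enumerate(resp.choices[0].message.content.splitlines()):
        if fact.strip():
            adapter.store(f"{session_id}-fact-{i}", fact.strip())  # extra semantic hooks
```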
Cross-Agent Fallback
When agent-scoped retrieval returns fewer than 5 results, a secondary unscoped pass catches facts planted by other agents. +2 correct.
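A sketch of the fallback logic. The threshold of 5 comes from the description; the agent_id scoping parameter and the query() signature are assumed:

```python
def retrieve_with_fallback(adapter, question: str, agent_id: str,
                           top_k: int = 10, min_results: int = 5) -> list[dict]:
    """Agent-scoped retrieval, widened to all agents when results are sparse."""
    scoped = adapter.query(question, top_k=top_k, agent_id=agent_id)
    if len(scoped) >= min_results:
        return scoped
    # Too few scoped hits: a second, unscoped pass can surface facts
    # that were planted by a different agent.
    unscoped = adapter.query(question, top_k=top_k)
    merged = {n["id"]: n for n in [*scoped, *unscoped]}
    return list(merged.values())
```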
The Architectural Insight
MemoryStress was designed to expose a specific failure mode: what happens when your memory architecture can't scale beyond a fixed context window.
Systems like Mastra's Observational Memory pack all memories into a ~70k token context. At session 100, that might be fine. At session 500, you're compressing aggressively. At session 1,000, you're throwing information away. The degradation isn't graceful - it's a cliff.
OMEGA's persistent vector store doesn't have this failure mode. Nothing is evicted. The trade-off is that retrieval gets harder as the store grows - which is exactly what the Phase 3 dip shows. But harder-to-find is fundamentally different from gone. You can improve retrieval. You can't recover evicted memories.
Cost Transparency
The full benchmark run costs $4.06 using GPT-4o for generation, answering, and grading. That's 4¢ per correct answer.
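That per-answer figure follows directly from the headline numbers (300 questions at 32.7% accuracy):

$$
300 \times 0.327 \approx 98 \text{ correct}, \qquad \frac{\$4.06}{98} \approx \$0.041
$$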
Critically, the cost scales linearly with sessions, not quadratically. Compression-based architectures that regenerate their entire context block on every cycle face quadratic cost growth as the memory store expands. OMEGA's retrieval-based approach stays constant per query regardless of store size.
Run It on Your System
MemoryStress is open source. You can generate the dataset, write an adapter for your memory system, and see your own degradation curve. Here's how:
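The core of the work is the adapter. Here's a minimal sketch, assuming the harness hands store() raw session text and calls query() with natural-language questions - store() and query() are the two methods the harness requires, but the exact signatures, the class name, and the toy keyword scorer below are illustrative, not the harness's actual contract:

```python
from dataclasses import dataclass, field

@dataclass
class MyMemoryAdapter:
    """Illustrative MemoryStress adapter wrapping an in-memory keyword store."""
    notes: dict[str, str] = field(default_factory=dict)

    def store(self, note_id: str, text: str) -> None:
        """Ingest one session (or extracted fact) into the memory system."""
        self.notes[note_id] = text

    def query(self, question: str, top_k: int = 10) -> list[dict]:
        """Return the top_k most relevant notes for a question."""
        terms = set(question.lower().split())
        scored = [
            {"id": nid, "text": text,
             "score": len(terms & set(text.lower().split()))}
            for nid, text in self.notes.items()
        ]
        scored.sort(key=lambda n: n["score"], reverse=True)
        return scored[:top_k]
```

In a real adapter, store() would write into your memory system and query() would run your retrieval pipeline; the in-memory keyword match is only there so the sketch runs on its own.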
The harness outputs a degradation curve, per-type breakdown, and full metrics. Write an adapter that implements store() and query() for your system and you're done.
- Jason Sosa, builder of OMEGA
MemoryStress is part of the OMEGA project. Apache 2.0 licensed.