Does an agent's memory get worse as it remembers more?
An agent that remembers keeps adding to its store forever. Each turn it retrieves the few memories nearest a query and answers from them. The natural worry is that the pile itself becomes the problem: as more memories accumulate, do the right ones still come back, or do they start to interfere with each other? This is a controlled measurement of that, run on a live Walrus-backed memory as it grows from 5 to 50 memories, and, in a follow-up run, to 200 (see the update at the end).
Problem
Retrieval-backed memory stores each fact as a vector and, on a query, returns the closest ones by distance. It is the standard way to give a language model long-term memory, and it has an obvious appeal: the model stays fixed, and what it "knows" about you is just whatever sits in the store. But the store only grows. A user who talks to an agent for a season leaves behind hundreds of opinions, many about the same handful of teams and players. The question this raises is whether retrieval stays reliable under that growth. Two distinct things could go wrong. The target memory could drift, coming back at a larger distance as the space fills. Or, even if the target's distance holds, other memories could crowd in around the query and out-rank it or ride along with it, so the agent recalls the wrong fact or mixes in noise. The first is a property of the embedding; the second is interference, and it is the one that actually bites a working agent. A pilot run on a small store showed recall is sharp when there is little to confuse it: restatements and paraphrases of a stored fact came back at distance 0.27 to 0.42 while unrelated topics sat past 0.83, with no overlap. The open question is whether that clean picture survives a crowded memory.
Hypotheses
Three predictions, chosen so the result can distinguish between them:
- H1, distance is stable. A target memory's distance to a matching query does not depend on how many other memories exist, because the embedding of a sentence is absolute, not relative to the store. Target distance should stay roughly flat as the store grows.
- H2, rank and margin degrade. Even with stable distance, the target's rank and the margin between it and the nearest competing memory shrink as the store grows, because more distractors fall near the query. Weaker queries (paraphrase, related) should degrade before exact restatements.
- H3, the threshold holds but the load grows. Because distance is stable (H1), a fixed distance cut that separated relevant from irrelevant at small N keeps doing so at large N. But the number of memories passing that cut grows, so a working agent cannot lean on the threshold alone; it also has to limit how many results it takes.
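The difference between H1 and H2 can be shown with a toy store. The sketch below is illustrative only: the three-dimensional vectors are invented, cosine distance is an assumed metric (the backend's actual metric is not stated here), and `needle_stats` is a hypothetical helper, not part of the study's runner.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def needle_stats(store, query, needle_id):
    """Return (needle distance, margin to the nearest other memory)."""
    distances = {mid: cosine_distance(vec, query) for mid, vec in store.items()}
    needle_d = distances[needle_id]
    others = [d for mid, d in distances.items() if mid != needle_id]
    return needle_d, min(others) - needle_d

query = [1.0, 0.0, 0.0]

# Small store: the needle plus one far distractor (toy vectors).
small = {"needle": [0.9, 0.3, 0.0], "far": [0.0, 0.2, 0.9]}
# Same store plus one near-distractor on the needle's topic.
large = dict(small, near=[0.8, 0.5, 0.1])

d_small, margin_small = needle_stats(small, query, "needle")
d_large, margin_large = needle_stats(large, query, "needle")

print(d_small == d_large)           # the needle's distance ignores the store
print(margin_large < margin_small)  # but the margin shrinks
```

The needle's distance is computed from the needle and the query alone, so adding a memory cannot change it (H1); the margin depends on the nearest neighbour, so it can only stay the same or shrink as memories are added (H2).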
Method
The study runs against the live agent's real memory on Sui mainnet, writing through the same Walrus Memory SDK the product uses, so the numbers are from production conditions rather than a local mock.
- Needles. Five facts on distinct topics (France to win, Mbappé top scorer, England overrated, Germany distrusted, Messi decisive) are written first and never change. They are the memories I test recall for.
- Growth. Around them I add distractor facts in batches, taking the store to 5, 15, 30, and 50 memories. The distractors are deliberately mixed: some are near a needle (other opinions about the same team or player, the kind that genuinely competes in the embedding space) and some are far (other teams, or off-topic notes about cooking and travel). Later batches add more near-distractors, so interference, if it exists, has a fair chance to appear.
- Probes. At each store size, each needle is queried three ways: exact (restates the fact), paraphrase (same meaning, new words), and related (same topic, different question). That is 15 probes per checkpoint, 60 in all.
- Measures. For each probe I record the needle's rank in the results, its distance, the distance of the nearest competing memory, the margin between them, and how many memories of any kind fall under a 0.70 distance cut. Rank and margin test interference (H2); distance tests drift (H1); the under-threshold count tests load (H3).
- Conditions. A 25 second settle follows each batch of writes before probing, because a mainnet write can lag before it is queryable. The runner and raw per-probe data are open as research/recall-at-scale.mjs and research/scale.jsonl.
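The per-probe measures described above can be sketched as follows. This is a minimal illustration, not the code in research/recall-at-scale.mjs: `ProbeMeasure` and `measure_probe` are hypothetical names, the input is assumed to be a list of `(memory_id, distance)` pairs, and the margin's sign convention (nearest competitor minus needle, so negative when the needle is out-ranked) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProbeMeasure:
    rank: Optional[int]             # 1-based position of the needle, None if absent
    distance: Optional[float]       # needle's distance to the query
    nearest_other: Optional[float]  # distance of the closest competing memory
    margin: Optional[float]         # nearest_other - distance (assumed convention)
    under_cut: int                  # memories of any kind under the distance cut

def measure_probe(results, needle_id, cut=0.70):
    """Compute the study's measures for one probe.

    results: list of (memory_id, distance) pairs, in any order.
    """
    ordered = sorted(results, key=lambda r: r[1])
    under_cut = sum(1 for _, d in ordered if d < cut)
    ids = [mid for mid, _ in ordered]
    if needle_id not in ids:
        return ProbeMeasure(None, None, None, None, under_cut)
    rank = ids.index(needle_id) + 1
    distance = ordered[rank - 1][1]
    # `ordered` is sorted, so the first non-needle entry is the nearest competitor.
    others = [d for mid, d in ordered if mid != needle_id]
    nearest_other = others[0] if others else None
    margin = nearest_other - distance if others else None
    return ProbeMeasure(rank, distance, nearest_other, margin, under_cut)

# Invented example results for one exact query.
probe = measure_probe(
    [("pasta", 0.91), ("france-win", 0.27), ("france-squad", 0.52)],
    needle_id="france-win",
)
print(probe)
```

Rank and margin come from the ordering, distance from the needle alone, and the under-cut count from every result regardless of which memory it is, which is why the three can move independently.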
Results
Recall stayed perfect for exact and paraphrase queries the whole way and broke down only for the vaguest queries as the store filled. The needle's own distance never moved; what changed was everything around it. One table per query type, tracking the store from 5 to 50 memories.
Exact queries (restating the fact)
| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.271 | 0.34 | 2.4 |
| 15 | 100% | 0.271 | 0.34 | 3.0 |
| 30 | 100% | 0.271 | 0.25 | 3.8 |
| 50 | 100% | 0.271 | 0.22 | 9.4 |
Paraphrase queries (same meaning, new words)
| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.430 | 0.16 | 3.0 |
| 15 | 100% | 0.430 | 0.16 | 3.6 |
| 30 | 100% | 0.430 | 0.15 | 4.6 |
| 50 | 100% | 0.430 | 0.12 | 11.0 |
Related queries (same topic, different question)
| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.466 | 0.16 | 3.2 |
| 15 | 100% | 0.466 | 0.16 | 4.2 |
| 30 | 80% | 0.466 | 0.12 | 5.0 |
| 50 | 60% | 0.466 | 0.04 | 12.2 |
"Avg memories under 0.70" is the count of stored memories of any kind that fall within 0.70 of the query, the pool a threshold would hand back. One honest note on the x-axis: mainnet writes lag, so the store that was actually queryable grew a little behind the nominal N. The under-0.70 column is the more faithful measure of how crowded the neighbourhood really was, and it is the one to watch.
Analysis
H1, distance is stable: confirmed, exactly. The average distance from a needle to its query is identical at every store size, 0.271 for exact, 0.430 for paraphrase, 0.466 for related, unchanged from 5 memories to 50. A memory does not move or fade as others pile up around it. Its position is a property of the text, not of the neighbourhood.
H2, rank and margin degrade: confirmed, and it hits the weakest queries first. The margin between a needle and its nearest competitor held through N=15 and then shrank as the store grew, proportionally most for related queries, from 0.16 down to 0.04. Rank held perfectly for exact and paraphrase queries, which stayed rank 1 at every size, but related queries slipped: 100 percent rank-1 through N=15, then 80 percent at N=30, then 60 percent at N=50. The failures were exactly the predicted kind. By the last checkpoint a vague "how strong is the France squad" was edged off the top by another France opinion, and "how is Germany looking" fell to fourth behind a cluster of Germany notes. Specific restatements stay safe; loose, topical questions are where a crowded memory starts handing back the wrong neighbour.
H3, the threshold holds but the load grows: confirmed. Because the needle's distance never drifts (H1), a 0.70 cut never once dropped a real memory. But the number of memories passing that same cut roughly quadrupled, from an average of 2 to 3 per query at five memories to 9 to 12 at fifty. A threshold that returned a tight handful early returns a dozen candidates late. The cut still tells you what is plausibly relevant; it no longer tells you what is most relevant.
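The "roughly quadrupled" figure is plain arithmetic on the under-0.70 column of the three tables. A short check, using only the N=5 and N=50 averages reported above (the variable names are illustrative):

```python
# (average under-0.70 count at N=5, at N=50), copied from the result tables.
under_cut = {
    "exact": (2.4, 9.4),
    "paraphrase": (3.0, 11.0),
    "related": (3.2, 12.2),
}

# Growth factor of the thresholded pool for each query type.
growth = {kind: late / early for kind, (early, late) in under_cut.items()}

for kind, factor in growth.items():
    print(f"{kind}: x{factor:.2f}")
```

All three factors fall between 3.5 and 4, so the pool a fixed cut hands back grew by about the same proportion for every query type.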
Limitations
- One embedding space and one memory backend; the absolute distances are specific to it, though the direction of any effect should generalise.
- A single run to 50 memories with five needles, not a multi-seed benchmark with error bars. Real stores reach thousands; this measures the onset of interference, not its asymptote.
- Synthetic, single-user facts written by the author, and the near/far and exact/paraphrase/related labels are my own judgement.
- Mainnet writes occasionally hit the 60 second cap and land late; a settle window mitigates but does not erase this.
Conclusion
Memories do not rot, but the space around them gets crowded, and that crowding, not any drift, is what degrades recall. The practical shape of it: an agent can trust an exact or near-exact match almost indefinitely, but as the store grows it should lean less on a distance threshold and more on taking only the top one or two results, because the threshold's pool keeps growing while the true answer stays put. For the vaguest queries, where even rank starts to fail, the honest fixes are the familiar ones from any retrieval system: a tighter cut, a small top-k, and eventually summarising or pruning old memories so fifty competing opinions about France collapse into one.
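A minimal sketch of that retrieval policy, a distance cut followed by a small top-k, is below. `select_memories` is a hypothetical helper, the input is assumed to be `(memory_id, distance)` pairs, and the defaults (0.70 and 2) simply echo the numbers discussed in this study rather than any tuned setting.

```python
def select_memories(results, cut=0.70, top_k=2):
    """Keep results under the distance cut, then only the top_k nearest."""
    passing = [r for r in results if r[1] < cut]
    passing.sort(key=lambda r: r[1])
    return passing[:top_k]

# Invented crowded neighbourhood: twelve France notes, the needle, one off-topic memory.
crowded = [(f"france-note-{i}", 0.50 + 0.01 * i) for i in range(12)]
crowded.append(("france-win", 0.27))
crowded.append(("pasta", 0.91))

chosen = select_memories(crowded)
print(chosen)
```

The cut removes what is clearly irrelevant and the top-k bounds what reaches the model, so the size of the answer set no longer scales with the store.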
The encouraging part is that the failure is gradual and predictable rather than a cliff. At fifty memories the agent is still right for anything the user states plainly; interference shows up first exactly where you would expect, on the loosest questions, and it announces itself in a shrinking margin long before it changes an answer. That margin is a cheap thing to watch, and it is probably the right signal for deciding when a user's memory has grown enough to be worth pruning.
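One way to watch that margin in a running agent is sketched below. It rests on assumptions: at query time the agent does not know which result is the "needle", so the gap between the top two results is used as a proxy, and the `floor` and `window` values are placeholders, not thresholds measured in this study.

```python
from statistics import mean

def top_margin(results):
    """Gap between the nearest and second-nearest result, or None if fewer than two."""
    distances = sorted(d for _, d in results)
    if len(distances) < 2:
        return None
    return distances[1] - distances[0]

def should_prune(margins, floor=0.05, window=20):
    """Flag a store for pruning when the recent average margin drops below floor."""
    recent = [m for m in margins if m is not None][-window:]
    if not recent:
        return False
    return mean(recent) < floor

# Invented margins from a roomy store and from a crowded one.
print(should_prune([0.16, 0.16, 0.12]))  # margins still comfortable
print(should_prune([0.04, 0.03, 0.05]))  # margins collapsing
```

Averaging over a window keeps a single vague query from triggering a prune on its own.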
Update, June 11, 2026: the same question at 200 memories
The original run stopped at 50 memories, and the obvious objection is that a real store keeps going. So I ran a second study on the same live mainnet memory, four times the size and shaped like real usage instead of controlled probes. One synthetic user accumulates 200 memories over a simulated month: predictions, settled results, changes of mind, personal details, all written sequentially through the production write path. All 200 writes confirmed with mainnet blob ids. Six distinctive "needle" facts were planted among the first 20 memories, things like a venue superstition and a dinner bet on Morocco. At checkpoints of 25, 50, 100, 150, and 200 I probed for each needle with the kind of loose question a user would actually ask weeks later ("is there any venue I avoid predicting?"), which makes every probe a related query, the hardest class from the tables above. Each checkpoint also put one of those questions to the full agent, to test whether the reply still builds on the day-one fact rather than just retrieving it. Runner: research/longhorizon.mjs; raw data: research/lh-checkpoints.json.
| store size N | needles recalled (of 6) | needles at rank 1 | agent reply built on the day-one fact |
|---|---|---|---|
| 25 | 6 | 4 | yes |
| 50 | 6 | 4 | yes |
| 100 | 5 | 4 | yes |
| 150 | 5 | 4 | yes |
| 200 | 5 | 4 | yes |
The picture from the first run holds at four times the size, and it stays specific. Every needle tied to a distinctive fact (the Vancouver rule, the Ronaldinho memory, the Japan habit, the goldfish promise) was still rank 1 at 200 memories, untouched by roughly 190 newer entries. The single recurring miss was the needle whose probe is genuinely ambiguous: "what do I have riding on Morocco" has to beat a store that by then holds many other Morocco opinions, and from 100 memories on it fell out of the top results. That is the related-query interference from the original tables, arriving on schedule and nowhere else. The behavioural probe never failed: at every checkpoint, including 200, the agent's actual reply was built on the day-one fact, and in conversation, where recall over-fetches and curates before answering, even the Morocco bet still came back at 200 deep.
The profile from this run is live, so the result can be checked by hand instead of taken on faith: connect on the try page as research-lh-fdd822 and ask it the probe questions yourself.