Firdan Rifaldi

Does an agent's memory get worse as it remembers more?

AI memory Retrieval Interference June 9, 2026 Updated June 11, 2026

An agent that remembers keeps adding to its store forever. Each turn it retrieves the few memories nearest a query and answers from them. The natural worry is that the pile itself becomes the problem: as more memories accumulate, do the right ones still come back, or do they start to interfere with each other? This is a controlled measurement of that, run on a live Walrus-backed memory as it grows from 5 to 50 memories, and, in a follow-up run, to 200 (see the update at the end).

Problem

Retrieval-backed memory stores each fact as a vector and, on a query, returns the closest ones by distance. It is the standard way to give a language model long-term memory, and it has an obvious appeal: the model stays fixed, and what it "knows" about you is just whatever sits in the store. But the store only grows. A user who talks to an agent for a season leaves behind hundreds of opinions, many about the same handful of teams and players. The question this raises is whether retrieval stays reliable under that growth. Two distinct things could go wrong. The target memory could drift, coming back at a larger distance as the space fills. Or, even if the target's distance holds, other memories could crowd in around the query and out-rank it or ride along with it, so the agent recalls the wrong fact or mixes in noise. The first is a property of the embedding; the second is interference, and it is the one that actually bites a working agent. A pilot run on a small store showed recall is sharp when there is little to confuse it: restatements and paraphrases of a stored fact came back at distance 0.27 to 0.42 while unrelated topics sat past 0.83, with no overlap. The open question is whether that clean picture survives a crowded memory.
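
As a rough illustration of the mechanism described above, here is a minimal sketch of distance-based recall. It assumes cosine distance and a plain in-memory array; the post does not say which metric or index the real store uses, and `cosineDistance`, `retrieve`, and the three-dimensional toy vectors are all hypothetical.

```javascript
// Cosine distance between two equal-length vectors:
// 0 means same direction, 1 means orthogonal. (Assumed metric, not confirmed by the post.)
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored memories nearest to the query, closest first.
function retrieve(store, queryVector, k) {
  return store
    .map((m) => ({ text: m.text, distance: cosineDistance(m.vector, queryVector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Toy vectors for illustration only; real embeddings have hundreds of dimensions.
const store = [
  { text: "France will win the group", vector: [0.9, 0.1, 0.0] },
  { text: "France's midfield is thin", vector: [0.8, 0.3, 0.1] },
  { text: "I like spicy noodles", vector: [0.0, 0.1, 0.9] },
];

const hits = retrieve(store, [1.0, 0.1, 0.0], 2);
```

Both France memories come back ahead of the unrelated one, which is the point of the worry: the second France opinion is already a near neighbour of the first.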

Hypotheses

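Three predictions, chosen so the result can distinguish between them:

H1, distance is stable: the distance from a needle to its query stays the same as the store grows.
H2, rank and margin degrade: the margin between a needle and its nearest competitor shrinks, and the needle eventually loses rank 1.
H3, the threshold holds but the load explodes: a fixed 0.70 cut keeps returning the needle, but the number of memories passing it grows.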

Method

The study runs against the live agent's real memory on Sui mainnet, writing through the same Walrus Memory SDK the product uses, so the numbers are from production conditions rather than a local mock.

Results

Recall stayed perfect for exact and paraphrase queries the whole way and broke down only for the vaguest queries as the store filled. The needle's own distance never moved; what changed was everything around it. One table per query type, tracking the store from 5 to 50 memories.

Exact queries (restating the fact)

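| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.271 | 0.34 | 2.4 |
| 15 | 100% | 0.271 | 0.34 | 3.0 |
| 30 | 100% | 0.271 | 0.25 | 3.8 |
| 50 | 100% | 0.271 | 0.22 | 9.4 |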

Paraphrase queries (same meaning, new words)

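| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.430 | 0.16 | 3.0 |
| 15 | 100% | 0.430 | 0.16 | 3.6 |
| 30 | 100% | 0.430 | 0.15 | 4.6 |
| 50 | 100% | 0.430 | 0.12 | 11.0 |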

Related queries (same topic, different question)

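| store size N | needle is rank 1 | avg distance | avg margin to nearest other | avg memories under 0.70 |
|---|---|---|---|---|
| 5 | 100% | 0.466 | 0.16 | 3.2 |
| 15 | 100% | 0.466 | 0.16 | 4.2 |
| 30 | 80% | 0.466 | 0.12 | 5.0 |
| 50 | 60% | 0.466 | 0.04 | 12.2 |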

"Avg memories under 0.70" is the count of stored memories of any kind that fall within 0.70 of the query, the pool a threshold would hand back. One honest note on the x-axis: mainnet writes lag, so the store that was actually queryable grew a little behind the nominal N. The under-0.70 column is the more faithful measure of how crowded the neighbourhood really was, and it is the one to watch.

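The columns in these tables can be computed per probe along the following lines. This is a sketch, not the study's actual runner: `probeMetrics` and the `{ id, distance }` result shape are hypothetical, and the margin is assumed to be signed (negative when another memory out-ranks the needle), which the post does not specify.

```javascript
// Summarise one probe: the needle's rank, its distance, its margin to the
// nearest other memory, and how many memories fall under the cut.
// `results` holds one { id, distance } entry per stored memory (assumed shape).
function probeMetrics(results, needleId, cut = 0.70) {
  const sorted = [...results].sort((a, b) => a.distance - b.distance);
  const needleIndex = sorted.findIndex((r) => r.id === needleId);
  if (needleIndex === -1) throw new Error("needle not in results");

  const needle = sorted[needleIndex];
  const others = sorted.filter((r) => r.id !== needleId);

  // Assumed definition: nearest other distance minus needle distance.
  // A negative margin means a competitor sits closer to the query than the needle.
  const margin = others.length > 0 ? others[0].distance - needle.distance : Infinity;

  return {
    rank: needleIndex + 1,
    needleDistance: needle.distance,
    margin,
    underCut: sorted.filter((r) => r.distance < cut).length, // includes the needle itself
  };
}

// Hypothetical distances for a single related-style probe.
const example = probeMetrics(
  [
    { id: "needle", distance: 0.466 },
    { id: "other-france-1", distance: 0.5 },
    { id: "other-france-2", distance: 0.62 },
    { id: "noodles", distance: 0.85 },
  ],
  "needle"
);
```

Averaging these per-probe values over all needles of one query type at one store size would give one table row.
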
Analysis

H1, distance is stable: confirmed, exactly. The average distance from a needle to its query is identical at every store size, 0.271 for exact, 0.430 for paraphrase, 0.466 for related, unchanged from 5 memories to 50. A memory does not move or fade as others pile up around it. Its position is a property of the text, not of the neighbourhood.

H2, rank and margin degrade: confirmed, and it hits the weakest queries first. The margin between a needle and its nearest competitor shrank steadily as the store grew, most for related queries, from 0.16 down to 0.04. Rank held perfectly for exact and paraphrase queries, which stayed rank 1 at every size, but related queries slipped: 100 percent rank-1 through N=15, then 80 percent at N=30, then 60 percent at N=50. The failures were exactly the predicted kind. By the last checkpoint a vague "how strong is the France squad" was edged off the top by another France opinion, and "how is Germany looking" fell to fourth behind a cluster of Germany notes. Specific restatements stay safe; loose, topical questions are where a crowded memory starts handing back the wrong neighbour.
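
The combination of H1 and H2 is easy to reproduce in a toy setting. The sketch below uses two-dimensional vectors and Euclidean distance purely for readability (both are assumptions, not the production setup), and `needleStats` is a hypothetical helper. Adding one same-topic memory leaves the needle's distance untouched but pushes it off rank 1.

```javascript
// Euclidean distance, chosen here only because it is easy to check by hand.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Rank and distance of one memory for a given query.
function needleStats(store, query, needleId) {
  const ranked = store
    .map((m) => ({ id: m.id, distance: euclidean(m.vector, query) }))
    .sort((x, y) => x.distance - y.distance);
  const index = ranked.findIndex((r) => r.id === needleId);
  return { rank: index + 1, distance: ranked[index].distance };
}

const query = [0.8, 0.2];

const small = [
  { id: "needle", vector: [1.0, 0.0] },
  { id: "unrelated", vector: [0.0, 1.0] },
];

// Same store plus one newer memory on the same topic, closer to the query.
const crowded = [...small, { id: "same-topic", vector: [0.8, 0.3] }];

const before = needleStats(small, query, "needle");
const after = needleStats(crowded, query, "needle");
// before.distance and after.distance are identical; only the rank differs.
```

The needle's distance depends only on the needle and the query, so nothing added to the store can change it. Rank depends on everything else in the store.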

H3, the threshold holds but the load explodes: confirmed. Because the needle's distance never drifts (H1), a 0.70 cut never once dropped a real memory. But the number of memories passing that same cut roughly quadrupled, from an average of between 2 and 3 at five memories to between 9 and 12 at fifty. A threshold that returned a tight handful early returns a dozen candidates late. The cut still tells you what is plausibly relevant; it no longer tells you what is most relevant.
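
The growth in the threshold's pool can be checked directly from the "avg memories under 0.70" column. The values below are copied from the three tables; `growthRatios` is just a hypothetical helper for the arithmetic.

```javascript
// "avg memories under 0.70" at N = 5 and N = 50, taken from the results tables.
const underCut = {
  exact: { 5: 2.4, 50: 9.4 },
  paraphrase: { 5: 3.0, 50: 11.0 },
  related: { 5: 3.2, 50: 12.2 },
};

// Ratio of the pool size at N = 50 to the pool size at N = 5, to one decimal.
function growthRatios(table) {
  const out = {};
  for (const [queryType, row] of Object.entries(table)) {
    out[queryType] = Number((row[50] / row[5]).toFixed(1));
  }
  return out;
}

const ratios = growthRatios(underCut);
// ratios: { exact: 3.9, paraphrase: 3.7, related: 3.8 }
```

All three query types land just under a fourfold increase, while the store itself grew tenfold.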

Limitations

Conclusion

Memories do not rot, but the space around them gets crowded, and that crowding, not any drift, is what degrades recall. The practical shape of it: an agent can trust an exact or near-exact match almost indefinitely, but as the store grows it should lean less on a distance threshold and more on taking only the top one or two results, because the threshold's pool keeps growing while the true answer stays put. For the vaguest queries, where even rank starts to fail, the honest fixes are the familiar ones from any retrieval system: a tighter cut, a small top-k, and eventually summarising or pruning old memories so fifty competing opinions about France collapse into one.
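
One way to express that policy is to apply the cut first and then keep only the closest few. This is a sketch under assumptions: `selectForAnswer` is a hypothetical name, the `topK` default of 2 follows the "top one or two" suggestion above, and the candidate distances are made up.

```javascript
// Keep only candidates inside the cut, then take the closest few.
// `cut` mirrors the 0.70 threshold in the tables; `topK` = 2 is an assumed default.
function selectForAnswer(results, { cut = 0.70, topK = 2 } = {}) {
  return results
    .filter((r) => r.distance < cut) // filter() copies, so the input is not reordered
    .sort((a, b) => a.distance - b.distance)
    .slice(0, topK);
}

// Twelve hypothetical candidates, all under the cut, listed farthest first.
const candidates = Array.from({ length: 12 }, (_, i) => ({
  id: `m${i}`,
  distance: 0.3 + i * 0.03,
})).reverse();

const picked = selectForAnswer(candidates);
// picked holds the two closest candidates, m0 and m1, however large the pool gets.
```

The threshold still acts as a floor on relevance, but the amount of context handed to the model stays fixed as the store grows.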

The encouraging part is that the failure is gradual and predictable rather than a cliff. At fifty memories the agent is still right for anything the user states plainly; interference shows up first exactly where you would expect, on the loosest questions, and it announces itself in a shrinking margin long before it changes an answer. That margin is a cheap thing to watch, and it is probably the right signal for deciding when a user's memory has grown enough to be worth pruning.
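
A margin watch could look something like the sketch below. In production the true needle is unknown, so this assumes the gap between the rank-1 and rank-2 results is used as a stand-in; `marginIsThin`, the window size, and the 0.05 floor are all hypothetical and would need tuning against real traffic.

```javascript
// Flag a store as a pruning candidate when the recent average margin is thin.
// `recentMargins` is a chronological list of per-query margins (assumed input).
function marginIsThin(recentMargins, { window = 3, minMargin = 0.05 } = {}) {
  const tail = recentMargins.slice(-window);
  if (tail.length === 0) return false; // no evidence yet, do not flag
  const mean = tail.reduce((sum, m) => sum + m, 0) / tail.length;
  return mean < minMargin;
}

// Related-query margins from the results table at N = 5, 15, 30, 50.
const relatedMargins = [0.16, 0.16, 0.12, 0.04];

const flaggedOnLatest = marginIsThin(relatedMargins, { window: 1 });
const flaggedOnAverage = marginIsThin(relatedMargins, { window: 3 });
```

With a window of one the N=50 checkpoint trips the flag; averaged over the last three checkpoints it does not yet. Which of the two is preferable depends on how early a warning is wanted.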

Update, June 11, 2026: the same question at 200 memories

The original run stopped at 50 memories, and the obvious objection is that a real store keeps going. So I ran a second study on the same live mainnet memory, four times the size and shaped like real usage instead of controlled probes. One synthetic user accumulates 200 memories over a simulated month: predictions, settled results, changes of mind, personal details, all written sequentially through the production write path. All 200 writes confirmed with mainnet blob ids. Six distinctive "needle" facts were planted among the first 20 memories, things like a venue superstition and a dinner bet on Morocco. At checkpoints of 25, 50, 100, 150, and 200 I probed for each needle with the kind of loose question a user would actually ask weeks later ("is there any venue I avoid predicting?"), which makes every probe a related query, the hardest class from the tables above. Each checkpoint also put one of those questions to the full agent, to test whether the reply still builds on the day-one fact rather than just retrieving it. Runner: research/longhorizon.mjs; raw data: research/lh-checkpoints.json.
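
The per-checkpoint tally can be sketched as follows. The schema of research/lh-checkpoints.json is not shown in this post, so the input shape, the needle names, the individual ranks, and the `topK` of 5 that defines "recalled" are all assumptions for illustration.

```javascript
// Tally one checkpoint. `needleRanks` maps a needle name to the rank at which
// it came back, or null if it did not come back at all (assumed shape).
function summariseCheckpoint(needleRanks, topK = 5) {
  const ranks = Object.values(needleRanks);
  return {
    total: ranks.length,
    recalled: ranks.filter((r) => r !== null && r <= topK).length,
    atRank1: ranks.filter((r) => r === 1).length,
  };
}

// Hypothetical ranks, chosen only to match the shape of a late checkpoint.
const summary = summariseCheckpoint({
  "venue-rule": 1,
  "ronaldinho-memory": 1,
  "japan-habit": 1,
  "goldfish-promise": 1,
  "sixth-needle": 3,
  "morocco-bet": null,
});
// summary: { total: 6, recalled: 5, atRank1: 4 }
```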

| store size N | needles recalled (of 6) | needles at rank 1 | agent reply built on the day-one fact |
|---|---|---|---|
| 25 | 6 | 4 | yes |
| 50 | 6 | 4 | yes |
| 100 | 5 | 4 | yes |
| 150 | 5 | 4 | yes |
| 200 | 5 | 4 | yes |

The picture from the first run holds at four times the size, and it stays specific. Every needle tied to a distinctive fact (the Vancouver rule, the Ronaldinho memory, the Japan habit, the goldfish promise) was still rank 1 at 200 memories, untouched by roughly 190 newer entries. The single recurring miss was the needle whose probe is genuinely ambiguous: "what do I have riding on Morocco" has to beat a store that by then holds many other Morocco opinions, and from 100 memories on it fell out of the top results. That is the related-query interference from the original tables, arriving on schedule and nowhere else. The behavioural probe never failed: at every checkpoint, including 200, the agent's actual reply was built on the day-one fact, and in conversation, where recall over-fetches and curates before answering, even the Morocco bet still came back at 200 deep.

The profile from this run is live, so the result can be checked by hand instead of taken on faith: connect on the try page as research-lh-fdd822 and ask it the probe questions yourself.
