Honest evals and provenance for AI-agent memory

Project summary

Most of what gets sold as AI memory is vector search over old chat logs. It does not track where a fact came from, cannot check it, and cannot forget. Agents therefore quietly accumulate wrong facts and repeat them, and once agents read each other's output, a wrong fact can spread across a whole population. This project builds two things in the open: a reproducible test of whether a memory system propagates false or unverifiable claims, run across several systems and not only mine, and a small provenance and verification spec so that entries in an agent's memory can be audited, corrected, and forgotten. It grows out of my published research on geometric memory (SLoD, arXiv:2603.08965) and a live, open memory engine.

What are this project's goals? How will you achieve them?

The goal is an open, reusable way to measure whether agent memory systems spread false claims, plus a provenance spec the field can adopt. To get there, I will build a testbed that seeds a controlled false memory into a system and measures whether it propagates or is caught, with pinned, reproducible runs and multi-judge reporting so that results do not depend on a single judge. I will then add provenance scoring, verification status, and feedback decay, and measure how much they contain propagation at fixed utility. The testbed, the protocol, and the spec will be published open-source. Falsifiable prediction: provenance gating reduces propagation below an untreated baseline without hurting task performance.
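To make the seed-and-measure idea concrete, here is a toy sketch, not the project's testbed. It assumes a ring of agents that copy memory items from one neighbour each round, and it models provenance gating as "only copy verified items". The names `MemoryItem`, `run_propagation`, and the fields `source`, `verified`, and `seeded_false` are hypothetical and chosen for illustration only. It does not measure utility and involves no LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical memory record; field names are illustrative assumptions."""
    claim: str
    source: str                 # where the claim came from
    verified: bool              # verification status
    seeded_false: bool = False  # marks the controlled false memory


def run_propagation(n_agents: int, rounds: int, gate: bool) -> float:
    """Return the fraction of agents holding the seeded false claim.

    Toy model: agents sit in a ring and, each round, copy items from the
    previous agent. With gate=True, unverified items are never copied.
    """
    true_item = MemoryItem("water boils at 100 C at sea level", "textbook", True)
    false_item = MemoryItem("the bridge opened in 1850", "unknown", False,
                            seeded_false=True)

    memories = [{true_item} for _ in range(n_agents)]
    memories[0].add(false_item)  # seed the false memory into one agent

    for _ in range(rounds):
        # Snapshot so every agent reads the state from the start of the round.
        snapshot = [set(m) for m in memories]
        for i in range(n_agents):
            neighbour = snapshot[(i - 1) % n_agents]
            for item in neighbour:
                if gate and not item.verified:
                    continue  # provenance gate: skip unverified claims
                memories[i].add(item)

    holders = sum(any(it.seeded_false for it in m) for m in memories)
    return holders / n_agents


if __name__ == "__main__":
    print("untreated baseline:", run_propagation(5, 4, gate=False))
    print("provenance-gated:  ", run_propagation(5, 4, gate=True))
```

In this toy setup the untreated baseline reaches every agent (1.0) after four rounds, while the gated run stays at the single seeded agent (0.2). The real question the testbed asks is whether a gap like this survives with actual memory systems and without a loss in task performance.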

How will this funding be used?

Minimum $15k: LLM API credits for the evaluation runs, compute for the graph and multi-agent experiments, and the open-source release of the eval and the provenance spec. This is enough to get a real, public result. Full $50k (and up to about $80k): the same across more memory systems, a part-time engineer so the testbed is reusable by others rather than just my scripts, a larger multi-agent run, and proper documentation. Everything ships open either way.

Who is on your team? What's your track record on similar projects?

Eduard Izgorodin: software engineer with twenty years of experience on production systems and formal training in mathematics; author of SLoD (arXiv:2603.08965); built the memory engine and the benchmark pipeline. Track record on exactly this kind of work: I found and removed oracle leaks in my own LoCoMo pipeline and published the lower, true numbers (about 65% retrieval recall out of the box, judge-free) instead of the flattering ones. I also showed that the LLM judge alone swings the score on the same answers by about 40 points, verified on a competitor's own answers. Olga Timoshina runs operations and coordination; we have worked together for over ten years. Live, mostly unfunded infrastructure: an MCP memory server (MIT, official registry), a Core API, a Python SDK, a 3D visualization, and an OAuth connector working inside Claude.

What are the most likely causes and outcomes if this project fails?

The most likely ways this fails: the geometric and provenance approach may not contain false-memory propagation better than simple baselines; the honest eval may reveal that our own system has real weaknesses; or other researchers may be slow to adopt the testbed. If any of these happens, I will publish the negative result or the limits I find honestly rather than burying them. I have done this before: my Consistency Tax work includes a published negative result where a simple counting and LoRA baseline beat my own method on FEVER. Even a clean negative result advances the field, because right now nobody can measure this at all.

How much money have you raised in the last 12 months, and from where?

Nothing raised or received. The work has been self-funded. I have submitted applications for this and closely related work (Foresight, BlueDot, Emergent Ventures, IMI Fellowship, grantmaking.ai, and the Long-Term Future Fund), but none has been decided yet. I would rather be transparent about all of it.