Building an AI assistant that actually remembers

Memory is the hard part of building a personal AI assistant. The model is mostly a solved problem at this point; what separates a daily-driver assistant from a glorified chat window is whether it remembers what matters and forgets what doesn't. Here's how I think about the architecture.

I've been running my own personal AI assistant long enough to have a working answer to the question I see most in the personal-AI conversation: what actually makes the difference between a useful daily AI and a glorified chat window? The model? The prompt? The integrations?

It's none of those. It's memory.

The model is mostly a solved problem at this point. Local models that fit on my laptop are good enough for the vast majority of what I need, and when they're not, I have a clean fallback path to a hosted one. I've written about this before. The prompt is a tractable problem too; there's enough public knowledge about prompting that you can get to a decent system prompt in an afternoon. Integrations are infrastructure work. Annoying, but bounded.

Memory is the hard part. And after a year of incremental work, I think I have a clear enough picture of the setup to write it down.

What memory actually has to do

Strip away the framing and the assistant has four memory jobs:

  1. Remember the things about me that don't change much. Name, role, the people I work with, the projects I'm running, my preferences about how I work.
  2. Remember the things I tell it that I'm going to want back later. Decisions, references, the offhand "remind me about this" stuff.
  3. Remember the recent conversation well enough that the next message lands in context.
  4. Forget the things that aren't worth keeping, on a sensible timeline, without me having to manage it.

Every memory setup I've seen in personal-AI projects, including mine across several rounds, is some configuration of those four. The interesting design choices are how each one is stored, when each one is retrieved, and how they interact when the assistant is composing a response.

The four layers I actually use

I run four memory layers. They're not novel individually (every layer here exists in some form in the literature) but the combination and the routing between them is where the work lives.

Layer one: structured facts. A flat YAML file with the things about me that are stable enough to write down. The people I work with, the projects, my preferences, my schedule shape, the things I want the assistant to always know without retrieval. This loads into the system prompt on every request. It's maybe 4KB. It's the cheapest, most reliable layer and it gets the most leverage per token of anything in the system.
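
The mechanics here are almost trivially simple. A minimal sketch of the load-and-inject step, assuming PyYAML and a file named `facts.yaml` (the file name and keys are illustrative):

```python
# Minimal sketch: load a small, stable facts file and prepend it to the
# system prompt on every request. File name and keys are illustrative.
import yaml

def build_system_prompt(facts_path: str = "facts.yaml") -> str:
    with open(facts_path) as f:
        facts = yaml.safe_load(f)  # e.g. {"name": ..., "projects": [...], "preferences": {...}}
    lines = ["Stable facts about the user (always in context):"]
    for key, value in facts.items():
        lines.append(f"- {key}: {value}")
    return "\n".join(lines)
```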

Layer two: recent conversation buffer. The last N turns of whatever we're currently talking about, stored verbatim. This is what keeps the assistant coherent inside a session. Once the buffer gets too long, it gets summarized down to a single paragraph and the verbatim turns get rotated out. The summarization is its own prompt and it took me longer than I'd like to admit to get it right.
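
A rough sketch of the buffer mechanics, with `summarize` standing in for whatever model call folds old turns into the running summary (the turn limit is illustrative):

```python
# Rolling buffer: keep the last N turns verbatim; when the buffer overflows,
# fold the oldest turns into a one-paragraph running summary.
from collections import deque

MAX_TURNS = 20  # illustrative; tune to your context budget

class ConversationBuffer:
    def __init__(self, summarize):
        self.summary = ""        # rolling one-paragraph summary of older turns
        self.turns = deque()     # verbatim (role, text) pairs
        self.summarize = summarize

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        if len(self.turns) > MAX_TURNS:
            # Rotate the oldest half of the buffer into the summary.
            old = [self.turns.popleft() for _ in range(MAX_TURNS // 2)]
            self.summary = self.summarize(self.summary, old)

    def context(self) -> str:
        recent = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return (f"Summary of earlier conversation:\n{self.summary}\n\n"
                f"Recent turns:\n{recent}")
```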

Layer three: vector store of past content. Conversations, documents, notes (anything I've fed the assistant or that it's generated with me) get chunked, embedded, and stored. When a new message comes in, the message gets embedded too, similar chunks get pulled back, and the top few get injected into context. This is the "did we talk about this before" layer (the pattern is called retrieval-augmented generation, or RAG, if you want to look it up later). It's the layer that's the most magical when it works and the most embarrassing when it doesn't.
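
The retrieval step itself is short. A sketch, assuming an `embed` function that maps text to a vector (any embedding model will do) and a store of chunks that already carry their vectors:

```python
# Retrieval sketch: embed the incoming message, score stored chunks by
# cosine similarity, and return the top-k texts for injection into context.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, store: list[dict], embed, k: int = 4) -> list[str]:
    q = embed(query)
    scored = [(cosine(q, chunk["vector"]), chunk["text"]) for chunk in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```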

Layer four: long-term structured memory. This is the one most projects skip and the one I think matters most. Explicit facts that the assistant has learned about me over time, written down as discrete propositions, indexed and retrievable. Not the raw conversation; the extracted decisions and preferences and named entities. "Sid prefers Tuesday afternoons for deep work." "The project Sid calls 'helix' is the personal AI assistant." "Sid's daughter is named X." Structured, dated, editable.
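
The shape of an entry matters more than the storage engine. Something like this (field names are illustrative) is enough:

```python
# One long-term memory entry: a discrete, dated, editable proposition,
# not a raw conversation chunk. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MemoryFact:
    text: str                  # "Sid prefers Tuesday afternoons for deep work."
    recorded: date             # when the fact was extracted
    source: str                # the conversation or document it came from
    tags: list[str] = field(default_factory=list)  # e.g. ["preference", "schedule"]
```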

Why four layers, not just a bigger vector store

I tried the just-a-vector-store version first. It's the version most tutorials show you and it's the version most off-the-shelf personal-AI products are doing under the hood. It doesn't work. Or rather, it works in the demo and falls apart in daily use.

The failure mode is specific. Vector retrieval is good at "find me semantically similar content" and bad at "remember the specific thing I told you yesterday." If I tell the assistant on Monday that I'm going to be in Seattle Friday, and on Wednesday I ask "what's the weather where I'll be at the end of the week," the vector store has to surface the Monday turn from a query that doesn't share much surface vocabulary with it. Sometimes it works. Often it doesn't. And the times it doesn't are the times the assistant feels broken in exactly the way that makes you stop using it.

The structured fact layer solves the "stable things about me" problem cleanly because there's no retrieval involved; those facts are always present.

The recent buffer solves the within-session coherence problem because the relevant turns are right there in context.

The vector store does what it's good at: pulling back semantically related content from a deeper history.

And the long-term structured memory catches the things the vector store keeps missing, the discrete, specific, dated facts that need to be retrievable by what they are rather than what they sound like.

The four layers compose. The cost of running all of them is much lower than the cost of trying to get one of them to do everything.

The parts that will bite you inside each layer

A few specific things that took me much longer than expected.

Writing to long-term structured memory is the hardest part of the whole system. The naive version ("after each conversation, extract facts and store them") produces noise faster than signal. Most conversations don't contain stable facts. Most extracted "facts" are wrong, redundant, or already known. I spent months tuning the extraction prompt and the deduplication step before the long-term store became net positive instead of net distracting. The lesson: be very conservative about what gets written. Most things should not be remembered. The bar for "this is a stable fact worth storing" is higher than your intuition will tell you.
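
A sketch of what a conservative write gate can look like, assuming the extractor returns candidate facts with a confidence score and using an `embed` function for near-duplicate checks; the thresholds are illustrative, and the point is that most candidates get rejected:

```python
# Conservative write gate: reject low-confidence candidates and anything
# close to a fact that's already stored. Thresholds are illustrative.
import numpy as np

MIN_CONFIDENCE = 0.9    # only store facts the extractor is very sure about
DUP_SIMILARITY = 0.85   # anything this close to an existing fact is a duplicate

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def should_store(candidate: dict, existing_texts: list[str], embed) -> bool:
    if candidate["confidence"] < MIN_CONFIDENCE:
        return False
    vec = embed(candidate["text"])
    for text in existing_texts:
        if _cosine(vec, embed(text)) > DUP_SIMILARITY:
            return False    # already known, or close enough to be noise
    return True
```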

Conversation summarization is its own discipline. Rolling a long thread into a paragraph that the next message can usefully build on is a real prompt-engineering problem. The summary needs to keep the names, the decisions, the open questions, and the emotional register, and it needs to drop everything else. I rewrote the summarization prompt seven times. The version that works has plain instructions about what to keep and what to drop, and a few-shot example of a good summary of a typical conversation shape.
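
A rough sketch of the prompt's shape (illustrative wording, not the exact prompt): explicit keep/drop rules plus a few-shot example of the target summary.

```python
# Illustrative shape of a summarization prompt: explicit keep/drop rules
# and a few-shot example. Wording here is a sketch, not the exact prompt.
SUMMARIZE_PROMPT = """\
Summarize the conversation below into one paragraph.
Keep: names of people and projects, decisions made, open questions,
and the overall emotional register.
Drop: small talk, resolved tangents, and anything already captured
in the running summary.

Example:
Conversation: <a long planning thread>
Summary: <one paragraph naming the people, the decision reached,
and the questions still open>

Running summary so far:
{summary}

Conversation:
{conversation}
"""
```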

Recency is a feature, not a parameter. The thing I knew two days ago is more relevant than the thing I knew two months ago, almost always. The vector store doesn't natively know that; it ranks by similarity. I ended up adding a plain recency weighting on retrieval that decays slowly over weeks and faster over months. The weighting matters more than the embedding model choice. Counterintuitive, but consistently true in my testing.
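
One illustrative way to express that weighting: an age-based decay multiplied into the similarity score, gentle inside the first month and steeper after. The exact curve is a tuning choice, not the one true answer.

```python
# Illustrative recency weighting: decay gently over the first month,
# faster after that, and multiply it into the similarity score.
import math
from datetime import date

def recency_weight(stored_on: date, today: date) -> float:
    age = (today - stored_on).days
    if age <= 30:
        return math.exp(-age / 60)                        # slow decay over weeks
    return math.exp(-30 / 60) * math.exp(-(age - 30) / 20)  # faster decay over months

def rank_score(similarity: float, stored_on: date, today: date) -> float:
    return similarity * recency_weight(stored_on, today)
```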

Forgetting is a feature too. The default state of every memory layer is "keep everything forever," and that's the wrong default. The structured facts layer needs plain edits and deletions. The vector store needs a pruning pass that drops chunks that have never been retrieved and are old enough to be stale. The long-term structured memory needs a periodic review where I look at what got written and delete the things that shouldn't have been. Without forgetting, the system gets noisier over time, not better.
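
The vector-store pruning pass is the easiest of these to automate. A sketch, assuming each chunk carries a stored-on date and a retrieval counter (both fields and the cutoff are illustrative):

```python
# Pruning sketch: drop chunks that have never been retrieved and are older
# than a staleness cutoff. Field names and the cutoff are illustrative.
from datetime import date, timedelta

STALE_AFTER = timedelta(days=180)

def prune(store: list[dict], today: date) -> list[dict]:
    kept = []
    for chunk in store:
        never_used = chunk.get("retrieval_count", 0) == 0
        stale = (today - chunk["stored_on"]) > STALE_AFTER
        if never_used and stale:
            continue  # drop it
        kept.append(chunk)
    return kept
```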

What this gets you that a chat interface doesn't

The difference, in daily use, is that the assistant feels like it knows me. Not in a creepy way, not in a "it has my whole life" way. In a "I don't have to re-explain context every time I open a session" way.

I can say "what was that thing I asked you about last week with the deployment script" and it pulls the conversation back. I can say "remind me what I decided about the Q2 planning rollup" and it has the decision because the decision got extracted to the long-term store. I can say "set up the usual Tuesday morning routine" and it knows what the usual Tuesday morning routine is because that's in the structured facts file.

None of those are model capability gaps. They're memory architecture decisions. The model that powers the assistant is the same on Tuesday as it was on Monday. What changed is what the model has access to when it composes a response.

This connects back to the encoding-a-person thread I've been pulling on for a while. The model isn't what makes the assistant feel like it's yours. The accumulated, curated, structured representation of you is. The memory is the thing that gets close to encoding. The model is the engine that operates on it.

Where I'd start if I were starting again

A few specific recommendations, for the people I know who are trying to build the same kind of thing.

Start with the structured facts file. Write down the things about you that don't change. Put it in the system prompt. This single change does more for assistant quality than almost anything else and it takes an hour. If you don't get any further than this, you've still meaningfully upgraded what a baseline chat interface does.

Add the conversation buffer and summarization next. Not the vector store. The buffer is what makes the assistant coherent within a session, and most of the daily value of an assistant is within-session value. The vector store comes later and matters less than the tutorials make it sound.

Build the long-term structured memory with the dial set to conservative. Extract sparingly. Review what got written. Delete aggressively. The store gets useful when it's curated, not when it's full.

Save the vector store for last. It's the most fashionable layer and the layer with the highest expectation-to-actual-value gap. It's worth having. It's not worth starting with.

Treat privacy as a first-class design constraint. Everything I've described here lives on my machine. The structured facts, the conversation history, the vector store, the long-term memory, all local. The model can be local or hosted, but the memory stays mine. I've written about privacy by design for personal AI and the local-first second brain pattern, and both apply here directly. The memory is the asset. The memory is what you protect.

The closing thought

Most of what's marketed as "AI assistant" right now is a chat interface with a thin wrapper of integrations. The product feels stateless because it mostly is. Each session starts from zero and the assistant has to be told who you are and what you're doing, every time.

The version that actually works in daily life (the one I now genuinely use as a daily driver) is the one with memory built in from the ground up. Four layers, each doing what it's good at, composed carefully, with forgetting treated as a first-class operation and writing treated as conservatively as reading.

It's more work than the demos suggest. It's also the only path I've found to an assistant that's worth the trust you put in it. The model gets you to "useful sometimes." The memory gets you to "useful daily." That's the gap that matters, and it's the gap most projects aren't crossing.

Build the memory. The rest is comparatively easy.