A local-first second brain: how I built mine
The practical version of the second-brain piece: what I actually run on Apple Silicon, how notes get captured, how search and RAG work against my personal corpus, which models do which jobs, and what the daily loop feels like once it's in place.
The team-second-brain piece in January was about the pattern. The memory piece earlier this month was about the architecture. This one is the practical version, the actual second brain I run for myself, on the actual hardware I own, with the actual models doing the actual work. No abstractions. The thing as it sits.
The premise hasn't changed since I first wrote about a personal corpus three years ago. I want a single searchable store of everything I've written, read, decided, and noted, queryable by an AI that runs on my hardware, against a corpus that never leaves it. The premise was aspirational in 2023. In 2026 it's a weekend project that holds up in daily use.
The hardware path
The whole thing runs on four boxes I've written about before in the self-hosted stack post. For the second brain specifically, three of them do the work.
core-01. Mac Studio M4 Max, 64 GB. The inference workhorse. Holds the bigger model weights, runs the embedding workers when a re-index is needed, hosts the vector store and the structured-memory database. Always on, wired to the switch, fans audible only when a re-embed is running.
node-01. Mac mini M4, 16 GB. The supporting cast. Runs the always-on small models, embeddings for streaming writes, the OCR pass for scanned PDFs, a small classification model that decides whether a captured note is worth indexing. Cheap, quiet, load-bearing.
laptop-01. MacBook Pro M4 Pro, 48 GB. The actual interface. Where I write, where I read, where the capture happens, where the queries originate. Has a local fallback inference setup so the assistant degrades gracefully when I'm off the home network.
store-01. Synology DS1019+. Backs all of it up. The vector store, the structured-memory database, the raw corpus, the embeddings, the model weights, all snapshotted on schedule, replicated off-site nightly.
No GPU server. No engine-01 in the closet. The Apple Silicon path is genuinely sufficient for a one-person second brain in 2026, and the absence of a separate GPU box is a feature, not a compromise.
Note capture
The capture layer is the part most second-brain projects underestimate. If capture is high-friction, the corpus dies a quiet death and you go back to whatever you were doing before.
What I actually use:
Plain Markdown in a single directory tree. Everything lives in ~/brain/ on laptop-01, organized by year and topic. Obsidian on top of it for the editing surface, but the foundation is plain files. The choice of editor stays reversible; the choice of foundation doesn't.
A quick-capture hotkey. Command-shift-N drops a new note with a timestamp and an open cursor. Sub-second. The friction-floor for "I'd write this down if it were free" needs to actually be near free, or it doesn't get written down.
Voice capture. When I'm walking or driving, I dictate into a small app that pipes audio to node-01's local Whisper instance, which writes a transcript into the same ~/brain/ tree under a voice/ subdirectory. The transcript is rough but searchable, which is what matters.
Browser clippings. A small extension on the laptop saves whatever I'm reading (URL, title, my highlight, my note) to a clippings/ subdirectory. Same shape as everything else: Markdown file, timestamp, metadata at the top.
Email and calendar. The selectively imported subset. Not every email; just the threads I've flagged for follow-up and the calendar entries that have substance. A small n8n workflow on store-01 pulls these on a schedule and writes them as Markdown into ~/brain/correspondence/ and ~/brain/calendar/.
Total daily capture friction: roughly zero. The corpus grows on its own as a byproduct of the work I'd be doing anyway.
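For concreteness, here's a minimal sketch of the quick-capture path: a timestamped Markdown file with metadata at the top, the same basic shape the clippings and voice transcripts land in. The ~/brain/inbox/ path, the frontmatter fields, and the editor hand-off are illustrative, not my exact setup.

```python
#!/usr/bin/env python3
"""Minimal quick-capture sketch: one timestamped Markdown note per invocation.

The ~/brain/inbox/ path and the frontmatter fields are illustrative.
"""
from datetime import datetime
from pathlib import Path
import subprocess
import sys

BRAIN = Path.home() / "brain" / "inbox"

def capture(source: str = "hotkey") -> Path:
    """Create an empty note with a timestamp and capture source at the top."""
    BRAIN.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    note = BRAIN / f"{now:%Y-%m-%d-%H%M%S}.md"
    note.write_text(
        f"---\ncreated: {now.isoformat(timespec='seconds')}\nsource: {source}\n---\n\n"
    )
    return note

if __name__ == "__main__":
    path = capture(sys.argv[1] if len(sys.argv) > 1 else "hotkey")
    # Hand off to whatever the hotkey is bound to; "open" is the macOS launcher.
    subprocess.run(["open", str(path)])
```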
The indexing pipeline
What sits between the corpus and the AI is the indexing layer, and it's the layer that took the longest to settle.
When a file lands in ~/brain/, an fswatch process on laptop-01 picks it up and ships it to core-01 over the LAN. core-01 does three things with it.
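A minimal sketch of that watcher, written against the Python watchdog library rather than fswatch, and assuming a hypothetical /index endpoint on core-01; the real shipping step could just as well be rsync.

```python
"""Watch ~/brain/ and ship new or changed Markdown files to the indexer.

watchdog stands in for fswatch; the core-01 URL and /index endpoint are hypothetical.
"""
from pathlib import Path
import time

import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

INDEXER = "http://core-01.local:8800/index"  # hypothetical indexer endpoint
BRAIN = Path.home() / "brain"

class ShipToIndexer(FileSystemEventHandler):
    def on_created(self, event):
        self._ship(event)

    def on_modified(self, event):
        self._ship(event)

    def _ship(self, event):
        path = Path(event.src_path)
        if event.is_directory or path.suffix != ".md":
            return
        payload = {
            "path": str(path.relative_to(BRAIN)),
            "text": path.read_text(errors="ignore"),
        }
        requests.post(INDEXER, json=payload, timeout=5)

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(ShipToIndexer(), str(BRAIN), recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```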
Chunk and embed. The file gets split into semantic chunks: paragraphs for prose, function/class boundaries for code, sensible sections for everything else (a rough chunking sketch follows below). Each chunk gets embedded by a small open-weights embedding model running on node-01. The embeddings go into a local vector store on core-01.
Extract structured facts. A small extraction model (currently the Llama 4 8B distill) reads the chunk and pulls out any discrete propositions worth storing as structured memory. Most chunks contribute nothing. The bar is high, by design, for the reasons I went into in the memory architecture piece. What does get extracted lands in a small SQLite database alongside the vector store.
Index for keyword search. A plain Tantivy index keeps a full-text view of the corpus. Vector search is good at semantic similarity; keyword search is still better at "find me the literal phrase I wrote down", and the second brain needs both.
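Here's the rough chunking sketch promised above, for the prose case only; the code-aware and section-aware splits are omitted.

```python
"""Rough paragraph-level chunker for prose notes.

A sketch of the "semantic chunks" step; the real pipeline also splits code on
function/class boundaries, which is left out here.
"""

def chunk_markdown(text: str, max_chars: int = 1200) -> list[str]:
    """Split on blank lines, then merge paragraphs until a size cap is hit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```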
The indexing runs incrementally. A new note is searchable within a few seconds of being saved. A full re-index (when I change the embedding model or the chunking strategy) takes about forty minutes for the current corpus, which is around 12,000 documents and growing.
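For the extraction step, the flow is roughly the following; the chat-completions URL, the model identifier, the prompt, and the facts table are all illustrative stand-ins, not the real pipeline's.

```python
"""Sketch of the fact-extraction step: chunk in, zero or more propositions out.

Everything named here (endpoint, model id, schema) is hypothetical; only the
shape of the flow matches the pipeline described above.
"""
import json
import sqlite3

import requests

EXTRACTOR = "http://core-01.local:8080/v1/chat/completions"  # hypothetical endpoint

PROMPT = (
    "Extract discrete, durable facts worth remembering from the note below. "
    'Return JSON of the form {"facts": ["..."]}. Return an empty list if nothing qualifies.\n\n'
)

def extract_facts(chunk: str, source: str, db_path: str = "memory.db") -> int:
    """Ask the extraction model for facts, store them alongside their source."""
    resp = requests.post(EXTRACTOR, json={
        "model": "llama-4-8b-extract",   # illustrative model identifier
        "messages": [{"role": "user", "content": PROMPT + chunk}],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }, timeout=60)
    facts = json.loads(resp.json()["choices"][0]["message"]["content"]).get("facts", [])

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS facts (fact TEXT, source TEXT)")
    con.executemany("INSERT INTO facts VALUES (?, ?)", [(f, source) for f in facts])
    con.commit()
    con.close()
    return len(facts)
```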
Search and RAG
The query path is where the second brain earns its keep. Three retrieval modes, picked by what I'm doing.
Keyword search. When I know the words. Tantivy returns matches in milliseconds. This is more useful more often than the AI-native framing suggests; most "I know I wrote that down somewhere" lookups are keyword lookups.
Vector search. When I know the shape but not the words. Query gets embedded, top-K chunks come back. Useful for "find me my prior thinking on the topic of X." Less useful for "find me the specific decision I made on Tuesday."
RAG-grounded answers. When I want synthesis, not retrieval. The query goes to a router on core-01, which pulls relevant chunks from both indexes, weights them with a recency decay, packs the top selection into context, and hands the whole thing to a local model for generation. The model writes an answer grounded in the chunks, with citations back to the source files. The interface is a small TUI on the laptop and a more polished one in the assistant chat.
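A sketch of that query path, with the two retrievers and the generator passed in as stand-ins, since the real index and model-server calls live behind the gateway:

```python
"""Sketch of the RAG path: retrieve from both indexes, pack a context, generate.

The retriever and generator callables are stand-ins for the Tantivy index, the
vector store, and the local model server; none of the names are the real ones.
"""
from typing import Callable

Chunk = dict  # expected keys: "text", "path", "score"

def answer_query(
    query: str,
    keyword_search: Callable[[str, int], list[Chunk]],
    vector_search: Callable[[str, int], list[Chunk]],
    generate: Callable[[str], str],
    k: int = 8,
    budget_chars: int = 8000,
) -> str:
    # Pull candidates from both retrieval paths; normalizing the two score
    # scales against each other is glossed over here.
    candidates = keyword_search(query, k) + vector_search(query, k)
    candidates.sort(key=lambda c: c["score"], reverse=True)

    # Pack the best chunks into the context budget, keeping paths for citations.
    context, used = [], 0
    for c in candidates:
        if used + len(c["text"]) > budget_chars:
            break
        context.append(f"[{c['path']}]\n{c['text']}")
        used += len(c["text"])

    prompt = (
        "Answer the question using only the excerpts below. "
        "Cite the bracketed file paths you draw on.\n\n"
        + "\n\n".join(context)
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)
```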
The recency weighting matters. The vector store's notion of similarity is timeless; my actual use is overwhelmingly weighted toward what I wrote in the last few months. A slow decay over weeks and a faster one over months bring the retrieval distribution closer to what I'd reach for myself.
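As a sketch of that weighting, with illustrative half-life constants; the shape (gentle while a note is recent, steeper once it's months old) is the point, not the numbers.

```python
"""Recency weighting sketch: similarity discounted by note age."""
def recency_weight(age_days: float) -> float:
    # A gentle decay that always applies, plus a steeper one on the tail
    # past ~90 days. Half-lives are illustrative tuning knobs.
    slow = 0.5 ** (age_days / 180.0)
    fast = 0.5 ** (max(age_days - 90.0, 0.0) / 45.0)
    return slow * fast

def score(similarity: float, age_days: float, mix: float = 0.3) -> float:
    # Blend raw similarity with the recency weight; mix is another knob.
    return (1 - mix) * similarity + mix * recency_weight(age_days)
```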
Model selection
Which model does which job is settled enough to write down.
Embeddings. A small open-weights embedding model on node-01, MLX-served. Embeddings are the workload where being on the LAN dominates everything else, and the small model is plenty good; the gain from a larger embedding model is consistently smaller than the gain from getting the chunking right.
Extraction. Llama 4 8B on core-01, called with a tight prompt and a JSON-schema response. The 8B is fast enough for incremental indexing and conservative enough not to flood the structured-memory store with noise.
RAG generation, default. Mistral 3 22B at 4-bit on core-01. Fast, smart enough for the bulk of synthesis work, doesn't make me wait. Same model I wrote up in the March laptop rundown as the daily-driver default.
RAG generation, heavy. Llama 4 70B at 4-bit on core-01. When the query is doing real synthesis across many sources, or when I need the answer to be unusually careful, the gateway routes here. Slower per token; better quality where it counts.
Reasoning. Kimi K2 Thinking 32B distill on core-01. The rare query where I want the model to think out loud (a planning question, a structured analysis over the corpus) goes here. Not the default; the specialist.
Hosted models do not see the corpus. Ever. That's a hard rule. The privacy posture I laid out in the privacy-by-design piece holds: the corpus is the asset, the corpus stays on the LAN, the corpus is what I protect.
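A sketch of what the routing decision can look like; the criteria here (source count, an explicit reasoning flag) are illustrative, not the gateway's actual rules, and the model identifiers are just labels for the roles above.

```python
"""Sketch of generation-model routing between the roles described above."""

MODELS = {
    "default": "mistral-3-22b-q4",        # fast daily-driver synthesis
    "heavy": "llama-4-70b-q4",            # careful multi-source synthesis
    "reasoning": "kimi-k2-thinking-32b",  # think-out-loud planning/analysis
}

def pick_model(num_sources: int, wants_reasoning: bool = False) -> str:
    """Illustrative routing: reasoning flag first, then breadth of sources."""
    if wants_reasoning:
        return MODELS["reasoning"]
    if num_sources > 6:
        # Real synthesis across many notes: take the slower, better model.
        return MODELS["heavy"]
    return MODELS["default"]
```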
The daily loop
What this actually feels like in use, on a normal Tuesday.
I open the laptop, I'm in the middle of something, I want to remember what I decided last week about a project I'm running. I type the question into the assistant chat. The router pulls chunks from the index, the local model writes an answer with three citations back to the source notes, the whole round-trip takes about four seconds. The answer is right because the corpus has the actual prior decision in it, not the model's guess at what I might have decided.
Mid-morning I'm reading a long PDF. I drag it into the brain directory, and while node-01 OCRs the scanned pages, core-01 chunks and embeds. By the time I'm done reading it, the document is queryable.
Afternoon I'm writing. I want to make sure I'm not repeating a point I made in an earlier piece. I run the draft through a small "find prior overlap" tool that queries the corpus and surfaces semantically adjacent paragraphs. Most of them are fine. One of them is too close; I rework the passage.
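That overlap check is a few lines on top of the existing retrieval stack; a sketch, with the embedding call and the nearest-neighbour lookup passed in as stand-ins and an illustrative similarity threshold:

```python
"""Sketch of the "find prior overlap" check: flag draft paragraphs that sit
too close to something already in the corpus.

embed and nearest stand in for the embedding model and the vector store;
the 0.85 threshold is illustrative.
"""
from typing import Callable

def find_overlap(
    draft: str,
    embed: Callable[[str], list[float]],
    nearest: Callable[[list[float]], tuple[str, float]],  # -> (source_path, cosine)
    threshold: float = 0.85,
) -> list[tuple[str, str, float]]:
    hits = []
    for para in (p.strip() for p in draft.split("\n\n") if p.strip()):
        path, cosine = nearest(embed(para))
        if cosine >= threshold:
            hits.append((para, path, cosine))
    return hits
```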
End of the day I dictate a few notes on the walk home. By the time I'm back at the desk, the transcripts are in the corpus and findable.
The corpus is doing work. The AI is the interface to the corpus, not the source of the value. The model can be swapped; the corpus can't.
What I'd do differently if starting today
A short list, because the question comes up.
Start with the capture before the AI. A second brain with no notes is a clever architecture solving no problem. Get the capture friction to zero first. Run plain search over the result for a few weeks. Only then add the embedding-and-retrieval layer on top of a corpus that's already proving its keep.
Resist the temptation to embed everything immediately. Most notes never get retrieved. The vector store gets noisier the more it has in it. Prefer a curated subset that grows deliberately over a flood that grows automatically. I learned this the slow way.
Treat the structured-memory store as the more important asset. The vector store is the fashionable layer. The extracted-facts database is the one that punches above its weight, and the one most projects skip. It takes longer to dial in. It's worth the time.
Keep the foundation plain. Markdown files in a directory tree, version-controlled, plain-text. Every fancy layer above that (Obsidian, the indexer, the gateway, the chat surface) should be replaceable without losing the corpus. The corpus is the thing you can't rebuild.
The closing thought
The second-brain concept floated around for a decade before the local-AI foundation caught up to it. The patterns worked on paper and broke in practice: search wasn't smart enough, the cloud version compromised privacy, the local version compromised quality. In 2026 those compromises mostly aren't there. Apple Silicon got fast enough, the open-weights models got good enough, and the tools around indexing and retrieval matured.
The result, in my case, is a second brain that I actually use, not a project I'm proud of and never open, but a daily-driver tool that has changed what "I'll remember that" means for me. The remembering happens. The retrieval works. The synthesis is good enough that I trust it for first-pass thinking and verify it from the citations when the stakes are high.
It's not a polished product. It's a stack of small pieces, each doing what it's good at, sitting on hardware I already owned, against a corpus that's mine and stays mine. That's the version of the personal AI promise that I find actually keeps the promise.
Build the corpus. Make the retrieval boring. Pick the models that fit your hardware honestly. The rest is wiring.