Federated retrieval: when RAG outgrows the laptop
Retrieval-augmented generation works well when the corpus fits on one machine. The honest version of what to do when the corpus outgrows that, without rebuilding the whole stack in the cloud, is more interesting than either the all-local or all-cloud framing suggests.
The retrieval-augmented generation pattern is well understood at the small scale: index your corpus into a vector store, retrieve relevant chunks at query time, hand them to the model. It works great when the corpus fits in memory or on one machine. It works less well when the corpus is "everything I've ever written, every email I've ever sent, every document I've worked on", which for a serious user grows past what one machine handles cleanly within five years.
The honest version of what to do when this happens isn't "move it all to cloud" (the privacy and latency math that justified local in the first place still applies) and isn't "buy a bigger machine" (the bigger machine eventually hits the same problem at a higher cost). It's federated retrieval: split the corpus across multiple stores, query them in parallel, merge the results. Let me get concrete about what works and what doesn't.
[Figure: One query, many retrieval sources. A natural-language user query hits a router that decides which sources to ask: a local vector DB (private, fast), a team knowledge base (scoped), and a cloud frontier model (broad, slow). A merge + rerank step picks the best answers across all sources. Local first, cloud only when needed; the router is the budget.]
When the laptop becomes the constraint
A few specific points where single-machine RAG starts to feel cramped.
The corpus exceeds working memory. The index no longer fits in RAM, and disk-backed vector queries become the bottleneck. For a corpus of a few hundred thousand documents, this happens around the 32-64 GB indexed-size threshold, depending on embedding dimensions and how much metadata you're carrying.
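To make that threshold concrete, here's the back-of-envelope arithmetic. Every number below is an illustrative assumption, not a measurement:

```python
# Rough index sizing. Chunk counts, embedding width, and index overhead
# all vary by setup; these values are assumptions for illustration.
docs = 300_000          # "a few hundred thousand documents"
chunks_per_doc = 15     # assumed average after splitting
dims = 1536             # a common embedding width
bytes_per_float = 4     # float32

raw = docs * chunks_per_doc * dims * bytes_per_float
overhead = 1.5          # assumed multiplier for index structures + metadata

print(f"raw vectors:         {raw / 2**30:.1f} GiB")
print(f"with index overhead: {raw * overhead / 2**30:.1f} GiB")
# ~25.7 GiB raw, ~38.6 GiB indexed -- inside the 32-64 GB band
```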
Reindex time becomes prohibitive. Updating embeddings (because you switched embedding models, or your data changed structurally) takes hours and forces a hot/cold cutover. At single-machine scale this is annoying; at corpus-of-everything scale it becomes a real operational problem.
Index loading at startup takes minutes. Restarting the inference stack (after a model swap, a config change, anything) means waiting for the index to come back online. Rare events become unpleasant.
The corpus naturally splits along access patterns. Some documents are queried many times a day; some are queried once a year. Treating them with the same hot-storage approach wastes resources on the cold ones. Treating them with the same cold-storage approach makes the hot ones slow.
These are the symptoms. The underlying cause is the same: the one-machine RAG pattern doesn't scale gracefully past a certain corpus size.
The federated pattern, in shape
Federated retrieval splits the corpus into multiple indexes, each living wherever it makes sense, and queries them in parallel.
Hot local index: the recent / frequently-accessed corpus. Lives on the inference machine itself or on a fast NAS volume. Optimized for query latency. Updated continuously.
Warm local index: the larger historical corpus. Lives on the NAS, queried over the local network. Optimized for capacity and throughput rather than absolute latency. Updated on a regular cadence.
Cold archive index: the very large historical or reference corpus. Lives wherever it fits, possibly on a different NAS volume, possibly in object storage, possibly in cloud. Optimized for capacity and cost. Queried rarely.
Federated query layer: receives the user's query, fans it out to the appropriate set of indexes (which depends on the query's likely target: recent work goes to hot, historical research to warm, archive lookups to cold), and merges the results.
The pattern looks complicated written down. In practice it's a few hundred lines of glue code over standard vector stores.
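A minimal sketch of that glue, assuming each store client exposes a `search(prompt, k)` method returning `(text, score)` pairs. The store interface, names, and `FederatedRetriever` class are stand-ins for illustration, not any real library's API:

```python
# Minimal federated fan-out: query the selected stores in parallel,
# collect the hits. Store clients are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float
    store: str

class FederatedRetriever:
    def __init__(self, stores):
        self.stores = stores  # e.g. {"hot": ..., "warm": ..., "cold": ...}

    def query(self, prompt, targets, k=20):
        """Fan the query out to the targeted stores and gather the hits."""
        def ask(name):
            return [Hit(t, s, name)
                    for t, s in self.stores[name].search(prompt, k)]
        with ThreadPoolExecutor() as pool:
            batches = list(pool.map(ask, targets))
        hits = [h for batch in batches for h in batch]
        # Naive merge by raw score; real merging normalizes per store
        # and reranks -- see the merge sketch further down.
        return sorted(hits, key=lambda h: h.score, reverse=True)[:k]
```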
What works in production
Concrete patterns from running variants of this on my home setup and seeing similar shapes elsewhere.
Per-store consistency over global consistency. Each index has its own update cadence, its own embedding model version, its own metadata schema. The federated layer handles the merge. Trying to keep everything globally consistent is more pain than it's worth.
Result merging by score normalization plus rerank. Different indexes return scores on different scales. Normalizing the scores against each store's distribution, then reranking the merged top-N with a smaller cross-encoder, beats trying to use the raw scores directly.
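A sketch of that merge, reusing the `Hit` type from the fan-out sketch above. The `cross_encoder(query, text)` callable is an assumed interface for whatever reranker you run:

```python
# Z-normalize each store's scores against that store's own result
# distribution, then rerank the merged top-N with a cross-encoder.
import statistics

def normalize_per_store(hits):
    by_store = {}
    for h in hits:
        by_store.setdefault(h.store, []).append(h)
    out = []
    for store_hits in by_store.values():
        scores = [h.score for h in store_hits]
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0  # guard: single hit / ties
        for h in store_hits:
            out.append(Hit(h.text, (h.score - mu) / sigma, h.store))
    return out

def merge_and_rerank(hits, query, cross_encoder, top_n=20, final_k=8):
    candidates = sorted(normalize_per_store(hits),
                        key=lambda h: h.score, reverse=True)[:top_n]
    # cross_encoder(query, text) -> relevance score; assumed interface
    rescored = [(cross_encoder(query, h.text), h) for h in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in rescored[:final_k]]
```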
Query routing by classifier. A small classifier (well-served by a small local model) decides which indexes to query for a given user prompt. Sending every query to every index works but wastes resources; routing intelligently is most of the win.
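In production the router is a small local model prompted to emit a tier label; the keyword heuristic below is a stand-in so the sketch runs without one, and the trigger words are invented for illustration:

```python
# A toy router. The tier names match the stores in the fan-out sketch.
RECENT = ("today", "yesterday", "this week", "draft", "current")
ARCHIVE = ("years ago", "archive", "old", "back in")

def route(prompt: str) -> list[str]:
    p = prompt.lower()
    if any(w in p for w in ARCHIVE):
        return ["warm", "cold"]
    if any(w in p for w in RECENT):
        return ["hot"]
    return ["hot", "warm"]  # default: skip cold unless asked

assert route("find my draft from this week") == ["hot"]
```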
Staged migration between stores. New documents get added to the hot store first. They migrate to warm on a schedule (weekly, monthly) and to cold on a longer schedule (yearly). The migration is visible to the federated layer; older versions get dropped from the hotter stores. A sketch of the migration job follows.
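Here's what that job might look like. `documents_older_than`, `add`, and `delete` are assumed store methods, not a real API:

```python
# Scheduled tier migration, sketched. Documents move hot -> warm -> cold
# by age, and the hotter copy is dropped once the move lands, so each
# document lives in exactly one index.
from datetime import datetime, timedelta, timezone

def migrate(src, dst, older_than: timedelta):
    """Move documents older than the cutoff from src to dst."""
    cutoff = datetime.now(timezone.utc) - older_than
    for doc in src.documents_older_than(cutoff):  # assumed method
        dst.add(doc)                              # index into the colder tier
        src.delete(doc.id)                        # drop from the hotter tier

# e.g. weekly: migrate(hot, warm, timedelta(days=90))
#      yearly: migrate(warm, cold, timedelta(days=3 * 365))
```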
Fallback semantics on store unavailability. When the cold index is offline (because the cold storage is asleep, the network is down, the cold service is restarting), the federated layer returns results from the available stores with a flag noting the partial coverage. The user sees "results from local; archive unavailable" rather than a hard failure.
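Sketched on top of the fan-out above, partial coverage looks like this; the broad `except` is deliberate, since the point is to degrade rather than fail:

```python
# Fallback semantics: a store that errors out is reported as uncovered
# instead of failing the whole query. Store interfaces as assumed above.
def query_with_coverage(retriever, prompt, targets, k=20):
    hits, unavailable = [], []
    for name in targets:
        try:
            hits += [Hit(t, s, name)
                     for t, s in retriever.stores[name].search(prompt, k)]
        except Exception:  # store asleep, network down, service restarting
            unavailable.append(name)
    return hits, unavailable

# The caller surfaces: "results from hot, warm; archive unavailable"
```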
What doesn't work
A few patterns that look attractive and fail in practice.
Single-tier "just use a bigger vector store." Throws hardware at the problem; doesn't address the access-pattern split. The hot queries get slow because they share infrastructure with the cold queries; the cold queries get expensive because they're on hot-tier storage.
Re-ranking with the inference model itself. Tempting because the inference model is right there. In practice the cost compounds badly: each query becomes "fan out, retrieve, ask the model to score, ask the model to answer." The model-as-reranker pattern works at small scale and gets expensive fast.
Cross-store deduplication at query time. Trying to dedupe the same document appearing in multiple indexes during query. Better to dedupe at index time (each document lives in one index based on its access pattern) than at query time (every query pays the dedupe cost).
Embedding-model variation across stores without versioning. Different stores using different embedding models is fine if you track which is which and account for the score differences. Without versioning, the merge gives you junk because the comparison isn't apples-to-apples.
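The version tracking can be as small as a metadata record per store, checked before any raw-score comparison; the model names and dimensions below are illustrative:

```python
# Per-store embedding metadata. A mismatch means raw scores are not
# apples-to-apples: the merge must go through per-store normalization.
from dataclasses import dataclass

@dataclass
class StoreMeta:
    name: str
    embedding_model: str  # illustrative names below
    dims: int

def comparable(a: StoreMeta, b: StoreMeta) -> bool:
    return a.embedding_model == b.embedding_model and a.dims == b.dims

hot = StoreMeta("hot", "embed-v3", 1024)
cold = StoreMeta("cold", "embed-v2", 768)
assert not comparable(hot, cold)  # raw-score merge here would be junk
```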
Where this fits in the personal-AI architecture
For a serious personal-AI deployment, the federated retrieval pattern is the right answer once the corpus moves past a few hundred thousand documents. Below that, single-machine RAG with a good vector store is fine. Above it, federated.
The pattern also interacts cleanly with the memory-hygiene discipline: scoped retrieval against the right tier of index handles the privacy-and-relevance question better than single-tier scoped retrieval does. A "search my drafts" query goes to the hot local index; a "search the archive for that thing from years ago" query goes to the cold index; both stay within the secrets-isolation patterns the broader stack enforces.
The home-setup version
For the home AI setup I've been describing, the federated layout looks something like:
- Hot index on the Mac Studio's local SSD: recent notes, current work-in-progress, the last quarter of correspondence. A few hundred thousand chunks. Updated continuously by a daemon.
- Warm index on the Synology NAS: everything from the last few years. Tens of millions of chunks. Updated nightly.
- Cold index on a separate Synology volume that spins down when idle: the archive. Hundreds of millions of chunks. Updated monthly.
- Federated query layer running as a small service on the Mac Studio: routes by query classification, merges by normalized scores plus a small cross-encoder rerank. The whole layout, as config, is sketched below.
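As config, that layout might look like this; the locations, cadences, and chunk counts are illustrative for this particular setup:

```python
# The home layout as data, so the federation layer stays declarative.
# All locations and counts are illustrative assumptions.
TIERS = {
    "hot":  {"where": "mac-studio-ssd", "update": "continuous",
             "approx_chunks": 500_000},
    "warm": {"where": "nas-volume1",    "update": "nightly",
             "approx_chunks": 30_000_000},
    "cold": {"where": "nas-volume2 (spins down when idle)",
             "update": "monthly", "approx_chunks": 300_000_000},
}
```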
Total infrastructure cost: zero beyond what the home setup already had. Operational cost: a few hours per quarter to handle index migration and embedding-model upgrades. Value: the assistant has access to the full corpus while keeping the hot queries fast and the cold-storage costs amortized.
What I'd recommend
For someone whose RAG setup is starting to feel cramped at single-machine scale:
- Don't immediately reach for cloud. The federated-local pattern handles meaningful corpus sizes without leaving the local network.
- Split by access pattern, not by content type. Hot/warm/cold based on how often the data gets queried, not based on what kind of data it is.
- Build the federation layer thin. A few hundred lines of routing-merge-rerank code beats a heavy framework most of the time.
- Track embedding-model versions per store. The mismatch case is the most common failure mode for federated retrieval; clear version tracking prevents it.
- Plan for the cold tier to be unavailable sometimes. Graceful degradation beats hard failure.
Federated retrieval isn't exotic. It's the right pattern once the corpus outgrows the laptop. Most personal-AI users will hit this within a few years if they're using the assistant for real work; the architecture pattern that survives the growth is worth setting up before it becomes the binding constraint.