Encoding a person: what would it actually take?

If the goal is a working representation of how a specific person reasons in their domain, what's the actual stack? Walking through corpus, model, retrieval, eval, and where each piece breaks today.


The earlier posts in this thread were about whether the foundation exists. The knowledge-as-a-service thought experiment sketched what the artifact might look like; style-is-not-knowledge pulled apart the easier capability from the harder one; the marketplace piece walked through the supply-side gap. They covered the why and the wouldn't-it-be-interesting-if. This post is the engineering question: assume you actually want to build something that encodes how a specific person reasons in their domain. What does the stack look like? Where does each layer break? What's a defensible MVP?

The honest answer is "nobody has built this end-to-end and the open problems outnumber the solved ones." That said, the components have moved fast enough in the last six months that walking through the stack is worth doing, partly to clarify what's actually within reach today, partly to mark what isn't.

What "encoding a person" needs to mean

Before the stack, the definition. Three different things often get bundled under "personal AI":

  • A model that writes in a person's voice, style transfer, mostly solved at demo level (covered in the GPT-4 first-impressions post)
  • A model that knows what a person has written, retrieval over a corpus, mostly an integration problem
  • A model that reasons the way a person reasons about novel situations in their domain, mostly an open problem

This post is about the third thing, with the first two as components. The goal: a system that, given a question the source person hasn't published an answer to, produces an answer that the source person would consider directionally correct and meaningfully shaped by their actual judgment, not just generic LLM output in their cadence.

That's a hard goal. Most of the difficulty is in the last clause.

The stack, layer by layer

Layer 1, the corpus

Everything starts with the corpus. The first reflex is "use their published writing", blog posts, articles, books, talks. That's the obvious tier. It's also insufficient.

A working corpus probably looks like this, in tiers of decreasing quality and increasing cost-to-collect:

Tier 1, published, finished work. Blog posts, articles, books, talk transcripts. High quality, low quantity for most people, easy to collect. Captures considered positions. Misses everything off-the-cuff.

Tier 2, public conversations and Q&A. Podcast appearances, conference Q&A, Twitter threads, forum participation, StackOverflow answers. Lower formality, higher volume, more revealing of how the person reasons in real time. Harder to collect but increasingly available.

Tier 3, semi-public artifacts. Slide decks, internal-but-shareable docs, code repositories, email lists they've contributed to. This is where the actual day-to-day reasoning lives. Hard to collect because it's scattered and rights are murky.

Tier 4, instrumented capture. Going forward from a starting date: structured interviews, decision logs, "think-aloud" sessions where the person walks through novel problems with the goal of capturing the reasoning. This is the only tier that can capture reasoning the person hasn't already externalized. It's also the only tier that requires their active participation.

Most of the early experiments in this space have been Tier-1-only. That's why the outputs feel like style with no substance. The substance lives further down the tiers.
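None of this tiering matters unless provenance is tracked per document from day one. A minimal sketch of what that bookkeeping might look like, with hypothetical field names (there's no standard schema for this):

```python
from dataclasses import dataclass

@dataclass
class CorpusDoc:
    """One document in the corpus, tagged with tier and provenance."""
    doc_id: str
    tier: int    # 1 = published, 2 = public Q&A, 3 = semi-public, 4 = instrumented
    source: str  # e.g. "blog", "podcast-transcript", "decision-log"
    rights: str  # "public", "licensed", "needs-review"
    text: str

def tier_volumes(docs: list[CorpusDoc]) -> dict[int, int]:
    """Rough word volume per tier. A histogram skewed entirely toward
    tier 1 predicts style-without-substance output later."""
    counts: dict[int, int] = {}
    for d in docs:
        counts[d.tier] = counts.get(d.tier, 0) + len(d.text.split())
    return counts
```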

Layer 2, the base model

The base model choice is a tradeoff between capability, openness, and ability to fine-tune.

The closed frontier models (GPT-4, Claude) are the most capable but the least amenable to deep customization. As of mid-2023, OpenAI doesn't let you fine-tune GPT-4. You can do in-context conditioning with up to 8K tokens (or 32K, for some access tiers), which fits only a fraction of a real corpus. You cannot bake the corpus into the weights.

The open-weights models (LLaMA, Falcon, the various derivatives that have shown up since the leak) are less capable per parameter but fully customizable. You can fine-tune. You can run them on your own hardware. You can keep the resulting model as a private artifact.

For an "encode a person" system, the open-weights path is the only viable one in the medium term. The closed providers haven't built the surfaces you'd need, and waiting for them to is leaving the work on the table. The LLaMA leak from a couple months back is what made this branch viable at all, without it, none of the rest of this post would be writable.

For the experiment, my best guess is: LLaMA-13B base, with the option to upgrade to 33B if it lands in a usable state. Smaller is faster to iterate on; bigger gets better results once you have the pipeline working.
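For a sense of scale, loading a quantized 13B checkpoint is a few lines, assuming the Hugging Face transformers-plus-bitsandbytes route. The model path is a placeholder, not a pointer to any specific release:

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-13b"  # placeholder: wherever your weights live

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    load_in_8bit=True,   # 8-bit quantization: ~13 GB instead of ~26 GB in fp16
    device_map="auto",   # spread layers across whatever GPUs are available
)

prompt = "Q: What is the main tradeoff in choosing a base model?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```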

Layer 3, adaptation

Two main techniques for adapting a base model to a specific corpus, and they solve different problems:

Full fine-tuning. Update all the weights on the corpus. Most expensive, most powerful, requires the most data. The model genuinely learns the distribution of the source material.

LoRA-style adapters. Train a small set of additional weights that modify the base model's behavior. Cheap, fast, composable. Doesn't change the base model, you can swap adapters in and out, stack them, share them.

For this use case, the right answer is probably adapters, for a few reasons. First, the corpora most people have are too small for productive full fine-tuning of a multi-billion-parameter model. Second, adapters are portable, the artifact you're producing is a small file that captures the personalization, with the base model being a separate dependency. Third, the licensing story is cleaner, your adapter doesn't redistribute the base model.

The catch is that adapters are mostly good at style and surface patterns. Getting them to capture deeper reasoning is more of an art than a science right now. This is where the technique stack runs out and the open research starts.
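To make the adapter option concrete, here's roughly what the setup looks like with Hugging Face's peft library. The rank and target modules are common starting values, not tuned ones, and `model` is assumed to be the quantized base model from the previous layer:

```python
# pip install peft
from peft import (LoraConfig, TaskType, get_peft_model,
                  prepare_model_for_int8_training)

# Makes the 8-bit base model safe to train adapters on top of.
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # adapter rank: small = cheap, less expressive
    lora_alpha=16,        # scaling applied to the adapter's output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints something like: trainable params ~6.5M of ~13B (~0.05%)
```

The artifact you ship at the end is just the adapter weights, a file measured in megabytes, which is the portability argument in concrete form.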

Layer 4, retrieval

The model alone, base plus adapter, has whatever pattern signature the corpus imprinted, but it doesn't have access to the actual facts in the corpus during inference. For that you need retrieval.

The standard pattern is: chunk the corpus, embed each chunk with an embedding model, store the embeddings in a vector database (Pinecone, Weaviate, Chroma, pgvector, pick one), and at inference time embed the user query, retrieve the top-K most similar chunks, and inject them into the model's context as grounding.
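In code, the whole loop is short. A sketch using Chroma with its default embedding function; the collection name and chunking are illustrative:

```python
# pip install chromadb
import chromadb

client = chromadb.Client()
collection = client.create_collection("person_corpus")

# Ingest: one entry per chunk. Chroma embeds documents with its
# default embedding function unless you supply your own.
chunks = ["first ~500-token chunk of the corpus...", "second chunk..."]
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    metadatas=[{"tier": 1} for _ in chunks],  # carry the corpus tier along
)

# Query: embed the question, pull the top-K similar chunks,
# and inject them into the prompt as grounding.
results = collection.query(query_texts=["What's their view on X?"], n_results=5)
grounding = "\n\n".join(results["documents"][0])
prompt = f"Context from the source corpus:\n{grounding}\n\nQuestion: ..."
```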

This is the part of the stack that's most mature. The infrastructure works. The integration patterns are documented. RAG (retrieval-augmented generation) has been a working technique for two years and the tooling reflects that.

The catch is that retrieval grounds the model in what the corpus literally contains, it doesn't help the model generalize from the corpus to novel situations. If the user asks something the corpus addresses, retrieval helps. If the user asks something the corpus doesn't address, retrieval pulls in irrelevant chunks and the model falls back to its base behavior.

Layer 5, evaluation

This is the layer that nobody has solved.

The question "did the system answer this novel question the way the source person would have" requires either:

  • The source person to evaluate it themselves (doesn't scale, also creates a circularity problem because they're judging a system trained on their own output)
  • A held-out set of questions where you have the source person's real answer and can compare (works in narrow domains, hard to construct in broad ones)
  • Some proxy metric, output style similarity, factual accuracy on corpus-grounded questions, reasoning-step quality (each captures a slice; none captures the whole)

Without evaluation, you can't tell if you're making progress. You can build the rest of the stack and have no way to know if your improvements are actual improvements. This is the hardest part of the problem and it gets the least attention because it's not as fun as model engineering.

A concrete MVP

If I were to actually build this (or if someone were to), the smallest defensible version looks like:

  • Corpus: Tier 1 plus as much Tier 2 as can be scraped. A few hundred thousand to a few million tokens, depending on the source.
  • Base model: LLaMA-13B running locally. Quantized for inference performance.
  • Adapter: LoRA fine-tune on the full corpus, optimizing for next-token prediction. Standard recipe, no fancy techniques.
  • Retrieval: Chunk the corpus, embed with a standard embedding model (text-embedding-ada-002 if you're paying OpenAI; an open alternative if not), store in Chroma or pgvector.
  • System prompt: Frame the persona, set guardrails (don't make up positions on questions outside the corpus), instruct the model to ground answers in retrieved chunks.
  • Evaluation harness: Hand-construct 20–50 held-out questions where you have the source person's actual answer (from public material the model didn't see). Score for both factual accuracy and stylistic match. Iterate. A sketch of such a harness follows this list.
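A minimal harness under stated assumptions: `ask_system` is a hypothetical placeholder for whatever wraps the adapter-plus-retrieval pipeline, and the scoring uses embedding similarity as a crude stand-in for human judgment, which is exactly the proxy-metric compromise described in Layer 5:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Held-out pairs: questions with the source person's real, unseen answers.
HELD_OUT = [
    {"q": "Question the corpus doesn't cover...", "ref": "Their actual answer..."},
    # ... 20-50 of these, hand-constructed
]

SYSTEM_PROMPT = (
    "You answer as [person]. Ground every claim in the provided context. "
    "If the context doesn't cover the question, say so rather than "
    "inventing a position."
)

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def ask_system(question: str) -> str:
    """Placeholder: retrieve chunks, build the prompt, call the model."""
    raise NotImplementedError

def run_eval() -> float:
    sims = []
    for item in HELD_OUT:
        answer = ask_system(item["q"])
        a, b = scorer.encode([answer, item["ref"]], convert_to_tensor=True)
        sims.append(util.cos_sim(a, b).item())  # closer to 1.0 = closer match
    return sum(sims) / len(sims)
```

Cosine similarity over sentence embeddings rewards topical overlap more than judgment match, so treat the number as a trend line across pipeline iterations, not an absolute score.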

That's not a research-grade system. It's a starting point. It would tell you, within a few weeks of work, whether the pipeline produces anything useful and where the quality ceilings actually are.

What's unsolved

The unsolved problems, in roughly increasing order of how far away they are:

  1. Tier-3 and Tier-4 corpus collection, pipeline tooling, rights management, person-in-the-loop interfaces.
  2. Adapters that capture reasoning, not just style, research-grade work, partial techniques known, no clear winner.
  3. Calibrated uncertainty, knowing when the model is operating from the corpus versus making things up.
  4. Distribution and licensing, once you have an artifact, how does anyone share it, license it, get paid for it.
  5. A real evaluation methodology, the deepest problem and the one that gates every other improvement.

None of these are coming this year. Most of them are coming. Worth tracking which moves first.

The exercise of walking the stack end-to-end is useful even when most of the layers are half-built. It tells you what an MVP can actually do (capture style, retrieve facts, ground answers in source material), what it can't (reason like the source person on novel questions, evaluate itself), and where the most productive engineering attention should go (the corpus and the eval, in that order, because both are bottlenecks for everything downstream).

The foundation is closer than it was six months ago. It's still not close enough to ship. But the components are individually tractable, and the gaps between them are increasingly clear.