Encoding a person: what training data should really be
Three years on from the original encoding-a-person framing, I have a clearer answer to the question that piece left open: what data actually encodes a person, in the sense of producing an AI artifact that's useful as a model of who that person is and how they think? The 2023 piece was speculative. The answer that's emerged from three years of people actually building toward this is less about volume and more about specific kinds of context that compound in specific ways.
Worth pulling this together because the consensus among the people doing the work has shifted enough that an updated framing is overdue.
What the 2023 piece got right
The directional take held. Personal AI does compound with personal context. The accumulation of that context does produce something more useful than any single snapshot. The technical foundation to make it work did mature on roughly the timeline I sketched.
What the piece left vague: which specific kinds of personal context produce the compounding, and which are noise. The answer to that has gotten clearer.
What the data actually has to be
After three years of watching people figure this out in practice, these are the categories of data that encode meaningfully:
Decision artifacts. Not "what you said," but "what you decided." Notes from times you chose between options. Documented rationales for choices. The post-mortem on a failed project. The "I picked X because Y" texts. These compound because they encode preference structure rather than just information.
Step-by-step drafts of the same artifact. Watching a piece of writing or a piece of code go from first draft to final version. The diffs across drafts encode editorial preferences in ways the final version alone doesn't. A model trained on draft-to-final diffs picks up taste signals that aren't visible in any single draft (there's a sketch of extracting that signal after this list).
Reactions to other people's work. Your annotations on someone else's writing. Your comments in code reviews. Your responses to ideas in conversation. These encode what you notice and what you push back against, which is closer to how-you-think than declarations of belief are.
Calibrations after being wrong. Records of times you predicted X and Y happened, with your subsequent recalibration. Predictions in public, graded later, with the explicit "I was wrong about Z because A" follow-up. These encode the meta-pattern of how you update, the part of cognition that's hardest to capture from any other signal.
Working notes from active problems. The inline thinking captured while solving a problem, before the problem is resolved. The half-formed ideas, the abandoned approaches, the questions you asked yourself. Resolution-state writing strips this; pre-resolution writing preserves it.
Long-form correspondence with consistent counterparts. Years of communication with the same people, where the relationship and shared context develop. The shared vocabulary compounds; the shorthand emerges; the implicit knowledge becomes legible.
These six categories produce most of the encoding value. Most personal data outside these categories is filler at best, noise at worst.
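To make the draft-diff idea concrete, here's a minimal sketch of extracting that signal, assuming drafts of one artifact live as numbered files in a single folder; the naming convention and the `draft_diffs` helper are illustrative, not a standard:

```python
import difflib
from pathlib import Path

def draft_diffs(draft_dir: Path) -> list[str]:
    """Collect the edits between consecutive drafts of one artifact.

    Assumes a folder holding numbered versions like 01-draft.md,
    02-draft.md, ... (hypothetical convention; sorted() relies on it).
    """
    drafts = sorted(draft_dir.glob("*.md"))
    diffs = []
    for earlier, later in zip(drafts, drafts[1:]):
        before = earlier.read_text().splitlines()
        after = later.read_text().splitlines()
        # unified_diff yields only the changed lines plus a little context:
        # the editorial moves themselves, not either draft in full.
        diffs.append("\n".join(difflib.unified_diff(
            before, after,
            fromfile=earlier.name, tofile=later.name, lineterm="",
        )))
    return diffs
```

The design point is that the training signal is the edits, not the drafts: the unified diffs isolate what got cut, reordered, and rephrased, which is where the taste lives.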
What the data shouldn't be
The categories that, counter to intuition, don't encode well:
Volume of generic communication. Email volume per se doesn't help. The model gets better at imitating your salutations and worse at modeling your decisions. Quantity of generic text isn't a signal.
Social media output. What you posted publicly is a curated artifact, not raw thinking. It encodes your public-facing pattern, which is partial and often misleading about your actual mental model.
Demographic and biographical facts. Your job, your family structure, your background. Useful for surface plausibility; not useful for actual encoding.
Polished final outputs without the drafts. A book, a deck, a finished essay. The polish hides the step-by-step thinking; the model gets a snapshot rather than a process.
Aggregated activity metrics. Step counts, sleep hours, calendar event counts. They tell you nothing about how the person thinks; a model trained on them learns nothing useful.
Single-point preferences without context. "Prefers blue to red" without "because of X." The preference structure matters; isolated preferences without the structure are noise.
The pattern across the don't-helps: volume without structure doesn't compound. Structured-but-shallow data doesn't compound. The encoding requires data that captures process, decision, calibration, and iteration, not data that captures outputs.
The infrastructure that makes this work
What the people doing this well in early 2026 have built:
Decision logs. Lightweight tools that capture decisions with rationale. Not a journaling app; a structured format the model can index (one possible format is sketched after this list). The teams that maintain decision logs over years have meaningfully more useful AI artifacts than the teams that don't.
Draft preservation. Every meaningful artifact gets its drafts preserved. A simple drafts/ folder structure or a more elaborate version-control workflow. The diffs are the data.
Annotation capture. When you mark up someone else's work, the markup gets preserved. The signal in your reactions to other people's writing is high.
Public-prediction-and-grading discipline. Predictions made in public and graded later are a specific form of calibration data. People who do this well build a record that's hard to construct retroactively.
Working-notes hygiene. The pre-resolution thinking gets saved separately from the post-resolution polish. The working notes are the encoding-relevant data; the polish is for downstream readers.
Long-thread correspondence preservation. Years of email with the same counterparts, structured correspondence, the slow-thread relationships. Preserved with metadata about who, when, and what continuing context.
These aren't exotic. They're the operational discipline that compounds the right kinds of personal data into the kinds that encode meaningfully.
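As a sketch of what "a structured format the model can index" can mean for a decision log, assuming nothing beyond an append-only JSON Lines file; the `log_decision` helper, file location, and field names are hypothetical:

```python
import json
from datetime import date
from pathlib import Path

LOG = Path("decisions.jsonl")  # hypothetical location

def log_decision(question: str, options: list[str],
                 chosen: str, rationale: str) -> None:
    """Append one decision, with its rationale, as a single JSON line.

    Field names are illustrative. What matters is that options, choice,
    and rationale land together in one machine-indexable record.
    """
    entry = {
        "date": date.today().isoformat(),
        "question": question,
        "options": options,
        "chosen": chosen,
        "rationale": rationale,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

# Example: the "I picked X because Y" text, captured at decision time.
log_decision(
    question="Which queue for the ingest pipeline?",
    options=["Kafka", "SQS"],
    chosen="SQS",
    rationale="Throughput is modest and we have no ops capacity for Kafka.",
)
```

Plain JSON Lines keeps the log greppable, diffable, and trivially indexable, which is most of what "structured" needs to mean here.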
What the encoded model is good for
When the encoding works, what does the resulting model do well?
Drafting in your voice. Not just stylistically but structurally. The model that's seen your draft-to-final diffs produces drafts that match your editorial tendencies, not just your vocabulary.
Predicting your decisions. Given a new decision in a familiar shape, the model predicts which option you'd pick with surprising accuracy. Not because it's reading your mind; because the decision-log corpus encodes your decision-making consistently enough (a retrieval sketch follows this list).
Spotting when something feels off. The model trained on your annotations of others' work picks up what you would notice. New material that triggers your would-flag-this pattern gets flagged.
Acting as your devil's advocate. The model trained on your calibration data argues with you in the shape of your past corrections. The internal interrogation that's good for thinking gets a useful externalization in this model.
Maintaining the institutional voice. For organizations that have done this for the org rather than for an individual, the encoded model maintains the org's writing voice across more output than any individual could produce.
These are the durable use cases. Less impressive than "general AI assistant" but more genuinely useful as a thinking partner.
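A minimal sketch of the decision-prediction use case, treated as retrieval over a decision log like the one sketched above. The `embed` function here is a deliberately toy stand-in for a real sentence-embedding model, just to keep the sketch self-contained:

```python
import json
from pathlib import Path

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words vector. A real system would use a
    sentence-embedding model; this placeholder keeps the sketch runnable."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v

def similar_decisions(new_question: str, log: Path, k: int = 5) -> list[dict]:
    """Return the k logged decisions most similar to a new one.

    Handing these records to a model as context is the simple version of
    "predict which option I'd pick": the rationale fields carry the
    preference structure the prediction leans on.
    """
    entries = [json.loads(line) for line in log.read_text().splitlines()]
    query = embed(new_question)

    def score(e: dict) -> float:
        v = embed(e["question"] + " " + e["rationale"])
        # Cosine similarity between the new question and the old record.
        return float(np.dot(query, v) /
                     (np.linalg.norm(query) * np.linalg.norm(v) + 1e-9))

    return sorted(entries, key=score, reverse=True)[:k]
```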
What I'd recommend
For someone interested in building toward genuine personal-context encoding:
- Capture decisions with rationale. Build the habit; build the structured format; preserve the records.
- Preserve drafts. Every meaningful artifact, multiple revisions kept.
- Save your annotations of others' work. Reactions are signal.
- Make predictions in public; grade them in public. The calibration data is hard to construct retroactively (see the sketch after this list for one way to record it).
- Keep working notes separate from polished outputs. The pre-resolution thinking is the encoding-relevant part.
- Don't over-engineer the capture. Simple text files in well-organized folders beat complex tools that you abandon after a month.
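For the prediction-and-grading habit, a sketch under the same append-only JSON Lines assumption as the decision log; the `predict`/`grade` helpers, file location, and field names are hypothetical:

```python
import json
from datetime import date
from pathlib import Path

PREDICTIONS = Path("predictions.jsonl")  # hypothetical location

def predict(claim: str, confidence: float, resolves_by: str) -> None:
    """Record a prediction before the outcome is known."""
    entry = {"made": date.today().isoformat(), "claim": claim,
             "confidence": confidence, "resolves_by": resolves_by,
             "outcome": None, "recalibration": None}
    with PREDICTIONS.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def grade(claim: str, outcome: bool, recalibration: str) -> None:
    """Fill in the outcome and the explicit "I was wrong because" note.

    The recalibration field is the encoding-relevant part: it records
    how you update, not just whether you were right.
    """
    entries = [json.loads(line)
               for line in PREDICTIONS.read_text().splitlines()]
    for e in entries:
        if e["claim"] == claim:
            e["outcome"] = outcome
            e["recalibration"] = recalibration
    PREDICTIONS.write_text("".join(json.dumps(e) + "\n" for e in entries))
```

Recording the prediction and the grade in the same file is the point: the pairing is what makes the calibration data impossible to fake after the fact.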
The 2023 framing left "what data" as an open question. Three years later the answer is clearer: structured data that captures process, decision, calibration, and iteration. Volume doesn't matter much. Structure of the right kinds matters a lot.
The personal AI built on this kind of corpus is meaningfully more useful as a thinking partner than the one built on raw activity logs and email volume. Worth being deliberate about which kind of data you're accumulating; the discipline of capturing the encoding-relevant data is the part of personal AI that compounds the most over time.
Encoding a person is real. The data that does it is specific. Worth being clear about which specific data is the answer to the question the original framing left open.