My biggest AI mistakes of the past 12 months

An honest retrospective. The mistakes I made with AI in 2025-26: over-trusting model output, under-investing in evals, picking the wrong abstractions, skipping verification. The lessons I actually learned, written down so I don't have to learn them again.


Twelve months is enough time for the mistakes to have consequences, and not so much time that I've conveniently forgotten them. So here's the honest list: the things I got wrong about AI between roughly May 2025 and now, what I learned, and what I'm doing differently.

I'm writing it because the AI conversation in public is almost entirely about wins. The demo that worked. The agent that shipped. The benchmark that ticked up. The mistakes I made are not the kind that go in a quarterly all-hands. They're the quieter, more expensive kind, and the only way I get the lessons to stick is to write them down.

1. I over-trusted model output for longer than I should have

The first one is the one I'd hoped I was past, and wasn't.

Through most of 2024 I'd built a healthy skepticism toward model outputs in domains I knew well. Code I could read. Architecture decisions I could pressure-test. Numbers I could check. Then through 2025 the models got noticeably better, and the skepticism eroded in the domains where I couldn't easily check: research summaries, parts of long technical docs the model condensed for me, legal language in documents I was reading for my own purposes.

The mistake wasn't trusting the model. The mistake was letting the trust scale faster than my ability to verify did. I leaned on a model summary of a multi-page technical document and acted on a detail that turned out to be directionally right but materially wrong on the specifics. I caught it on the next pass because I went back to the source. If I hadn't, the decision would have been wrong, and the cost of being wrong on that decision would have grown.

The lesson: model fluency is not the same as model correctness, and the gap between the two widens exactly in the domains where I'm least equipped to notice. The fix isn't to trust models less in general. It's to map, for each domain I work in, whether I have an independent check (a primary source, a tool, a second model with a different failure mode) and to refuse to act on output I can't verify against the source.
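
When the source is a document I already have, the independent check can be mechanical: ask the model to return the exact sentences it relied on, then refuse the summary if any claimed quote isn't actually in the document. A minimal sketch of that idea; `summarize_with_citations` is a hypothetical stand-in for whatever model call you use, not a real API.

```python
import json

def verify_quotes(summary: dict, source_text: str) -> list[str]:
    """Return the claimed quotes that do NOT appear verbatim in the source.

    `summary` is expected to look like:
        {"summary": "...", "quotes": ["exact source sentence 1", ...]}
    where the model was instructed to copy its supporting sentences verbatim.
    """
    normalized_source = " ".join(source_text.split())
    missing = []
    for quote in summary.get("quotes", []):
        if " ".join(quote.split()) not in normalized_source:
            missing.append(quote)
    return missing

# Hypothetical usage, with summarize_with_citations standing in for the model call:
# summary = json.loads(summarize_with_citations(source_text))
# missing = verify_quotes(summary, source_text)
# if missing:
#     raise ValueError(f"Refusing to act on summary; unverifiable quotes: {missing}")
```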

2. I under-invested in evals for too long

I knew I needed evals. I wrote about needing evals. I did not actually build them.

For the first half of 2025 I was shipping agent workflows on my own stack on vibes. Manual spot checks. Eyeballing the outputs. "Looks right." When something broke I'd patch the prompt, run it through the same three test inputs I always used, and ship the fix. The eval surface was three inputs and my own pattern recognition, and that's not an eval surface; that's a guess.

The cost showed up as a slow buildup of regressions. Things that used to work stopped working. Behaviors I'd specifically prompted for drifted out. I didn't notice the drift in any single moment; I noticed it when a workflow that had been reliable for months started producing output that was visibly worse, and I couldn't tell when the degradation had started.

The lesson: evals are the only way to know whether a change is an improvement or a regression. The bare minimum is a few dozen held-out inputs with expected behaviors, run automatically on every prompt or model change, with the results logged. I now treat "I want to change this prompt" as gated on "is there an eval that will tell me whether the change made it better." If there isn't, I build the eval before the change.
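
What that bare minimum can look like in practice, as a rough sketch: `run_workflow` is a hypothetical wrapper around whatever prompt or agent is under test, and each held-out case pairs an input with a predicate that encodes the behavior I expect.

```python
import json
import time
from pathlib import Path

# Held-out cases: each pairs an input with a predicate encoding the expected
# behavior. In practice this is a few dozen entries, not three.
CASES = [
    ("Summarize the attached changelog.", lambda out: len(out) < 1200),
    ("Extract the invoice total.",        lambda out: out.strip().startswith("$")),
    ("List the action items.",            lambda out: out.count("\n") >= 2),
]

def run_evals(run_workflow, prompt_version: str, log_path: str = "evals.jsonl") -> float:
    """Run every held-out case, append results to a JSONL log, return the pass rate."""
    results = []
    for input_text, check in CASES:
        output = run_workflow(prompt_version, input_text)
        results.append({
            "ts": time.time(),
            "prompt_version": prompt_version,
            "input": input_text,
            "output": output,
            "passed": bool(check(output)),
        })
    with Path(log_path).open("a") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return sum(r["passed"] for r in results) / len(results)
```

Gate the prompt change on the pass rate not dropping, and the "when did the degradation start" question answers itself from the log.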

3. I picked the wrong abstraction layer, twice

I built two pieces of my own AI stack in 2025 on abstractions I had to rip out.

The first was a workflow framework I was excited about that turned out to be built for a different shape of problem than the one I had. I spent three weeks fighting it before I admitted that the fight wasn't going to end. The second was an agent orchestration layer I built on top of a vendor SDK that, six months later, the vendor deprecated in favor of a different model. Both rewrites were expensive in the most expensive currency I have, weekend time.

The deeper mistake was the same in both cases: I picked the abstraction because it was the most exciting one, not because it matched the problem. The exciting framework had the best demos and the most active community; the agent SDK was the one everyone was writing about. Neither was the right fit for what I was actually building.

The lesson: pick abstractions based on the shape of your problem and the lifetime of your investment, not on which one is most fashionable when you start. The boring choice (fewer dependencies, simpler primitives, easier to replace) was the right choice both times. I've now defaulted to picking the lower-magic abstraction unless I have a specific reason to take on the dependency. (Building agents inside an MCP-only architecture is the version of this that did work.)

4. I ignored verification on outputs I "knew" were right

The most embarrassing mistake. I had verification machinery: tool-call logging, output diffing, the whole thing I'd written about caring about. And I skipped it on the workflows I felt confident in. The ones where I "knew" the model was going to do the right thing.

The model did the right thing most of the time. The times it didn't, I had no record of what had gone wrong, because I'd turned off the logging to save on noise. When an agent in my homelab started producing a malformed output one morning in a way I couldn't reproduce, I had nothing to go on. I had to turn the logging back on, wait for it to happen again, and only then could I diagnose it. A week of intermittent weirdness I could have skipped if I'd kept the floor where it belonged.

The lesson: verification is not a thing you turn on for the unfamiliar workflows and off for the familiar ones. It's the floor. The familiar workflows are exactly the ones where I'm most likely to miss a regression, because I'm not looking. I now log every tool call, every prompt input, every structured output, on every workflow, by default. (Why I log every tool call now is the longer version of why.)
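
The default doesn't need to be elaborate. A sketch of the kind of floor I mean, as a decorator around tool functions; the names are illustrative, not from any particular framework. Every call, its arguments, and its output or exception get appended to a JSONL file, whether or not the workflow feels familiar.

```python
import functools
import json
import time
import traceback
from pathlib import Path

LOG_PATH = Path("tool_calls.jsonl")

def logged_tool(fn):
    """Wrap a tool function so every call is recorded, success or failure."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"ts": time.time(), "tool": fn.__name__,
                  "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = fn(*args, **kwargs)
            record["result"] = repr(result)
            return result
        except Exception:
            record["error"] = traceback.format_exc()
            raise
        finally:
            with LOG_PATH.open("a") as f:
                f.write(json.dumps(record) + "\n")
    return wrapper

@logged_tool
def fetch_calendar(day: str) -> list[str]:
    # Illustrative tool; the point is the wrapper, not the tool.
    return [f"{day}: standup", f"{day}: review"]
```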

5. I confused capability with reliability

A capability demo is a model doing the thing once, under conditions you set up. Reliability is the model doing the thing the hundredth time, under conditions you didn't anticipate.

I built two pieces of my AI workflow in 2025 where I'd confirmed the capability and moved straight to relying on them without confirming the reliability. Both worked in the demo. Both failed in unattended use in ways that were obvious in hindsight: one couldn't handle inputs that were just slightly off the shape I'd tested; the other degraded silently when the underlying model version got swapped behind the API.

The lesson: capability is necessary and not sufficient. The thing I now budget for, on every AI-built workflow, is the reliability work: the failure-mode mapping, the stress testing on adversarial inputs, the version-pinning, the regression suite. That work takes longer than the capability work. I keep underestimating it. I'm trying to stop.
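
Two of the cheapest pieces of that work, sketched below under assumptions: pin an explicit model version instead of a floating alias, so a silent swap behind the API shows up as a deliberate diff, and keep a small regression suite of deliberately off-shape inputs. `extract_invoice` is a placeholder for the workflow under test, supplied here as a pytest fixture.

```python
import pytest

# Pin an explicit model version rather than a floating alias, so a silent
# server-side swap shows up as a deliberate diff in your own config.
MODEL = "vendor-model-2025-06-01"  # placeholder version string

# Inputs deliberately off the shape the happy-path demo was built on.
OFF_SHAPE_INPUTS = [
    "",                                    # empty
    "total due: forty-two dollars",        # amount spelled out
    "TOTAL\n\n$1,204.50\n(see attached)",  # odd whitespace and punctuation
    "facture : 1 204,50 €",                # different language and locale
]

@pytest.mark.parametrize("raw", OFF_SHAPE_INPUTS)
def test_workflow_degrades_loudly(raw, extract_invoice):
    # extract_invoice is the workflow under test (a pytest fixture here);
    # it must return a parseable amount or an explicit refusal,
    # never a confident-looking guess.
    result = extract_invoice(raw)
    assert result["status"] in {"ok", "could_not_parse"}
    if result["status"] == "ok":
        assert isinstance(result["amount"], float)
```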

6. I let the AI talk me out of my own judgment

This is the one I'm most cautious about putting in print, because it sounds like a small thing and it isn't.

There were a handful of decisions I made in 2025 where the model and I disagreed, and the model's argument was more articulate than mine, and I deferred. In two of those cases the model was wrong and I knew it was wrong and I went along anyway because the argument sounded clean. One was a technical architecture call where the model's preferred shape would have been clearly worse for the constraints I knew about and the model didn't. One was an editorial call on a piece of my own writing: the model rephrased a paragraph in a way that sanded off the specific point I'd been trying to make, and I let the smoother version through because it read better in isolation. I had to find the original and put the point back.

The mistake wasn't using the model for input. The mistake was treating the model's fluency as evidence that it had thought about the thing harder than I had. It hadn't. It had thought about it more articulately. Those aren't the same.

The lesson connects to the individual-cognition-as-IP point I've been writing about. The cognitive work I bring to a decision, the decades of accumulated judgment about systems, the specific intent behind a sentence, whatever, is the thing I have that the model doesn't. Outsourcing the judgment to the more articulate participant in the conversation is exactly the move that gives up the thing I should be holding onto.

What I'm doing differently

A short list, because the lessons are only worth writing down if they translate into different behavior.

  • Verification floor. Every workflow has logging on, every prompt change is gated on an eval, every output I act on has an independent check or I don't act on it.
  • Abstraction discipline. Lower-magic by default, replace-ability over excitement, no fashionable dependencies without a written reason.
  • Reliability budget. Every AI capability I add to the stack gets at least as much reliability work as capability work, and usually more.
  • Judgment hold. When the model and I disagree, I write down my reasoning before I read the model's response again. If I still think I'm right, I act on my read, not the model's.
  • Mistakes log. This essay is the first entry in something I should have started a year ago. The mistakes I write down are the mistakes I don't repeat. The ones I don't write down, I do.

The honest part

None of these mistakes are exotic. They're the standard ones, in the standard order, made by someone who knew better and made them anyway because the AI was good enough that the shortcuts felt safe.

The AI is good. The shortcuts are still not safe. The work that gets put off (evals, verification, reliability, holding onto your own judgment) is the work that compounds. The work that gets done in its place (vibes-based shipping, fluency-based deference, fashion-based abstraction picks) is the work that produces the mistakes I just listed.

I'd rather write the list than repeat it. So here it is, with the year's name on it, so future me has something to compare against.