The first thirty days of an AI inside my editor
Thirty days of letting an agent actually drive (write the code, run the tests, fix the failures) and then noticing the patterns in what changed. Not the productivity story. The discipline story.
Thirty days ago I made a small commitment: for one full sprint of personal-project work, the agent would actually drive. Not auto-complete suggestions. Not the chat sidebar I tab over to when I'm stuck. Real driving: write the change, run the tests, fix the failures, propose the commit. I'd review and steer. I wouldn't write the lines myself unless the agent had genuinely run out of useful next moves. The shape that experiment took ended up looking a lot like the terminal-hosted agent model that Claude Code formalized this week, but the experiment started before that product existed, on a stack that was already converging toward the same shape.
The point of the experiment wasn't to measure productivity. Productivity numbers from this kind of A/B are noisy and personality-dependent and mostly tell you what the person already wanted to believe. The point was to notice what changed about the rhythm of work, and what discipline got harder when the lines weren't coming out of my own keyboard.
Worth writing down what landed before the patterns get smoothed over by familiarity.
What changed in the rhythm
The biggest change is that the unit of attention shifted from the line to the diff. When I write code myself, I'm looking at the cursor: what comes next, what symbol fits here, what's the right name for this variable. When the agent is driving, I'm looking at the diff: what's the shape of the change, does it fit the architecture, is there a pattern here that should be applied elsewhere. The atomic act of programming moved up a level of abstraction, and most of the day became reviewing rather than typing.
The good version of this is that I spent more time on architecture and less on boilerplate. The boring side of the day (the API client wrappers, the test scaffolding, the schema migrations, the JSON-shape-to-typed-class translations) got handed off in a way that didn't feel like cutting corners. The result was usually fine. When it wasn't, I noticed faster than I would have if I'd typed it myself, because reading is a different cognitive mode than writing.
The bad version of this is that the diff-attention mode slides toward shallow review when the diff is large, the day is long, or the agent is being confident. A 200-line change that looks plausible at first glance doesn't get the line-by-line attention a 200-line change you wrote yourself would have. The agent is good enough that "looks plausible" is approximately right most of the time. The cost is the small fraction of the time when "looks plausible" is wrong, and the wrongness was load-bearing, and you don't catch it until something downstream breaks.
Where the trust line lands
A few weeks in, I had a working heuristic for when to trust the agent's output more or less. It's not very surprising (there's a rough sketch of it as code after the list):
- Trust higher when the change is local, the test coverage is real, and the code shape is something the language tooling has lots of examples of. Refactoring inside a single file. Adding a new endpoint that mirrors three existing ones. Wiring up a library call where the library is well-documented.
- Trust lower when the change crosses module boundaries, the test coverage is thin, the API is unusual, or the work is more about picking the right design than implementing a chosen design. Splitting a service. Designing a cache layer. Anything where the right answer depends on knowing why the existing code is the way it is.
- Trust lowest when the agent is pattern-matching to something it's seen a lot of and the situation is just different enough that the pattern doesn't apply. This is the failure mode that's hardest to catch from the outside. The change looks like working code in the dialect the agent learned. It just isn't right for this codebase.
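The heuristic is worth writing down as code, if only to see how little of it is computable. A purely illustrative sketch: every field and name below is invented for this post, and every one of them is a judgment call I make while reading the diff, not something any tool measures.

```python
from dataclasses import dataclass

# Each field is a judgment call made while reading the diff, not a metric
# any tool computes. The names are invented for this sketch.
@dataclass
class Change:
    is_local: bool                 # confined to one file or one module
    tests_are_real: bool           # coverage that would actually catch a break
    shape_is_well_trodden: bool    # lots of examples of this pattern in the wild
    pattern_matched: bool          # looks like something the agent has seen a lot
    pattern_applies: bool          # ...and that pattern actually fits this codebase

def trust(change: Change) -> str:
    # Hardest case first: plausible-looking code in the learned dialect
    # that doesn't fit here. Invisible from the outside of the diff.
    if change.pattern_matched and not change.pattern_applies:
        return "lowest"
    # The easy wins: local, covered, conventional.
    if change.is_local and change.tests_are_real and change.shape_is_well_trodden:
        return "higher"
    # Everything else: cross-boundary work, thin tests, design-picking.
    return "lower"
```

The only interesting thing in the sketch is the ordering: the pattern-match check comes first because it's the case a plausible-looking diff hides best.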
The thing this maps onto, eventually, is the same trust-allocation work I'd do with a junior engineer. Which is to say: the agent isn't doing something different from what a person would do. It's doing the same thing faster and with different blind spots.
What got harder
A few specific disciplines got harder during the experiment. Worth being honest about which ones.
Reading my own code in a week's time. Code that the agent wrote, that I reviewed and approved, looks subtly less familiar a week later than code I wrote myself. It's still mine. I approved it, I shipped it, I own it. But the muscle memory for why this exact line looks the way it does is weaker. Going back to debug it is a fractionally slower start than it would have been otherwise. Not enough to regret the trade, but real.
Resisting the agent's confidence on architectural decisions. The agent will offer an opinion on the right way to structure a thing, and the opinion is usually defensible. It's frequently not the best answer for the specific code, but it's a fine answer. The pull toward "well, that's reasonable, let's go with it" was stronger than I expected. Getting deliberate about pushing back when I had a better answer required real intent.
Keeping the test suite honest. When the agent writes the implementation, the agent often writes the tests too. The risk is that the tests are written to the implementation rather than to the spec. Tests that pass against an implementation that should fail are worse than no tests. Catching this required being explicit (write the tests first, then the implementation), and even then it took attention.
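A concrete miniature of that failure mode, with a hypothetical parse_duration helper invented for the sketch. The first test was "written" by running the implementation and copying its output into the assertion; the second was written from the spec before any implementation existed.

```python
import re

def parse_duration(s: str) -> int:
    """Hypothetical helper: '1h30m' -> seconds. Buggy on purpose: drops minutes."""
    hours = re.search(r"(\d+)h", s)
    return int(hours.group(1)) * 3600 if hours else 0

# Test written *to the implementation*: the expected value (3600) was copied
# from what the code actually returned, so the bug is baked into the suite.
def test_written_to_implementation():
    assert parse_duration("1h30m") == 3600  # passes; the bug survives

# Test written *to the spec*, before any implementation existed: the spec
# says "1h30m" is 5400 seconds. This one catches the dropped minutes.
def test_written_to_spec():
    assert parse_duration("1h30m") == 5400  # fails, which is the point
```

The write-the-tests-first discipline exists to force the second shape, because both tests look equally reasonable in a diff.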
Knowing when to start over. When the agent gets a task wrong by enough that the right move is "discard everything and re-prompt with a different framing," it's tempting to try to steer it back instead. The steering usually wastes more time than the discard would. Recognizing the moment to throw away the partial work was a skill I had to deliberately practice.
What got easier
The flip side, which is easier to undersell:
The friction of starting got dramatically lower. Tasks that would have taken twenty minutes to scaffold (set up a new test file, get the imports right, stub the function signatures) happen in a couple of conversational turns. The mental cost of "I should write this small utility" or "I should clean up this directory" is now small enough that I do it. The cumulative effect of doing the small useful things you used to put off because the activation energy was too high is not nothing.
Documentation. Comments. README updates. Commit messages. The category of "things every engineer agrees should be done and most engineers do less of than they should" gets done now, almost incidentally, because the agent is happy to write the first draft and the marginal cost of editing the draft is lower than the marginal cost of writing it from a blank page.
The work I'm worse at (anything outside my actual specialty) got more accessible in a way that has its own risks but is mostly a positive. Writing a frontend feature in a framework I haven't used in three years, calling an API I've never touched, debugging an obscure deployment failure in a language I read but don't write: for all of these, the agent makes "I'll just take a swing at it" a reasonable first move. Whether that's good for my long-term skill range is a separate question.
What this isn't, despite the framing
Worth being explicit: this isn't a "the agents are coming" piece. The category has been here for a while, in shapes that ranged from in-line completion to chat sidebars to the terminal-hosted form that's dominant today. I wrote a couple of years back about ChatGPT plugins as an early signal that this category was about to be load-bearing for how engineers work. The thirty-day experiment was about what happens after you stop debating whether to use the tool and start figuring out how to use it well. The arrival is no longer the story. The integration discipline is.
The thing the experiment didn't settle
What thirty days didn't tell me is whether the rhythm I've landed on is the right one or just the one I happened to optimize toward. There's a version of this experiment where the agent does more (drafts the whole feature, including the tests and the docs and the deployment scripts) and I just review and merge. There's a version where the agent does less (proposes specific changes that I evaluate and either accept or write myself) and I keep more direct contact with the code.
The middle position I ended up at is comfortable, and a meaningful fraction of comfort is just "this is what I'm used to." Whether the durable equilibrium is here or somewhere else is something the next thirty days, and the thirty after that, are going to keep teaching me. The shape of working with an agent is still being learned in public. Mostly what these thirty days proved is that the answer isn't "no," and that "yes, here's the new normal" is going to take more than a sprint to arrive at honestly.