In February, I wrote about my early experience using an AI assistant for Test-Driven Development (TDD). This is a follow-up. It may become a series; we'll see. A lot has happened in the six weeks since that post, and I have more to say.
Bug reports in plain English
Most of the bugs we deal with in production arrive with two pieces of information: the request object (what something was called with) and a stack trace from the backend. That's usually enough to understand what went wrong on the server side. The harder part is connecting that to the right place in the codebase and understanding why.
I started pasting these bug reports directly into my coding assistant and asking a simple question: why did this happen?
The assistant has the full codebase as context. It can look at the request, read the stack trace, and cross-reference them against the relevant code simultaneously. So far, it has pointed to the likely root cause within a few minutes every time. That said, the assistant's answer isn't always the complete picture; the developer fixing the bug still has to fill in the gaps from experience. But getting to a strong hypothesis in a few minutes, rather than spending fifteen to twenty minutes building context manually, is a real improvement.
Flaky tests
There's a certain kind of bug that's almost worse than a hard crash: the flaky test. It passes nine times out of ten. It fails on a developer's machine on a Friday afternoon. You start to doubt yourself. Is it a race condition? A timing issue? You re-run the suite, it goes green, and you ship, telling yourself you'll look into it later.
Flaky tests are dangerous because they destroy trust in your test suite. Once developers learn that a test fails "sometimes for no reason," they start ignoring failures. That's a bad place to be. A test suite people don't trust is worse than no test suite at all, because it creates false confidence while hiding real problems.
The hard part is that flaky tests are, by definition, hard to reproduce. The bug is there. You just keep getting lucky.
Letting the assistant do the repetitive work
My approach was simple: I asked my coding assistant to run the build 20 consecutive times and report back any failures.
Each run took about three minutes on my machine, so this was roughly an hour of work I was effectively delegating. Not glamorous work. Just repetitive, patient execution that I would never have done manually. Who runs their full test suite 20 times in a row on purpose?
The hardest part turned out to be keeping my Mac awake for the duration. That was solved with caffeinate.
The assistant did it without complaint. And on those 20 runs, it caught two failures.
What it found
Both failures were in React components. Both came down to the same class of issue: improper cleanup when the component unmounted.
In React class components, when you set up subscriptions, event listeners, timers, or async operations, you're responsible for tearing them down when the component is removed from the DOM. The place to do that is componentWillUnmount. It's easy to forget. It's easy to do partially. And the consequence isn't always an immediate crash. More often it's a subtle memory leak or a state update attempting to happen on an unmounted component.
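The pattern looks something like this. The class below is a sketch, not our actual component — it isn't wired to React, so it stays self-contained — but the lifecycle method names match React's, and the missing clearInterval is exactly the class of bug we had.

```javascript
// Hypothetical component with a periodic timer. The bug in both of our
// flaky components was the absence of the cleanup in componentWillUnmount.
class Ticker {
  componentDidMount() {
    // Set up periodic work when the component appears.
    this.timer = setInterval(() => this.tick(), 1000);
  }

  tick() {
    // Update some state on each interval (details don't matter here).
  }

  componentWillUnmount() {
    // The easy-to-forget part: without this, the timer outlives the
    // component and keeps firing against a component that no longer exists.
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

Forget the last method, and nothing visibly breaks on a single run — which is precisely why the bug survives.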
In our case, both components had a timer that should have been removed when they unmounted, but unfortunately wasn't. Under normal test conditions, running the suite once in sequence, this didn't matter. The test passed before the ghost of the previous component could interfere. But on the 7th run, or the 12th, or under slightly different timing conditions, the leftover state from a previous test would bleed into the next one, and something would fail.
We are a small team. We trust each other and we don't do pull requests or formal code review. These bugs could theoretically have been spotted by reading the code carefully, but in practice, this class of mistake is exactly the kind of thing that slips through. It only reveals itself under repetition, and nobody reads code while mentally simulating 20 consecutive runs.
Baby steps are still good
In my first post I mentioned that I really have to use baby steps. Six weeks on, I want to say it louder.
When I've let the assistant implement larger features in one go, the output is often technically impressive and hard to review. A large diff touching many files at once is difficult to reason about even when each individual change is sensible. I'd find myself scrolling through hundreds of lines of proposed code, losing the thread, missing small details that matter. That's not a good place to be.
The problem isn't that the assistant makes mistakes. It's that a big batch of changes makes mistakes harder to catch. Reviewing code takes concentration, and the more code there is to review, the easier it is to miss something subtle.
So I've started forcing smaller, more deliberate steps. Implement this one thing. Stop. Let me read it, run the build, make sure it makes sense. I commit. Then move on to the next piece. At the end of the workflow I have a refactoring pass, and sometimes it touches code I've already committed. I don't mind. Clean code is clean code, and a dedicated refactoring step at the end beats trying to get the structure perfect on the first attempt.
AI coding tools move fast. You still have to know where you're going.
TDD: the school still matters
Test-driven development remains my preferred way of working. The red-green-refactor cycle is how I have worked for more than two decades. That hasn't changed.
What has changed is a subtle but persistent friction around how the assistant approaches tests.
There are two main schools of TDD. The Chicago school (sometimes called classic TDD) tests behaviour by working through real implementations, only substituting fakes at true system boundaries like external APIs, databases, and the clock. The London school (sometimes called mockist TDD) takes a more aggressive approach to isolation, mocking out collaborators freely so that each unit is tested entirely on its own. Both are legitimate; they reflect different philosophies about what a unit test is actually for.
I prefer the London school. The practical difference in my workflow is about direction: I use the controller to drive the service and finally the database. I use mocks sometimes, but most of the time I drive the design this way. At the system boundary, typically the database, I use a hand-rolled in-memory stub. The stub is flexible and malleable, trivial to change, so I can experiment at very low cost. I make sure there are good contract tests that verify the stub and the real SQL implementation behave the same way. My assistant, by contrast, prefers to start at the database and work upwards. It keeps wanting to build from the bottom up, and I keep wanting to drive the design from the outside in.
Changing this default behaviour has proven difficult, and so far I haven't fully succeeded. Still, we move fast enough that it doesn't slow us down much. If you have strong TDD opinions, expect to push back, and don't assume the tool will naturally match your instincts.
Why this matters
What struck me wasn't just that the assistant found the bugs. It's that it found bugs I had no reason to go looking for. The components worked. Tests passed. Users weren't reporting issues. The bugs were invisible until something was willing to run the same thing 20 times and notice the pattern.
The obvious selling point of AI tooling is speed. Write code faster, get boilerplate generated, have something explain an unfamiliar API. All of that is real and useful. But I think the subtler value is in the patient, unglamorous work that requires no creativity but enormous persistence.
Running a build 20 times. Tracing a bug report across a request object and a backend stack trace. These are tasks that humans are bad at, not because they're hard, but because they're boring. We get bored. We start skimming. We convince ourselves it's probably fine.
The assistant doesn't get bored.
A patient collaborator
A lot of subtle bugs and technical debt survive in codebases because the cost of dealing with them is high. Reproducing a flaky test reliably enough to debug it might take hours of wall-clock time. Tracing a production bug across the full stack might mean 30 minutes of context-building before you even form a hypothesis. Most of the time, the pragmatic decision is to move on. The assistant changes that equation.
The bug fixes themselves were small: a missing cleanup in componentWillUnmount, a timer that outlived its component. Each fix was a few lines. But the tests are now reliable.
Technical debt works the same way. It sticks around not because nobody knows about it, but because the cost of fixing each small thing adds up to more time than anyone can justify. Writing the characterizing tests, tracing through the tangled code, verifying that nothing breaks. The assistant makes that cost much lower. It can help write the safety net and do the tedious tracing, which means cleanup that used to sit on the backlog for months actually gets done.
And when it comes to building new features? Break the work into pieces. Stay in the loop at every step. Keep a refactoring pass at the end. Push back when the tool's instincts don't match yours. The judgment still has to be yours.
Acknowledgements
I would like to thank Malin Ekholm for feedback.
Resources
- TDD with an AI assistant - the first post in this series
- Test-Driven Development: By Example - Kent Beck, the book that introduced TDD to many of us
- Growing Object-Oriented Software, Guided by Tests - Steve Freeman and Nat Pryce, the foundation of the London school
- Thomas Sundberg - the author