This is the sixth post in my series about working with an AI coding assistant. The earlier posts covered getting started, observations, rules, results, and communication. This one is about the gap between a rule the assistant might follow and a constraint it cannot get around.

A rule is a suggestion

In "Spell It Out" I wrote down a pile of rules for my assistant. Good rules, most of them. The trouble is that a rule in a config file is read by a machine that guesses. It follows the rule most of the time and ignores it some of the time. A rule the guessing machine may or may not obey is a suggestion, not a constraint.

Some people go further and write prompts like "don't make any mistakes." I understand the instinct, but look at what it actually asks. A mistake measured against what? According to which judge? With nothing to check against, "don't make mistakes" gives the assistant nothing it can act on. It is about as useful as no instruction at all.

Without a goal, any road will do

There is a passage in Alice's Adventures in Wonderland that I quoted years ago in "TDD without a goal is hard." It belongs here too.

"Would you tell me, please, which way I ought to go from here?"

"That depends a good deal on where you want to get to," said the Cat.

"I don't much care where—" said Alice.

"Then it doesn't matter which way you go," said the Cat.

"—so long as I get somewhere," Alice added as an explanation.

"Oh, you're sure to do that," said the Cat, "if you only walk long enough."

― Lewis Carroll, Alice's Adventures in Wonderland

That is an assistant working from a vague prompt. No destination, so one direction is as good as another, and off it walks. You are sure to get somewhere if it runs long enough. Just not the somewhere your customer is paying for. Back in 2019 I wrote that TDD without a goal is hard. Today I would say it is impossible. If you cannot say where you want to end up, nothing the assistant does next can be right or wrong.

A test is a judge that does not negotiate

A failing test is a different kind of thing from a rule. It names the judge. Red is red, the build is green or it is not, and the assistant cannot argue its way past either. That is the constraint a written rule never was.

Once the destination is fixed, the Cat's answer reads the other way around. Now I am happy to let the assistant take any road, because every road that ends at a green build ends where I wanted to be. The test is the goal written down in a form the machine can hold itself to.

Writing the test first does more than check the result afterwards. It bends the solution while it is being written. The picture I keep coming back to is from maths, fitting a curve to data. You run an experiment, you get a handful of points, and you look for a function that passes through them. With two points almost any curve fits. Add more points and the curve has less room to wander. Tests are those points. Each one nails the solution to another spot it has to pass through, and enough of them leave only a narrow band of implementations that satisfy all of them at once.
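To make the curve-fitting picture concrete, here is a minimal Java sketch. Everything in it is invented for illustration: the class TestsAsPoints, the three candidate implementations, and the survivors helper. It treats each expected input/output pair of a small label function as one data point and lists which candidates still pass through all of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical illustration: each test is a point the implementation must pass through.
class TestsAsPoints {

    // Candidate 1: ignores the input entirely.
    static String alwaysItem(int count) {
        return "item";
    }

    // Candidate 2: plural only above one, so zero comes out singular.
    static String pluralAboveOne(int count) {
        if (count > 1) {
            return "items";
        }
        return "item";
    }

    // Candidate 3: singular only for exactly one.
    static String singularOnlyForOne(int count) {
        if (count == 1) {
            return "item";
        }
        return "items";
    }

    static Map<String, IntFunction<String>> candidates() {
        Map<String, IntFunction<String>> all = new LinkedHashMap<>();
        all.put("alwaysItem", TestsAsPoints::alwaysItem);
        all.put("pluralAboveOne", TestsAsPoints::pluralAboveOne);
        all.put("singularOnlyForOne", TestsAsPoints::singularOnlyForOne);
        return all;
    }

    // Returns the names of the candidates that match every expected point.
    static List<String> survivors(Map<Integer, String> points) {
        List<String> passing = new ArrayList<>();
        for (Map.Entry<String, IntFunction<String>> candidate : candidates().entrySet()) {
            boolean fitsAll = true;
            for (Map.Entry<Integer, String> point : points.entrySet()) {
                String actual = candidate.getValue().apply(point.getKey());
                if (!actual.equals(point.getValue())) {
                    fitsAll = false;
                }
            }
            if (fitsAll) {
                passing.add(candidate.getKey());
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        Map<Integer, String> points = new LinkedHashMap<>();
        points.put(1, "item");
        System.out.println(survivors(points)); // all three candidates fit
        points.put(2, "items");
        System.out.println(survivors(points)); // alwaysItem drops out
        points.put(0, "items");
        System.out.println(survivors(points)); // only singularOnlyForOne is left
    }
}
```

With one point all three candidates survive, with two points two remain, and with three points only one is left. Real tests narrow a much larger space, but the mechanism is the same.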

This idea has a name

The name is a fitness function. It comes from maths and from evolutionary algorithms, where it measures how close a candidate solution is to the one you are after. Curve fitting is the everyday version, scoring how well a candidate matches the data. The book Building Evolutionary Architectures brought the term into software, and Dave Farley has been talking about it lately in the context of AI. In one of his recent videos he puts it plainly:

You give a very precise specification that can act as a fitness function for the system that you want to build. And then you get it to match the fitness function.

― Dave Farley, Modern Software Engineering

So I am not the only one arriving here. A guest on the same show said out loud the thing I had been circling: who cares how the details are implemented? That sounds reckless if you are used to reading every line, but we already trust code we never read. Nobody inspects the assembly their compiler produces. We trust it because the behaviour is tested and the behaviour is right. Code from an assistant is heading the same way.

I still look, because I am a developer and I want to see what was built, and now and then there is something odd in there, no worse than what I write on a bad day. The exact shape of it just matters less than whether my fitness function pins it down. Maintainability is part of that function too. Zero warnings from the IDE, a cap on method length, no swallowed exceptions. The ones that can be checked automatically are the ones that hold.

Turn rules into guard rails

A small example. I don't like ternary expressions. A ternary is the compact one-line conditional, the part with a question mark and a colon:

String label = count == 1 ? "item" : "items";

The same logic written as a plain if statement runs a few lines longer, and to my eye it reads easier.

String label;
if (count == 1) {
    label = "item";
} else {
    label = "items";
}

So I added a rule against ternaries. Advisory, and the assistant still reaches for one now and then. I push back when I spot them. It is annoying.

The better move is to stop asking. Add a linter check that fails the build on a ternary. It is no longer a request, it is a wall. The build was green before I switched the check on, so the task is easy to state: it worked a moment ago, make it work again. The assistant rewrites the ternaries into plain if statements and the build comes back green.

That only works because the tests have my back. Turning a ternary into an if statement is a refactoring, and a refactoring without tests is just editing code and hoping. With the tests in place, the linter and the assistant can settle it between them while I do something else.
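For a feel of what such a check does, here is a deliberately naive Java sketch. The class TernaryCheck and its method are hypothetical, not the linter I use. A real build would lean on an existing tool, and as far as I know Checkstyle ships a check called AvoidInlineConditionals for exactly this. The sketch only looks at one line at a time and will misfire on some valid code, as the comments note.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, naive ternary detector. A sketch of the idea, not a real linter.
class TernaryCheck {

    // A double-quoted string literal, including escaped characters inside it.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // A '?' followed later on the same line by a ':' is treated as a ternary.
    // Known limits: ternaries split across lines are missed, and a for-each
    // over a wildcard generic such as "for (List<?> x : xs)" is a false positive.
    private static final Pattern TERNARY = Pattern.compile("\\?[^:]*:");

    // Returns the 1-based line numbers that appear to contain a ternary.
    static List<Integer> findTernaryLines(String source) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            // Blank out string literals so a '?' or ':' inside text is ignored.
            String code = STRING_LITERAL.matcher(lines[i]).replaceAll("\"\"");
            // Drop a trailing line comment for the same reason.
            int comment = code.indexOf("//");
            if (comment >= 0) {
                code = code.substring(0, comment);
            }
            if (TERNARY.matcher(code).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "String label = count == 1 ? \"item\" : \"items\";\n"
                + "int total = count + 1;";
        List<Integer> hits = findTernaryLines(sample);
        // A build step would fail here when the list is not empty.
        System.out.println("ternaries on lines: " + hits); // ternaries on lines: [1]
    }
}
```

Wired into the build, a non-empty result means red, and that is all the assistant needs to know.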

That loop is the whole pattern, and it reaches well past programming. Kathlén Kohn, a mathematician at KTH who studies why deep learning works, gave the same advice in a recent lecture. Treat what these systems produce as ideas to verify, not finished answers, and run the AI next to a verification system that checks the result, around and around with no human in the middle, until the check is satisfied. Her examples ran from filling in a tax return to writing code to designing chips against a specification. Her verification system is the fitness function under another name.

That is the move in general. Any rule you can turn into an automated check should become one. You stop hoping the assistant remembers and let the build decide.
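The loop itself fits in a few lines. This is a minimal Java sketch under my own assumptions: VerifyLoop and untilVerified are made-up names, the generator stands in for the assistant, and the check stands in for the build. A real setup would also feed the failure message back to the generator, which this sketch leaves out.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical generate-and-verify loop: propose, check, repeat.
class VerifyLoop {

    // Asks the generator for candidates until one passes the check
    // or the attempt budget runs out.
    static <T> Optional<T> untilVerified(Supplier<T> generator,
                                         Predicate<T> check,
                                         int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T candidate = generator.get();
            if (check.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in "assistant": proposes 1, 2, 3, ... in turn.
        int[] next = {0};
        Supplier<Integer> proposals = () -> ++next[0];
        // Stand-in "build": green only for a number whose square is 49.
        Predicate<Integer> build = n -> n * n == 49;
        System.out.println(untilVerified(proposals, build, 10)); // Optional[7]
    }
}
```

The attempt budget is a design choice on my part. Without it, a check that can never be satisfied keeps the loop running forever.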

The real fitness function is a happy customer

Tests, linters, warnings: all of them are stand-ins for the one thing I actually care about. Does the software solve the customer's problem well enough that they are happy? That is the fitness function that counts, and the rest exist to keep the assistant pointed at it. Good enough is a real bar, not a lazy one. It also stops me polishing code nobody asked me to polish.

Which puts the job back where it has always been. I handed the typing to the assistant and kept the thinking for myself. Understand the problem, decide what done looks like, write that down as something a machine can check, then deliver it. That was always the hard part, and Farley says the same: we are explorers of problems first. Kent Beck gets there from the economics. As code gets cheap, the value moves to understanding what to build and how the pieces fit together. The assistant does not do it for us.

Resources