You Cannot Out-Build Your Grader

The idea of a software factory is old and keeps coming back. Every decade or so someone proposes that we stop making software the way artisans make furniture and start making it the way manufacturers make everything else: standard processes, reusable parts, an assembly line, output you scale by adding capacity rather than geniuses. The Japanese electronics giants built software factories in the 1970s and 1980s. Microsoft-associated work revived the term in the 2000s with Software Factories: Assembling Applications with Patterns, Models, Frameworks, and Tools. The agentic version is the one in front of us now: a crew of models that can write, test, and ship code from a brief, and it makes the old dream look reachable in a way it never quite did before.

The metaphor carries an assumption it never states, and that assumption is what this piece is about. The simplified factory story assumes a part made to a fixed specification and a quality gate that checks the part against the spec. But a factory that improves software has no fixed spec to check against. Its whole job is to find out whether a change made the thing better, out in the world where the thing runs, and that finding out, not the production, is the entire difficulty.

I thought I was building a factory and trying it out on clive, an environment-interface agent where the model does not call a tidy tool API but lives in a terminal: it reads the screen, types, watches what happens, and repeats. In reality, I had the relationship backwards. clive was never the test case the factory ran against. It was closer to the thing the factory grew around. As clive got more capable, the factory had to become larger, more complex and, yes, stranger.

The core point is that a factory can only improve what it can measure. The factory's complexity therefore grows with the difficulty of grounding the target: how hard it has become to tell whether a given change helped. That is not the same as the target’s complexity. It is a kind of meta-complexity that belongs to the factory, although the two usually climb together.

Contrast this with a well-established system like autoresearch. It improves a target of unbounded internal complexity, arbitrary surgery on a neural network, with three files and a single agent. It can do that because however complex the model becomes, the measurement stays cheap and clean: run the experiment for five minutes, read validation bits per byte, keep the change only if the number gets better.
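The keep-only-if-better rule is small enough to write down. What follows is a minimal sketch of that kind of loop, not autoresearch's actual code: `improve`, `toy_propose` and `toy_measure` are names I made up, the "model" is a single number, and the "validation score" is a toy function where lower is better.

```python
from typing import Callable, List, Tuple


def improve(
    state: float,
    propose: Callable[[float, int], float],
    measure: Callable[[float], float],
    rounds: int,
) -> Tuple[float, float, List[bool]]:
    """Keep a proposed change only when the measured score strictly drops."""
    best_score = measure(state)
    accepted: List[bool] = []
    for step in range(rounds):
        candidate = propose(state, step)
        score = measure(candidate)
        keep = score < best_score  # lower is better, as with bits per byte
        accepted.append(keep)
        if keep:
            state, best_score = candidate, score
    return state, best_score, accepted


# Hypothetical stand-ins for "edit the model" and "run the experiment".
def toy_measure(x: float) -> float:
    return (x - 3.0) ** 2


def toy_propose(x: float, step: int) -> float:
    # Alternate a forward step with a backward step; only helpful ones survive.
    return x + (1.0 if step % 2 == 0 else -0.5)


final_state, final_score, accepted = improve(0.0, toy_propose, toy_measure, rounds=6)
print(final_state, final_score, accepted)
```

Nothing in the loop cares how complicated `propose` is. All the trust sits in `measure`, which is why the factory around it can stay this small.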

Consequently, a complex target with an easy sensor needs a trivial factory. A simple-looking target with fuzzy success conditions may need a much stranger one. So it's not simply about growing the factory with the target. It is: grow the factory exactly as much as the target’s ground truth gets harder to read, and not one component further. You cannot out-build your grader.

Which tells you how to start, and it is the opposite of building the fleet you eventually want. You begin with one agent that writes code and one grounded check that can tell whether the code did the job. Then you add capability only where that simpler factory has gone blind, and every piece you add is the answer to a specific blindness rather than a thing included because it sounds powerful. A held-out set, because the optimiser started passing the visible tasks while failing the rest. An adversarial tester, because nobody was manufacturing the hard cases. A conductor, because the work had grown past what a single planned batch could hold.
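The held-out set is the easiest of these to sketch. The example below is a toy under stated assumptions: the task shape, the `pass_rate` and `gone_blind` helpers, and the 0.2 gap threshold are all hypothetical, chosen only to show what "passing the visible tasks while failing the rest" looks like to a grader.

```python
from typing import Callable, Dict, Sequence

Task = Dict[str, int]  # hypothetical task shape: {"x": input, "want": expected}


def pass_rate(solve: Callable[[int], int], tasks: Sequence[Task]) -> float:
    """Fraction of tasks whose output matches the expected value."""
    passed = sum(1 for t in tasks if solve(t["x"]) == t["want"])
    return passed / len(tasks)


def gone_blind(
    solve: Callable[[int], int],
    visible: Sequence[Task],
    held_out: Sequence[Task],
    max_gap: float = 0.2,  # assumed threshold, not a recommended value
) -> bool:
    """True when the visible score overstates the held-out score by more than max_gap."""
    return pass_rate(solve, visible) - pass_rate(solve, held_out) > max_gap


visible = [{"x": 1, "want": 2}, {"x": 2, "want": 4}]
held_out = [{"x": 3, "want": 6}, {"x": 5, "want": 10}]

memorised = {1: 2, 2: 4}


def overfit(x: int) -> int:
    # Passes every visible task by lookup and nothing else.
    return memorised.get(x, 0)


def honest(x: int) -> int:
    return 2 * x


print(gone_blind(overfit, visible, held_out))
print(gone_blind(honest, visible, held_out))
```

The visible check alone scores both solvers as perfect. Only the second sensor can tell them apart, and that is the whole case for adding it.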

Each new layer is earned by a specific way the previous layer failed. You do not design the fleet. It grows organically from a hand-cranked core, shaping itself with growing autonomy, one level up at a time. There is no clean lockstep tale to tell here; it is a story of co-evolution. The clearest single picture of that co-evolution of factory and product is in the logs from when the factory was building clive. The factory did not design its own safety boundary and impose it on clive. It read clive’s own constitution, the tiers clive already marks immutable, and froze those. The target reached up and shaped the structure of the thing built to improve it.
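To make the freezing step concrete, here is a rough sketch of a factory deriving its safety boundary from the target instead of inventing one. Everything in it is an assumption for illustration: the shape of the constitution, the tier names, the paths, and the `frozen_paths` and `is_allowed` helpers are hypothetical and are not clive's real format.

```python
from typing import Dict, List, Set

# Hypothetical constitution shape: tier name -> immutability flag and owned paths.
Constitution = Dict[str, Dict[str, object]]


def frozen_paths(constitution: Constitution) -> Set[str]:
    """Collect every path that belongs to a tier the target marks immutable."""
    frozen: Set[str] = set()
    for tier in constitution.values():
        if tier["immutable"]:
            frozen.update(tier["paths"])  # type: ignore[arg-type]
    return frozen


def is_allowed(changed: List[str], frozen: Set[str]) -> bool:
    """Reject a change set that touches a frozen path or anything beneath it."""
    for path in changed:
        for f in frozen:
            root = f.rstrip("/")
            if path == root or path.startswith(root + "/"):
                return False
    return True


# Illustrative data only; the real tiers are whatever the target declares.
constitution: Constitution = {
    "safety": {"immutable": True, "paths": ["policy/"]},
    "skills": {"immutable": False, "paths": ["skills/"]},
}

frozen = frozen_paths(constitution)
print(is_allowed(["skills/ls.py"], frozen))
print(is_allowed(["skills/ls.py", "policy/rules.md"], frozen))
```

The detail that matters is the direction of the dependency: `frozen` is computed from the target's own declaration, so when the target changes what it marks immutable, the boundary moves with it.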

I think this is the central and most interesting point that the software-factory metaphor usually misses. The factory is not standing outside the product, stamping parts. It is attached to the product by its sensors, and the two evolve together. When the sensor is clean, the factory can be small. When the sensor gets murky, the factory must grow eyes, memory, adversaries, reviewers, and eventually management. The target does not merely receive the output of the factory: it defines the conditions under which output can be called better.

The principle raises an intriguing question: what happens when the factory is pointed at itself? A tool like clive can be graded by whether the command worked. A research loop can be graded by validation loss. But a factory is a system of proposal, judgment, safety, orchestration, repair, and restraint. To improve that system, the next factory needs a grader for all of it, which is a much taller order, but a fascinating one.