
The Bootstrapping Has Begun

Every frontier AI lab is now using its own models to build the next version of itself. The bootstrapping of the self-improvement loop has begun and is likely to accelerate.

In March 2026, three things happened within the same two-week window. Chinese AI lab MiniMax released M2.7, which they described as "our first model deeply participating in its own evolution." OpenAI's GPT-5.3 Codex had already shipped in February with the claim that it was "instrumental in creating itself." And Andrej Karpathy open-sourced a 630-line Python script called autoresearch that lets anyone with a single GPU run autonomous ML experiments overnight — no ML background required.

Each of these, in isolation, is interesting. Together, they describe a phase transition that the AI safety community has been warning about for years and that the AI industry has been quietly building toward: recursive self-improvement. Not as a hypothetical. As a shipping feature.

What "self-improvement" actually means right now

Let's be precise: Nobody is claiming that GPT-5.3 Codex sat in a room and redesigned its own neural architecture from first principles. What OpenAI said is that early checkpoints of the model were used to debug training runs, manage deployment infrastructure, and diagnose evaluation failures. The engineering team used Codex to optimise the harness for the next version of Codex. Later checkpoints helped scale GPU clusters during launch and kept latency stable under traffic surges. That is not the Singularity. But it is a model materially accelerating the development of its own successor. And there is no obvious reason why that loop gets slower over time.

MiniMax's framing is more explicit. Their M2.7 release describes a development workflow in which the model updates its own memory, builds dozens of complex skills within its agent harness, and runs reinforcement learning experiments — then improves its own learning process based on the results. They ran over 100 autonomous iteration cycles across three trials, each running 24 hours. The result: a 30% performance improvement on internal evaluations, and a 66.6% medal rate on 22 ML competitions, tying with Gemini 3.1 and approaching GPT-5.4. MiniMax estimates that M2.7 handled 30 to 50 percent of its own development workflow.

The human in this loop still exists. The human designs experiments, steers direction, and reviews results. But the ratio of human input to model output is shifting — and it is shifting in one direction.

The Karpathy loop

If you thought recursive self-improvement required a frontier AI lab, Andrej Karpathy just made the counterargument. On March 7, Karpathy pushed autoresearch to GitHub and went to sleep. By morning, an agent had run 50 experiments on his training code, found 15 improvements including a bug in his attention implementation he had missed, and committed every result to git. All of this happened on a single GPU. The improvements stacked and transferred to a larger model, producing an 11% speedup on training efficiency.

The architecture is almost aggressively simple. Three files: prepare.py (data and utilities — do not touch), train.py (model, optimiser, training loop — the agent modifies this), and program.md (instructions for the agent, written in Markdown). The agent reads its own source code, forms a hypothesis, modifies the code, trains for exactly five minutes, evaluates against a baseline, and either commits or discards. Twelve experiments per hour. A hundred overnight.
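Stripped of the model, that control flow is an overnight hill-climb with the filesystem as memory. A toy Python sketch of the loop, with a stand-in objective in place of the real five-minute training run (the function names here are illustrative, not the repo's API):

```python
import random

def train_and_eval(lr):
    """Stand-in for a five-minute training run: returns a score to
    maximise. Toy objective, peaked at lr = 0.01."""
    return -abs(lr - 0.01)

def autoresearch_loop(iterations=100):
    # Durable state: the committed configuration and its baseline score.
    best_lr, best_score = 0.1, train_and_eval(0.1)
    for _ in range(iterations):
        # Form a hypothesis: perturb the committed configuration.
        candidate = best_lr * random.uniform(0.5, 2.0)
        score = train_and_eval(candidate)
        # Evaluate against the baseline: commit or discard.
        if score > best_score:
            best_lr, best_score = candidate, score
    return best_lr, best_score

best_lr, best_score = autoresearch_loop()
```

The real loop persists each commit to git and each result to a results file, so every iteration starts from durable state on disk rather than from in-memory history.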

Within two weeks, the repo had over 50,000 GitHub stars and hundreds of forks. Shopify's CEO Tobias Lütke pointed it at Liquid, the templating engine behind every Shopify storefront, and woke up to 53% faster rendering and 61% fewer memory allocations from 93 automated commits. The pattern had escaped the lab.

But the most interesting file in the repo is not train.py. It is program.md. It carries three registers simultaneously: instructions (what to search for), constraints (what must not change), and stopping criteria (when to report back). YAML encodes structure but not reasoning. Python is executable but not legible as strategy. JSON has no narrative. Markdown sits at the exact intersection of human editability and agent parseability. This is the same pattern behind CLAUDE.md files, Cursor rules, and a growing number of agent governance documents. The human programs the organisation. The agent programs the model.
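A hypothetical program.md in that three-register shape might look like the following (the headings and wording are illustrative, not copied from the repo):

```markdown
# Objective
Improve validation loss per wall-clock minute on the baseline in train.py.

## Search (instructions)
- Try optimiser, learning-rate schedule, and attention changes in train.py.

## Invariants (constraints)
- Do not modify prepare.py or the evaluation metric.
- Every run must complete within the five-minute training budget.

## Stop (stopping criteria)
- Stop after 100 experiments, or when ten consecutive runs fail to beat
  the committed baseline. Record every result before exiting.
```

Each register is legible to the human who edits it and parseable by the agent that obeys it, which is the whole argument for Markdown as the medium.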

397 billion parameters on a laptop

The autoresearch pattern is spreading beyond ML training, and the results are starting to look like science fiction. Dan Woods published a project called flash-moe that runs Qwen 3.5 397B — the full 397 billion parameter Mixture-of-Experts model, 209 GB on disk — on a MacBook Pro with 48 GB of RAM. Not a distilled version. Not a smaller model. The actual frontier-class model, running locally at 5.7 tokens per second sustained, using about 5.5 GB of resident memory.

He did not write any of the code. Not the 5,000-line Objective-C inference engine, not the 1,100 lines of Metal shaders, not the 2-bit requantization pipeline, not the tests. Claude wrote all of it. His role was to give Opus the idea, the right reference materials — Apple's "LLM in a Flash" paper, reverse-engineering work on the Apple Neural Engine, and Karpathy's autoresearch repo — and let it run. The whole journey took 24 hours and 90 experiments. 42% of those experiments were discarded. The failures were as informative as the successes.

The technical story is worth understanding because it illustrates how the autoresearch loop finds things humans would not search for. Apple's unified memory architecture wires the CPU, GPU, and SSD controller together on a single chip. The M3 Max does 17.5 GB/s sequential reads from the SSD — three times faster than what Apple measured on the M1 Max in their own paper. MoE models are absurdly sparse at inference time: Qwen 3.5 397B has 512 experts per layer but activates only 10 per token, and the autoresearch loop found you can prune that down to 4 with no quality degradation. Less than 2% of expert weights are needed for any given token. Two-bit requantization cut expert storage from 209 GB to 120 GB.

But the most counterintuitive finding — the one that a human engineer would likely never have tried — was that deleting the carefully engineered 9.8 GB GPU cache made everything 38% faster. Claude had built a sophisticated application-level LRU cache in GPU-visible shared memory, and it was actively hurting performance. The Metal cache's GPU-visible pages were forcing Apple's hardware memory compressor to run continuously at 60,000 to 130,000 decompressions per second, burning 1–2 GB/s of memory bandwidth on housekeeping. Remove the cache, the compressor goes quiet, and all that bandwidth becomes available for inference. The same lesson PostgreSQL's documentation teaches about not competing with the OS buffer cache — discovered autonomously at 3 AM by an agent running experiments while the human slept.

The theoretical throughput ceiling, limited only by SSD bandwidth, is 18.6 tok/s. The current 5.7 tok/s means the hardware is barely breaking a sweat. The M4 Max should reach roughly 8 tok/s with zero software changes. Within two to three hardware generations, 10+ tok/s on a 400 billion parameter model on a laptop is the baseline.
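That 18.6 tok/s figure falls straight out of the numbers already quoted. A sketch of the arithmetic, assuming expert weights are spread uniformly across layers:

```python
# Back-of-envelope check of the SSD-limited throughput bound quoted above.
expert_bytes = 120e9        # 2-bit requantized experts on disk
active_fraction = 4 / 512   # experts actually streamed per token, per layer
bytes_per_token = expert_bytes * active_fraction   # ~0.94 GB per token
ssd_bandwidth = 17.5e9      # M3 Max sequential read, bytes per second

ceiling_tok_s = ssd_bandwidth / bytes_per_token
print(round(ceiling_tok_s, 1))   # → 18.7, the quoted ~18.6 tok/s bound
```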

This is what the autoresearch loop looks like when you point it at a problem that is not hyperparameter tuning. It is a frontier model, writing a low-level inference engine from scratch in Objective-C and Metal, running experiments against a performance metric, and finding hardware-level optimisations that the human directing it did not know to look for. The human provided direction and systems-level insight at plateau points. The agent did everything else.

Google was early

Google DeepMind's AlphaEvolve, released in May 2025, did something that had eluded mathematicians for 56 years: it found a way to multiply 4×4 complex-valued matrices using 48 scalar multiplications, beating Strassen's 1969 algorithm by one operation. The reduction sounds trivial. It is not. Matrix multiplication is recursive — improvements compound exponentially as matrices grow. In a world where AI training runs perform billions of matrix multiplications per second, one fewer operation at the base case translates to measurable savings in energy and compute.
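The compounding claim is the standard recursion argument: applied recursively, a 4×4 base case with m scalar multiplications gives an n×n algorithm costing on the order of n to the power log₄(m), so the base-case count sets the asymptotic exponent. A quick check of how far one fewer multiplication moves it (for complex-valued matrices, where the 48-multiplication scheme applies):

```python
import math

# Recursive fast matmul with an m-multiplication 4x4 base case runs in
# O(n ** log_4(m)); one fewer base-case multiplication lowers the exponent.
strassen_4x4 = 49      # Strassen's 7-mult 2x2 scheme, applied twice
alphaevolve_4x4 = 48   # AlphaEvolve's complex-valued 4x4 scheme

exp_strassen = math.log(strassen_4x4, 4)        # ≈ 2.8074
exp_alphaevolve = math.log(alphaevolve_4x4, 4)  # ≈ 2.7925
```

A shift in the third decimal of the exponent is small at any fixed size, but it widens without bound as the matrices grow.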

But the self-referential part is the important one. AlphaEvolve optimised a matrix multiplication kernel used to train Gemini — the very model family that powers AlphaEvolve itself. A 23% speedup on that kernel. One percent reduction in end-to-end training time. AlphaEvolve also improved Google's data centre scheduling, recovering 0.7% of global compute resources — a number that sounds small until you remember the denominator.

This is the definition of recursive self-improvement: a system discovering optimisations that make the next version of itself faster to train. Not through some mysterious emergent behaviour, but through the unglamorous work of kernel optimisation, scheduler tuning, and hyperparameter search — performed by an agent that does not sleep, does not get bored, and explores solution spaces that humans would not think to search.

What Anthropic is not saying

Anthropic has been characteristically circumspect about using the phrase "recursive self-improvement." They are Anthropic; the phrase comes pre-loaded with existential risk connotations and they wrote the book on that. But look at what they are doing. A September 2025 article about building agents with the Claude agent SDK contained this sentence: "Over the past several months, Claude Code has become far more than a coding tool. At Anthropic, we've been using it for deep research, video creation, and note-taking among countless other non-coding applications. In fact, it has begun to power almost all of our major agent loops."

A July 2025 article described autonomous loops where Claude Code writes new features, runs tests, and iterates continuously — given abstract problems to work on autonomously, with humans reviewing solutions before final refinements. Over a two-week stretch in early 2026, Anthropic shipped faster than any other organisation in the industry. Draw your own conclusions about what is powering that velocity.

The flash-moe project is an unintentional case study in Anthropic's position. Their model wrote an entire inference engine — Objective-C, Metal shaders, requantization pipeline — and then optimised it through 90 autonomous experiments. Anthropic does not need to say "recursive self-improvement." Their model is already doing the work that makes recursive self-improvement possible at other organisations.

The soft version of the hard takeoff

In AI safety literature, "hard takeoff" refers to a scenario where recursive self-improvement leads to an intelligence explosion — a rapid, potentially uncontrollable acceleration in AI capability. Leopold Aschenbrenner's "Situational Awareness" paper, published after he left OpenAI, laid out the argument with a graph showing effective compute on the y-axis and time on the x-axis. The curve goes vertical once automated AI researchers replace human ones.

The responsible version of the argument acknowledges where we actually are on that curve. We are not at the vertical part. We are at the inflection — the point where the curve starts bending upward, where the second derivative turns positive, but the rate of change is still legible to humans.

The evidence for this position:

- MiniMax's model handles 30–50% of its own development workflow. Not 100%.
- OpenAI's Codex debugged parts of its own training pipeline. It did not design the training run.
- Karpathy's autoresearch finds better hyperparameters and catches bugs. It does not propose novel architectures that leap the field forward.
- Dan Woods' flash-moe project produced a working inference engine, but a human had to provide the conceptual direction and intervene at plateau points.
- AlphaEvolve improved Strassen's algorithm by one multiplication. Not by ten.

Each of these is a partial automation of a previously human task. The automation is real, measurable, and compounding. But the human is still in the loop — designing experiments, setting objectives, reviewing results, deciding direction. Sam Altman's October 2025 timeline called for "an automated AI research intern by September 2026" and "a true automated AI researcher by March 2028." Those goalposts now look conservative, but they remain distinct: an intern follows instructions and surfaces findings. A researcher proposes novel directions.

The distinction matters because it is the difference between acceleration and explosion. Acceleration compounds gradually. Explosion is discontinuous.

The infrastructure pattern nobody is naming

Zoom out from the individual examples and a common architecture emerges. It is not novel. It is a rediscovery.

I wrote about the Ralph Loop in January — a technique where you run an AI coding agent, let it finish, restart it with fresh context, and repeat. In its purest form: while :; do cat PROMPT.md | claude-code; done. Memory persists not in the model's context but in the filesystem: git commits, progress files, task lists. Each iteration reads the current state, decides what to do, does one thing well, and updates the state for the next iteration. Stateless workers plus durable state beats clever in-memory abstractions. Unix knew this. Distributed systems have known it for decades.
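A runnable toy version of that one-liner makes the division of labour visible: the worker is disposable, and everything it knows survives only in files. This sketch substitutes a stub for the `cat PROMPT.md | claude-code` step:

```shell
#!/bin/sh
# Toy Ralph Loop: a stateless worker, restarted with fresh context each
# iteration, with all memory in the filesystem. In the real loop the body
# is `cat PROMPT.md | claude-code`; here a stub does one step and exits.
STATE=progress.log
: > "$STATE"                    # durable state starts empty
for i in 1 2 3; do              # `while :; do ...; done`, bounded for the demo
    n=$(grep -c . "$STATE")     # fresh context: recover progress from disk
    echo "completed task $((n + 1))" >> "$STATE"   # do one thing, persist it
done
cat "$STATE"                    # three tasks recorded, one per fresh run
```

Kill the loop at any point and restart it, and the next worker picks up exactly where the last one left off — the property that lets these loops run unattended overnight.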

Autoresearch is a Ralph Loop. Each five-minute training run is a fresh context. State persists in train.py and results.tsv. The agent reads, modifies, evaluates, commits or discards. Flash-moe's 90 experiments follow the same structure — each iteration starts clean, reads the codebase, forms a hypothesis, runs the experiment, and persists the result. MiniMax's 100+ autonomous cycles are the same pattern at the scale of a frontier lab's research infrastructure: fresh iteration, durable state, evaluate, keep or discard.

This is the pattern that makes recursive self-improvement mechanically possible. Not some exotic new architecture, but the oldest trick in systems engineering: treat the worker as disposable, treat the state as sacred. The innovation is not the loop. The innovation is that the worker inside the loop is now smart enough to do research.

I have been building toward this from a different direction. In reWritable, a single HTML file contains its own editor. The file loads from localStorage, the user describes a change in natural language, and the LLM rewrites the entire application source. The file on disk is an immutable 20-line bootstrap — a seed. Everything else is what the agent has made so far. And clive is the same thesis in the terminal: the agent inhabits the environment, reads its own screen, types keystrokes, watches what happens, repeats.

None of these systems get smarter in the Aschenbrenner sense. But they share the same structural property: the artifact participates in its own modification, and the loop is durable enough to compound. Whether the artifact is a training script, an inference engine, a scheduling algorithm, or a single-file HTML application, the architecture is identical. Read state. Propose change. Evaluate. Keep or discard. Persist. Repeat.

What compounds

Here is what I think is underappreciated about this moment: the improvements are not isolated. They feed into each other.

AlphaEvolve improves matrix multiplication kernels. Those kernels make training faster. Faster training means more experiments per dollar. More experiments per dollar means autoresearch finds better configurations. Better configurations mean smaller models match larger ones — or, as flash-moe demonstrates, frontier-class models run on hardware you already own. Consumer hardware means individual researchers can run autonomous experiments. Individual researchers push improvements upstream. Upstream improvements get incorporated into the next frontier model.

This is not a single recursive loop. It is a network of recursive loops, each feeding into the others. Kernel optimisation feeds training efficiency feeds model capability feeds tool quality feeds kernel optimisation. The compounding happens across layers, not just within them.

And the access cost is collapsing. Autoresearch runs on a single GPU. QLoRA fine-tuning fits 7B-parameter models on a consumer RTX 4070. MLX ports run autoresearch on a Mac Mini M4. A 397B-parameter model runs on a MacBook Pro with 48 GB of RAM. The gap between "what a frontier lab can do" and "what a person with a laptop can do" is closing at a rate that makes linear extrapolation unreliable.

The uncomfortable question

The standard response to hard takeoff anxiety is: there are still humans in the loop. And that is true. There are. But the loop is getting shorter.

MiniMax's model handles 30–50% of the workflow today. If the next version handles 70%, and the version after that handles 90%, the human's role shrinks to setting the initial objective and reviewing the final result. That is not a discontinuity. It is a gradient. But it is a gradient that points in one direction.

The more immediate question is not whether recursive self-improvement leads to superintelligence. The question is what happens to the field of AI research when the research itself is partially automated. When Karpathy's 630-line script finds improvements that he missed after months of hand-tuning. When an agent writes 6,000 lines of Objective-C and Metal to stream a 397B model off an SSD at speeds a human engineer did not think were achievable. When MiniMax's agent discovers parameter combinations through systematic search that no human researcher would have tried. When AlphaEvolve proposes solutions to problems that have been open for half a century.

What happens is that the bottleneck shifts. From "can we build a better model" to "can we define what better means." From implementation to specification. From coding to judgment.

This is the soft version of the hard takeoff: not an intelligence explosion, but an engineering velocity explosion. Models get better faster because models help make models better. The curve bends upward. The humans are still driving. But the road is getting steeper, and the car is getting faster, and nobody is entirely sure where the brakes are.

The bootstrapping has begun. It is not dramatic. It is not cinematic. It is an overnight cron job on a single GPU, committing improvements to a git branch while the researcher sleeps. It is 5,000 lines of Objective-C that a human never wrote, streaming 397 billion parameters off a laptop SSD. And that might be the most unsettling part — how ordinary it looks from the outside.

Related: The Ralph Loop | reWritable: Software that knows how to change itself | clive: an agent that inhabits the terminal

Sources: flash-moe by Dan Woods (thread) | autoresearch by Andrej Karpathy | MiniMax M2.7: Early Echoes of Self-Evolution | Introducing GPT-5.3-Codex | AlphaEvolve
