Your Skill File Now Has a Backward Pass

Researchers at Microsoft published the first systematic text-space optimiser for agent skill files last week. SkillOpt, led by Yifan Yang and a team of fifteen, runs a frozen agent against scored tasks with a skill document in its context, then uses a separate optimiser model to propose edits to that document — additions, deletions, replacements — which only land if they strictly improve performance on a held-out validation set.

Across six benchmarks, seven target models, and three execution harnesses including Claude Code and Codex, the optimised skills were best or tied-best on every one of the 52 cells the team evaluated. On GPT-5.5 in direct chat, they lifted the no-skill baseline by 23.5 points. Inside the Codex agentic loop, by 24.8. Inside Claude Code, by 19.1. The resulting best_skill.md files transferred to different models, different harnesses, and nearby benchmarks without retraining.

That last result is what makes this a layer shift, not a prompt-tuning trick. Skills — the SKILL.md files now spreading across Claude Code, Codex, DeepAgents, and the Anthropic Agent Skills standard — have so far been hand-written. SkillOpt treats them as the training target instead of the input, with the discipline that makes weight-space optimisation actually work: held-out gates, bounded learning rates, negative gradients, slow updates.

The validation gate is what makes it work

The training loop is simple to describe. A frozen agent runs a batch of scored tasks with the current skill in context. The optimiser model reads the rollouts, sorts them into trajectories that succeeded and trajectories that failed, and looks across the contrast for what the skill would have needed to say to turn the failures into successes. That comparison is the gradient signal. The optimiser proposes a small bounded set of edits, each candidate skill is tested on a held-out validation set, and only an edit that strictly improves score is promoted.
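One way to picture that loop is the minimal Python sketch below. It is an illustration built from the description above, not the paper's implementation: `Rollout`, `Edit`, `apply_edit`, `pass_rate` and `gated_step` are hypothetical names, and the agent (`run`) and optimiser model (`propose`) are passed in as plain callables so that no real API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Rollout:
    task: str
    transcript: str
    passed: bool


@dataclass(frozen=True)
class Edit:
    kind: str         # "add", "delete", or "replace"
    target: str = ""  # existing text to delete or replace
    text: str = ""    # new text to add or substitute


# Hypothetical signatures: run(skill, task) and propose(skill, successes, failures).
RunFn = Callable[[str, str], Rollout]
ProposeFn = Callable[[str, List[Rollout], List[Rollout]], List[Edit]]


def apply_edit(skill: str, edit: Edit) -> str:
    if edit.kind == "add":
        return skill + "\n" + edit.text
    if edit.kind == "delete":
        return skill.replace(edit.target, "", 1)
    if edit.kind == "replace":
        return skill.replace(edit.target, edit.text, 1)
    raise ValueError(f"unknown edit kind: {edit.kind}")


def pass_rate(skill: str, tasks: Sequence[str], run: RunFn) -> float:
    return sum(run(skill, t).passed for t in tasks) / len(tasks)


def gated_step(
    skill: str,
    train_tasks: Sequence[str],
    val_tasks: Sequence[str],
    run: RunFn,
    propose: ProposeFn,
) -> Tuple[str, float, List[Edit]]:
    # 1. The frozen agent runs the training batch with the current skill.
    rollouts = [run(skill, t) for t in train_tasks]
    successes = [r for r in rollouts if r.passed]
    failures = [r for r in rollouts if not r.passed]

    # 2. The optimiser reads the contrast and proposes edits.
    best_skill, best_score = skill, pass_rate(skill, val_tasks, run)
    rejected: List[Edit] = []
    for edit in propose(skill, successes, failures):
        candidate = apply_edit(best_skill, edit)
        score = pass_rate(candidate, val_tasks, run)
        # 3. The gate: only a strict improvement on held-out tasks is promoted.
        if score > best_score:
            best_skill, best_score = candidate, score
        else:
            rejected.append(edit)
    return best_skill, best_score, rejected
```

The strict `>` is the point of the sketch: an edit that merely ties the current validation score is treated as a rejection, however persuasive the optimiser's rationale for it.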

Concretely: imagine training a skill for a spreadsheet agent. Twenty tasks run with the current skill, twelve succeed, eight fail. Reading the failures, the optimiser notices the agent stumbled when date strings were ambiguous — column entries like 03/04/2024 with no format context. The current skill says to convert dates to datetime objects before computing intervals, but does not cover the ambiguous case. The optimiser proposes an additional clause: when the format is unclear, scan the rest of the column for disambiguating entries before parsing. That candidate skill goes to validation. If score improves, the edit lands. If not, it goes to the rejected-edit buffer, and the optimiser is told this direction did not work.
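To make the proposed clause concrete, here is a rough Python sketch of what "scan the rest of the column for disambiguating entries" could mean for slash-separated dates. The function names and the rule itself (a component greater than 12 cannot be a month) are assumptions made for illustration. They are not taken from the paper or from any benchmark.

```python
from datetime import date
from typing import List, Optional, Sequence


def infer_day_first(column: Sequence[str]) -> Optional[bool]:
    """Return True for DD/MM/YYYY, False for MM/DD/YYYY, None if undecidable."""
    for entry in column:
        parts = entry.strip().split("/")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # skip anything that is not a plain a/b/yyyy string
        first, second = int(parts[0]), int(parts[1])
        # A component above 12 cannot be a month, so it settles the format.
        if first > 12 and second <= 12:
            return True
        if second > 12 and first <= 12:
            return False
    return None


def parse_column(column: Sequence[str]) -> List[date]:
    day_first = infer_day_first(column)
    if day_first is None:
        # Nothing in the column disambiguates; refuse to guess.
        raise ValueError("date format is ambiguous across the whole column")
    parsed = []
    for entry in column:
        a, b, year = (int(p) for p in entry.strip().split("/"))
        day, month = (a, b) if day_first else (b, a)
        parsed.append(date(year, month, day))
    return parsed
```

With a column of `["03/04/2024", "25/04/2024"]`, the second entry settles the format as day-first, so the first parses as 3 April. The sketch takes the first disambiguating entry at face value and does not handle columns that mix both formats.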

The held-out gate is the move that distinguishes SkillOpt from earlier self-revision approaches. LLM optimisers are good at producing plausible-sounding rationalisations for why a change should help, and the failure mode of loosely-controlled self-revision is exactly this: the skill drifts toward what the optimiser thinks should work, untested. The validation gate makes the optimiser's opinions cheap and the evaluator's verdict authoritative.

Three further controls do structural work that the paper's ablations show is load-bearing. A textual learning-rate budget caps how many edits land per epoch: too many and a useful rule gets overwritten by a broad rewrite, too few and the skill cannot acquire procedures fast enough to track what the rollouts reveal. A rejected-edit buffer keeps a record of failed proposals so the optimiser cannot re-propose the same dead-end direction next epoch. A slow-update meta-skill sits on the optimiser side and accumulates longer-horizon patterns about how the skill itself is evolving (when to prefer replace over add, when the document has grown too long and needs compression) without bloating the deployed file.
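The first two controls can be sketched in a few lines of Python. This is a simplification under stated assumptions: the buffer here filters proposals by normalised exact match, whereas a real system might instead show the rejected edits to the optimiser model as context. `RejectedEditBuffer` and `select_edits` are hypothetical names, and `budget` stands in for the textual learning rate.

```python
from typing import List, Sequence, Set


class RejectedEditBuffer:
    """Remembers edit texts that failed the validation gate."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    @staticmethod
    def _key(edit_text: str) -> str:
        # Normalise case and whitespace so trivial rewordings still match.
        return " ".join(edit_text.lower().split())

    def add(self, edit_text: str) -> None:
        self._seen.add(self._key(edit_text))

    def __contains__(self, edit_text: str) -> bool:
        return self._key(edit_text) in self._seen


def select_edits(
    proposals: Sequence[str], buffer: RejectedEditBuffer, budget: int
) -> List[str]:
    """Drop previously rejected proposals, then keep at most `budget` of the rest."""
    fresh = [p for p in proposals if p not in buffer]
    return fresh[:budget]
```

A small `budget` keeps each epoch's change to the skill bounded, while the buffer stops the same dead end from consuming that budget twice.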

The deployed artifact is one document. Zero extra model calls at inference. The optimiser, the meta-skill, the buffer, and the validation runs are all training-time scaffolding, in the same way that a PyTorch optimiser does not ride along to production.

Skills transfer because they encode task knowledge

The transfer experiments are where the structural claim gets its weight. A skill optimised against GPT-5.4 on LiveMathBench retains its gain when moved to GPT-5.4-nano. A skill trained inside the Codex harness on SpreadsheetBench retains its gain when moved into Claude Code. A skill trained on one math benchmark generalises to a nearby one. The artifact is not overfit to the model that produced it, and not overfit to the harness it was trained inside.

The reason comes from what the skill contains. The optimised file accumulates procedural rules about the task — when ambiguous dates appear in this kind of spreadsheet, here is how to disambiguate; when this kind of tool returns this kind of error, here is how to recover — not procedural rules about the model. Knowledge about the task is portable. A different model executing the same skill against the same domain inherits the same accumulated procedure. A different harness loading the same SKILL.md format gets the same playbook.

This is the layer shift. The recipe — harness plus domain context plus codified judgment — was already where the engineering judgment in modern agent systems lived. The third term used to be the part you wrote. Now it is the part you can train.

The paper positions explicitly against the prior lineage: TextGrad, GEPA, Trace2Skill, EvoSkill, one-shot LLM-generated skills, and human-written skills. SkillOpt matches or beats every one of them on every cell, human-written skills included. Worth taking seriously. Also worth not over-extending: the baselines and benchmarks are the ones the authors chose.

The eval signal is the bottleneck

SkillOpt's six benchmarks — SpreadsheetBench, OfficeBench, DocVQA, LiveMathBench, SearchQA, ALFWorld — share a property worth naming explicitly. They have crisp pass/fail. The trajectory solves the puzzle or it does not. The spreadsheet matches the gold output or it does not. The math answer is right or wrong. That is the gradient the textual backward pass runs on.

Plenty of knowledge work has signal that is almost as crisp. Did the agent cite sources correctly? Did it recover from the tool error before retrying? Did the summary include the required claims? Did it flag uncertainty rather than confabulate? Each of those is constructible as a graded eval, but constructing it costs real work.

What that work looks like in practice: to train a skill for a research-summary agent against the property "did the summary include the required claims and avoid unsupported ones," you have to specify source documents, the claims those documents do and do not support, and a grader — another LLM, a tuned classifier, sampled human review — that judges rollouts against that ground truth at training-time scale. The grader has to be validated for correlation with what you actually care about. None of this is insurmountable. None of it is free.
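One piece of that validation can be sketched concretely: checking a grader's pass/fail verdicts against a sample of human labels. The Python below computes raw agreement and Cohen's kappa, which corrects agreement for chance. Using kappa here is an illustrative choice, not something the paper prescribes, and the function names are hypothetical.

```python
from typing import Sequence


def agreement(grader: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of rollouts where the grader and the human gave the same verdict."""
    if len(grader) != len(human) or not grader:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == h for g, h in zip(grader, human)) / len(grader)


def cohen_kappa(grader: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement for binary pass/fail labels."""
    observed = agreement(grader, human)
    n = len(grader)
    p_grader = sum(grader) / n  # how often the grader says "pass"
    p_human = sum(human) / n    # how often the human says "pass"
    # Probability that both would agree if they labelled independently.
    expected = p_grader * p_human + (1 - p_grader) * (1 - p_human)
    if expected == 1.0:
        # Kappa is undefined when both raters give one constant label;
        # report 0.0 as a conservative "no evidence of real agreement".
        return 0.0
    return (observed - expected) / (1 - expected)
```

A grader that agrees with human reviewers on 75% of rollouts can still have a kappa of only 0.5 when one verdict dominates, which is why raw agreement alone can flatter a weak grader.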

What should not get lost in the excitement about 23-point gains is that they come from settings where the gradient signal was already engineered into the benchmark. Domains without that signal will need to build it before SkillOpt-style training can land.

Hand-written is now a choice, not a default

The skills in your repo right now were written by humans, edited by intuition, and improved when someone noticed they were producing bad outputs. That is the workflow SkillOpt is positioned against — exactly the hand-crafted skills the paper's abstract names as the current default.

The practical takeaway for teams running agent stacks: identify which skill files handle tasks with clean pass/fail signal, and budget for the eval-signal engineering that the rest will need. The skills are still yours. They are just no longer only yours to write.

Backprop trained weights. RLHF trained behaviour. The next layer up is already on arXiv.