SkillOpt boosted GPT-5.5's six-benchmark average from 58.8 to 82.3 — a +23.5 point absolute improvement — without changing a single model weight. That's 5.4 points above the best existing method chosen per benchmark.
+23.5 Points Without Touching Weights
Today's agent skills come from hand-written prompts, one-shot LLM generations, or loose trial-and-error revisions. None of those behaves like a deep-learning optimizer — no step-size control, no held-out validation, no memory of failed edits. Skills drift, grow bloated, and quietly tank performance.
Microsoft Research's SkillOpt reframes the problem: treat the skill file as a trainable parameter outside a frozen target model. A separate optimizer model proposes small text edits (add, delete, replace), merges and deduplicates them, clips the combined update to an edit budget set by a textual learning rate, and adopts only candidates that strictly outscore the current skill on a validation split. A rejected-edit buffer feeds negative examples back into later optimizer calls. Separate epoch-wise slow/meta updates consolidate longer-horizon lessons.
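To make the edit vocabulary concrete, here is a minimal Python sketch of add, delete, and replace edits applied to a skill stored as a list of lines, with exact-duplicate proposals dropped first. The names `Edit`, `apply_edit`, and `dedupe` are hypothetical, and line-level granularity is an assumption for illustration; SkillOpt's actual edit format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One proposed text edit (hypothetical structure, not SkillOpt's own)."""
    op: str                        # "add", "delete", or "replace"
    target: Optional[str] = None   # existing line to delete or replace
    text: Optional[str] = None     # new line for add or replace


def dedupe(edits: List[Edit]) -> List[Edit]:
    """Drop exact duplicate proposals while keeping first-seen order."""
    seen = set()
    unique = []
    for edit in edits:
        if edit not in seen:
            seen.add(edit)
            unique.append(edit)
    return unique


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new list of skill lines with one edit applied."""
    if edit.op not in {"add", "delete", "replace"}:
        raise ValueError(f"unknown op: {edit.op}")
    if edit.op == "add":
        return lines + [edit.text]
    if edit.target not in lines:
        return list(lines)  # stale edit: the target line is gone, so do nothing
    idx = lines.index(edit.target)
    replacement = [] if edit.op == "delete" else [edit.text]
    return lines[:idx] + replacement + lines[idx + 1:]


skill = ["Read the sheet before editing.", "Always answer in JSON."]
proposals = [
    Edit("replace", "Always answer in JSON.", "Answer in the format the task requests."),
    Edit("add", text="Verify formulas after writing."),
    Edit("add", text="Verify formulas after writing."),  # duplicate proposal
]

new_skill = skill
for edit in dedupe(proposals):
    new_skill = apply_edit(new_skill, edit)
print("\n".join(new_skill))
```

Keeping edits as small, typed operations is what lets a later step count them, rank them, and reject them individually.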
Across six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), seven target models from GPT-5.5 down to Qwen3.5-4B, and three execution modes (direct chat, Codex, Claude Code), SkillOpt delivered best or tied-best results on all 52 evaluation cells. SpreadsheetBench climbed from 41.8 to 80.7; OfficeQA from 33.1 to 72.1; LiveMathematicianBench from 37.6 to 66.9.
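The absolute gains implied by these before-and-after scores are easy to check. The snippet below only subtracts the numbers quoted above; the variable names are illustrative.

```python
# (no-skill score, SkillOpt score) as quoted in the text above
scores = {
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
}

# Round to one decimal to hide floating-point noise from the subtraction.
gains = {name: round(after - before, 1) for name, (before, after) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points")
```

SpreadsheetBench and OfficeQA each gain roughly 39 points, and LiveMathematicianBench gains 29.3.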
A Training Loop for Natural-Language Skills
SkillOpt runs a forward-backward-update cycle entirely in text space. In the forward pass, the frozen target model executes a batch of training tasks with the current skill. In the backward pass, a separate optimizer model reads the resulting trajectories in minibatches and distills patterns to preserve and patterns to correct. The update step proposes edits, merges them, ranks them, and clips them to a per-step edit budget.
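The cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SkillOpt's implementation: `run_target` and `reflect` are hypothetical stand-ins for the target-model and optimizer-model calls, edits carry a numeric priority, and the edit budget is counted in number of edits.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Trajectory = Tuple[str, bool]    # (task, succeeded): toy stand-in for a full trace
ScoredEdit = Tuple[float, str]   # (priority, edit text): assumed representation


def minibatches(items: Sequence[Trajectory], size: int) -> Iterator[Sequence[Trajectory]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def training_step(
    skill: str,
    tasks: Sequence[str],
    run_target: Callable[[str, str], Trajectory],
    reflect: Callable[[str, Sequence[Trajectory]], List[ScoredEdit]],
    minibatch_size: int,
    edit_budget: int,
) -> List[str]:
    # Forward pass: the frozen target model runs every task with the current skill.
    trajectories = [run_target(skill, task) for task in tasks]

    # Backward pass: the optimizer reads trajectories minibatch by minibatch.
    proposals: List[ScoredEdit] = []
    for batch in minibatches(trajectories, minibatch_size):
        proposals.extend(reflect(skill, batch))

    # Update: merge duplicate edits (keep the highest priority), rank, then clip.
    merged: Dict[str, float] = {}
    for priority, text in proposals:
        merged[text] = max(priority, merged.get(text, float("-inf")))
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [text for text, _ in ranked[:edit_budget]]


# Toy stand-ins so the sketch runs without any model calls.
def toy_run_target(skill: str, task: str) -> Trajectory:
    return (task, "hard" not in task)


def toy_reflect(skill: str, batch: Sequence[Trajectory]) -> List[ScoredEdit]:
    failures = sum(1 for _, ok in batch if not ok)
    edits: List[ScoredEdit] = [(0.1, "Keep answers short.")]
    if failures:
        edits.append((float(failures), "Re-check the answer on hard tasks."))
    return edits


kept = training_step(
    skill="Plan first.",
    tasks=["easy-1", "hard-1", "hard-2", "easy-2"],
    run_target=toy_run_target,
    reflect=toy_reflect,
    minibatch_size=2,
    edit_budget=1,
)
print(kept)
```

With a budget of one, only the highest-priority edit survives the step, which is the text-space analogue of taking a small step rather than rewriting the whole skill.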
Every candidate must pass a strict validation gate: it is adopted only if it beats the current skill on the held-out split. Rejected edits enter a buffer that acts as negative feedback for later optimizer calls in the same epoch. This keeps the skill converging instead of drifting. The final artifact, best_skill.md, is no opaque parameter blob: its median length is roughly 920 tokens, and only one to four edits are accepted into the final file. OfficeQA's +39.0 point gain came from a single accepted edit.
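A minimal Python sketch of such a gate follows. It assumes candidates are full skill texts evaluated one after another, and it uses a toy keyword scorer in place of a real held-out validation run; `gated_update` and `toy_score` are hypothetical names.

```python
from typing import Callable, List, Tuple


def gated_update(
    skill: str,
    candidates: List[Tuple[str, str]],   # (edit description, candidate skill text)
    score: Callable[[str], float],       # stand-in for held-out validation
    rejected: List[str],                 # rejected-edit buffer, mutated in place
) -> Tuple[str, float]:
    """Adopt a candidate only if it strictly beats the current best score."""
    best, best_score = skill, score(skill)
    for description, candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > best_score:     # strict: a tie is not enough
            best, best_score = candidate, candidate_score
        else:
            rejected.append(description)     # kept as negative feedback
    return best, best_score


CHECKS = ("verify", "units")


def toy_score(skill: str) -> float:
    """Toy validation metric: fraction of expected habits the skill mentions."""
    text = skill.lower()
    return sum(check in text for check in CHECKS) / len(CHECKS)


rejected_buffer: List[str] = []
best_skill, best_score = gated_update(
    skill="Plan first.",
    candidates=[
        ("add a verification step", "Plan first. Verify results."),
        ("reword the planning line", "Plan carefully. Verify results."),
        ("add a units check", "Plan first. Verify results. Check units."),
    ],
    score=toy_score,
    rejected=rejected_buffer,
)
print(best_skill, best_score, rejected_buffer)
```

The rewording ties the current best rather than beating it, so it lands in the buffer instead of the skill. Under a strict gate, neutral edits cannot accumulate, which is one way a skill file can stay short.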
Ablations confirm the controls matter. Removing the rejected-edit buffer lowers scores on all three ablation benchmarks. Removing both the meta skill and slow update drops SpreadsheetBench from 77.5 to 55.0.
Skills That Transfer Across Models and Harnesses
The optimized skill captures reusable workflow knowledge, not overfit benchmark instructions. Skills transfer across model scales: GPT-5.4-mini with a SkillOpt skill (64.3 average) exceeds the no-skill baseline of the larger GPT-5.4 (59.7). Qwen3.5-4B, a 4-billion-parameter model, surpasses GPT-5.2's no-skill baseline once equipped with a SkillOpt skill.
Cross-harness transfer is even more striking. A spreadsheet skill trained inside Codex and dropped into Claude Code with no further optimization lifted the score from a no-skill baseline of 22.1 to 81.8, slightly above the 80.4 achieved by training directly inside Claude Code. Since the two harnesses expose different tool surfaces, this suggests the skill learned general workflow logic, not harness-specific recipes.
SkillOpt points to a lighter-weight path for domain-adapting agents: instead of finetuning weights or hand-tuning prompts, teams can train a small, versionable, auditable natural-language skill layer wherever automatic evaluation exists. Procedural knowledge outside the model can be optimized — and when that process is controlled and validated, a skill file becomes a stable, transferable adapter between frontier-model capability and real-world workloads.
Source: SkillOpt: Agent skills as trainable parameters
Domain: microsoft.com