Left to themselves, LLMs write agent skills that don't help. On SkillsBench, human-authored skills improve task pass rates by 16.2 percentage points; LLM-authored skills provide no measurable gain. That's the problem SkillAxe sets out to solve, and the reported results suggest it makes real headway.
The Skill-Writing Wall
Modern agent frameworks rely on "skill documents": structured natural-language instructions that tell an LLM agent how to accomplish a task. Write a good skill and your agent works. Write a bad one and it's dead weight. It turns out that LLMs, left to their own devices, are poor skill authors.
SkillAxe, described in a new paper, decomposes skill quality into four interpretable dimensions: quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage. It needs no ground-truth labels, no test suites, and no environment rewards. Instead, it turns its diagnosis into structured improvement briefs that the LLM can use to critique and rewrite its own skills.
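To make the idea of a structured improvement brief concrete, here is a minimal sketch of one possible shape for it. Everything in it is an assumption for illustration: the class names `Finding` and `ImprovementBrief`, the field names, and the example skill `csv-cleanup` are hypothetical, and the paper's actual brief format is not shown in this article. Only the four dimension names come from the description above.

```python
from dataclasses import dataclass, field

# The four dimensions named in the article. All other names below are
# hypothetical and exist only for this sketch.
DIMENSIONS = (
    "quality impact",
    "trigger precision",
    "instruction compliance",
    "solution-path coverage",
)


@dataclass
class Finding:
    dimension: str   # must be one of DIMENSIONS
    issue: str       # what the diagnosis observed
    suggestion: str  # how the skill text might change

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


@dataclass
class ImprovementBrief:
    skill_name: str
    findings: list[Finding] = field(default_factory=list)

    def render(self) -> str:
        """Format the brief as plain text to hand back to the authoring LLM."""
        lines = [f"Improvement brief for skill '{self.skill_name}':"]
        for dim in DIMENSIONS:
            items = [x for x in self.findings if x.dimension == dim]
            if not items:
                lines.append(f"- {dim}: no issues found")
            for item in items:
                lines.append(f"- {dim}: {item.issue} -> {item.suggestion}")
        return "\n".join(lines)


# Hypothetical example: a skill that triggers on the wrong tasks.
brief = ImprovementBrief(
    skill_name="csv-cleanup",
    findings=[
        Finding(
            "trigger precision",
            "fires on unrelated PDF tasks",
            "narrow the trigger description to CSV inputs",
        )
    ],
)
print(brief.render())
```

Keeping one entry per dimension, even when nothing is wrong, means the rewriting model sees what was checked as well as what failed.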
Measurable Gains Without Human Help
After SkillAxe's iterative self-refinement loop was applied, LLM-authored skills on SkillsBench improved by a relative 28% over their unrefined versions. More importantly, that improvement closed 47–67% of the performance gap to human-authored skills. For a fully unsupervised process, that's a serious jump.
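"Closing the gap" is easy to misread, so here is a small sketch of how such a figure is usually computed: the gain from refinement divided by the distance between the unrefined skill and the human-authored one. The function name `gap_closed` and the pass rates in the example are made up for illustration and are not taken from the paper, which may define the metric differently.

```python
def gap_closed(unrefined: float, refined: float, human: float) -> float:
    """Fraction of the unrefined-to-human gap recovered by refinement."""
    gap = human - unrefined
    if gap <= 0:
        raise ValueError("human pass rate must exceed the unrefined pass rate")
    return (refined - unrefined) / gap


# Illustrative pass rates in percent. These are NOT results from the paper.
unrefined, refined, human = 20.0, 25.0, 30.0

print(gap_closed(unrefined, refined, human))  # 0.5, i.e. half the gap closed
```

Under this definition, a value of 0 means refinement did nothing and a value of 1 means the refined skill matches the human-authored one.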
The framework works as a continuous improvement engine too. On SpreadsheetBench, SkillAxe built a library of just 22 skills from past agent trajectories and raised the pass rate from 16.0% to 52.0%. No human intervention, no golden test cases.
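Because the article mixes percentage points and relative percentages, it helps to spell out what the SpreadsheetBench jump means in both units. The snippet below uses only the two pass rates quoted above; the variable names are arbitrary.

```python
before, after = 16.0, 52.0  # SpreadsheetBench pass rates in percent, as reported

points_gained = after - before               # absolute gain in percentage points
ratio = after / before                       # how many times higher the new rate is
relative_gain = (after - before) / before    # gain relative to the starting rate

print(points_gained)   # 36.0
print(ratio)           # 3.25
print(relative_gain)   # 2.25, i.e. a 225% relative increase
```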
What This Unlocks
SkillAxe suggests that the bottleneck isn't LLM capability per se, but the lack of a self-correction mechanism for skill authoring. Hook this into an agent's operational loop and the skills get sharper with every failure. Expect to see similar self-diagnosis approaches baked into agent frameworks rather than treated as an afterthought.
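A self-correction loop of this kind can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's algorithm: `refine_skill`, `toy_diagnose`, and `toy_rewrite` are hypothetical names, and the two toy functions are deterministic stand-ins for what would be LLM calls in a real system.

```python
from typing import Callable


def refine_skill(
    skill: str,
    diagnose: Callable[[str], list[str]],
    rewrite: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Diagnose and rewrite a skill until no issues remain or rounds run out.

    Returns the final skill text and the number of rewrites performed.
    """
    rewrites = 0
    for _ in range(max_rounds):
        issues = diagnose(skill)
        if not issues:
            break  # nothing left to fix
        skill = rewrite(skill, issues)
        rewrites += 1
    return skill, rewrites


# Toy stand-ins for the LLM-backed steps (hypothetical checks).
def toy_diagnose(skill: str) -> list[str]:
    issues = []
    if "Use when" not in skill:
        issues.append("missing trigger description")
    if "Steps:" not in skill:
        issues.append("missing step list")
    return issues


def toy_rewrite(skill: str, issues: list[str]) -> str:
    # Address only the first reported issue per round.
    if issues[0] == "missing trigger description":
        return "Use when the task involves CSV files.\n" + skill
    return skill + "\nSteps: load, validate, write."


final_skill, rewrites = refine_skill("Clean the CSV.", toy_diagnose, toy_rewrite)
print(rewrites)  # 2
print(final_skill)
```

The `max_rounds` cap is a design choice of this sketch: it bounds cost and guards against a rewrite step that never satisfies the diagnosis.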
Source: SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Domain: arxiv.org