Self-Harness Lets AI Agents Rewrite Their Own Rules, Lifts Performance 33-60%

Relative performance improvements of 33-60% on Terminal-Bench-2.0 - not from swapping in a bigger model, but from letting the model rewrite its own system prompt, tools, and runtime policies. That's the headline finding from Shanghai AI Lab's Self-Harness framework, and it attacks a problem every agent developer knows: the harness often matters more than the base model.

The Harness is the Real Bottleneck

Most agent failures come from the harness - the wrapper of system prompts, tool definitions, memory management, and failure-recovery logic - not from the underlying LLM. An agent that blindly retries the same failing command, or reports success without checking the output, is exhibiting a harness failure. Hangfan Zhang, the paper's lead author, told VentureBeat that experienced engineers can still propose better changes than an LLM today. The bottleneck, then, isn't human capability; it's that harness engineering relies on ad hoc debugging and intuition rather than a systematic feedback loop. Self-Harness replaces that guesswork with empirical evidence by letting the agent observe its own execution traces and edit its own harness.

Three-Stage Loop: Mining, Proposing, Validating

The framework runs a tight three-step cycle. First, weakness mining: the agent runs tasks, categorizes failure traces, and detects model-specific patterns. Second, harness proposal: the agent generates a set of minimal, targeted harness modifications tied to those failure modes. Third, proposal validation: candidate edits are run against regression tests and promoted only if they improve performance without degrading held-out tasks. The loop runs without a human engineer or a stronger external model. The system used a minimal harness built on DeepAgent SDK, keeping the model backend, tools, benchmark environment, and evaluator fixed.
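The three stages can be sketched as one promotion step. This is a minimal illustration under stated assumptions, not the paper's implementation: `Trace`, `Harness`, `mine_weaknesses`, `pass_rate`, and `self_harness_step` are hypothetical names, the harness is modeled as a plain dict of settings, and `evaluate` and `propose` are stand-ins for the benchmark runner and the model's proposal call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical model of a harness: a flat mapping of setting name -> value.
Harness = Dict[str, str]


@dataclass(frozen=True)
class Trace:
    """Outcome of one task run (hypothetical structure)."""
    task_id: str
    passed: bool
    failure_tag: str = ""  # e.g. "duplicate_retry"; empty when the task passed


def mine_weaknesses(traces: Iterable[Trace]) -> Counter:
    """Stage 1: count failure categories across the failed traces."""
    return Counter(t.failure_tag for t in traces if not t.passed)


def pass_rate(traces: List[Trace]) -> float:
    """Fraction of traces that passed (0.0 for an empty list)."""
    return sum(t.passed for t in traces) / len(traces) if traces else 0.0


def self_harness_step(
    harness: Harness,
    evaluate: Callable[[Harness, List[str]], List[Trace]],
    propose: Callable[[Harness, Counter], List[Harness]],
    dev_tasks: List[str],
    held_out_tasks: List[str],
) -> Harness:
    """Run one mine -> propose -> validate cycle and return the harness to keep."""
    base_dev = evaluate(harness, dev_tasks)
    base_held = pass_rate(evaluate(harness, held_out_tasks))
    weaknesses = mine_weaknesses(base_dev)

    best, best_dev = harness, pass_rate(base_dev)
    # Stage 2: candidates are small edits tied to the mined failure modes.
    for candidate in propose(harness, weaknesses):
        dev = pass_rate(evaluate(candidate, dev_tasks))
        held = pass_rate(evaluate(candidate, held_out_tasks))
        # Stage 3: promote only if dev improves and held-out does not regress.
        if dev > best_dev and held >= base_held:
            best, best_dev = candidate, dev
    return best
```

The gate in the last `if` is the part that matters: a candidate that fixes the mined failure but breaks a held-out task is discarded, which is why the loop depends on a trustworthy evaluator.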

Specific Fixes for Specific Models

The headline numbers are 33-60% relative improvements on Terminal-Bench-2.0 for MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. But the details matter more. MiniMax M2.5 had a habit of getting stuck exploring dataset configurations until timeout. Self-Harness wrote a "loop breaker" that forces the agent to stop after 50 tool calls and redirect. Qwen3.5-35B-A3B blindly retried file overwrite errors until it deleted needed files. Self-Harness added a strict command-retry discipline forbidding exact duplicate commands. GLM-5 lost environment changes across commands. Its self-generated harness introduced rules to persist PATH variables across shell sessions. These aren't generic prompt tweaks - they are precise, surgical changes driven by failure traces.
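Two of these fixes, the loop breaker and the retry discipline, amount to small runtime guards. The sketch below is an illustration only: `RuntimePolicy` and its methods are hypothetical names, the 50-call limit is the figure quoted above, and treating "exact duplicate" as "the same command string that already failed" is an assumption about how the rule was applied.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

MAX_TOOL_CALLS = 50  # limit quoted in the article for the MiniMax M2.5 loop breaker


@dataclass
class RuntimePolicy:
    """Hypothetical guard consulted before each tool call."""
    max_tool_calls: int = MAX_TOOL_CALLS
    calls: int = 0
    failed_commands: Set[str] = field(default_factory=set)

    def check(self, command: str) -> Optional[str]:
        """Return a redirect message if the call is blocked, else None."""
        if self.calls >= self.max_tool_calls:
            # Loop breaker: stop open-ended exploration once the budget is spent.
            return "Tool-call budget exhausted: stop exploring and change approach."
        if command in self.failed_commands:
            # Retry discipline: never re-issue a command that already failed verbatim.
            return "This exact command already failed: modify it before retrying."
        self.calls += 1
        return None

    def record(self, command: str, exit_code: int) -> None:
        """Remember commands that exited non-zero so duplicates can be refused."""
        if exit_code != 0:
            self.failed_commands.add(command)
```

A harness would call `check` before executing a command, feed any returned message back to the model instead of running it, and call `record` with the exit code afterwards. Blocked calls do not count against the budget in this sketch, which is a design choice rather than something the source specifies.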

The Cost: Compute for Evaluation

None of this is free. Self-Harness trades human engineering hours for API tokens and evaluation infrastructure. Repeated proposal generation, parallel candidate evaluation, and regression testing burn compute. More importantly, the system depends on deterministic, strict verifiers. As Zhang put it, "the evaluation system is not an optional component." Without a reliable ground truth, the risk of promoting bad updates is real. That limits deployment to environments where failures are measurable and trial-and-error is safe - coding, internal workflows, DevOps pipelines. Stay away from medical, safety-critical, or subjective domains.

The role of the engineer shifts from tweaking prompts to designing feedback systems. As models grow more capable, the harness will move outward - connecting to richer environments. Until evaluation becomes robust beyond what humans can verify, the human remains the feedback architect.

Source: Researchers introduce Self-Harness, a framework that lets AI agents rewrite their own rules, boosting performance up to 60%
Domain: venturebeat.com