
JetBrains SRFT Boosts LLM Agents by Masking Mistakes in Failed Runs

blog.jetbrains.com · @systems_wire · 1 hour ago · Artificial Intelligence · 3 comments

Adding failed trajectories with critical-step masking to the training mix lifts Qwen2.5-Coder-32B-Instruct from 7.0% to 32.2% on SWE-bench Verified, outperforming standard rejection sampling fine-tuning.

jetbrains research · step rejection fine tuning · swe bench · llm agents · training methods

JetBrains Research estimates that at least 76% of steps in failed agent trajectories are not harmful, and built a training method that exploits that signal to beat standard rejection sampling fine-tuning by 1.3 percentage points on SWE-bench Verified.

The Waste in Rejection Sampling

Standard Rejection-sampling Fine-Tuning (RFT) throws away every trajectory that doesn't end in a successful patch. The SWE-smith dataset, for example, discarded 61% of all collected runs. JetBrains' manual analysis of 20 failed trajectories showed why that is wasteful: at most 24% of the steps in those failed runs were actually harmful. The remaining 76% or more were productive exploration, codebase navigation, or harmless tool actions.

When you train a student model on a teacher's trajectories, you get two kinds of improvement: learning "smart" tokens (better reasoning, tool use) and learning the path to success. RFT filters on the second, so it discards the first wherever it appears in a failed trajectory, and most steps in failed trajectories still carry it.

SRFT: Mask Mistakes, Keep Context

Step Rejection Fine-Tuning (SRFT) uses a separate LLM as a critic. That critic scans the entire failed trajectory in one pass and labels each step as good, unnecessary, mistake, or recover. The critic call is cheap compared to generating the trajectory itself.

During training, SRFT does not remove the mistake steps. Instead, it masks the loss for tokens inside those steps. The model still sees the full context, including the mistake and any subsequent recovery, but it is not trained to reproduce the harmful action. This is the same masking technique already used for user prompts in standard next-token prediction.
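
A minimal sketch of that masking, assuming a trajectory is stored as a list of steps that each carry token ids, a role, and a critic label. The names `Step`, `build_labels`, and `masked_labels` are hypothetical, the choice to mask only the `mistake` label is an assumption, and the ignore value -100 follows the common PyTorch cross-entropy convention rather than anything stated in the article.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

# Default ignore_index of torch.nn.CrossEntropyLoss; tokens with this label add no loss.
IGNORE_INDEX = -100


@dataclass
class Step:
    """One trajectory step (hypothetical layout, not the authors' data format)."""
    token_ids: List[int]
    role: str             # "assistant" for agent actions; anything else is context
    label: str = "good"   # critic label: good, unnecessary, mistake, recover


def build_labels(
    steps: Sequence[Step],
    masked_labels: FrozenSet[str] = frozenset({"mistake"}),  # assumption
) -> Tuple[List[int], List[int]]:
    """Keep every token in the input, but only train on unmasked assistant steps."""
    input_ids: List[int] = []
    labels: List[int] = []
    for step in steps:
        input_ids.extend(step.token_ids)
        trainable = step.role == "assistant" and step.label not in masked_labels
        if trainable:
            labels.extend(step.token_ids)
        else:
            # The step stays in the context, but its tokens are excluded from the loss.
            labels.extend([IGNORE_INDEX] * len(step.token_ids))
    return input_ids, labels


trajectory = [
    Step([1, 2, 3], role="user"),
    Step([4, 5], role="assistant", label="good"),
    Step([6], role="tool"),
    Step([7, 8], role="assistant", label="mistake"),
    Step([9], role="tool"),
    Step([10, 11], role="assistant", label="recover"),
]
input_ids, labels = build_labels(trajectory)
print(labels)  # [-100, -100, -100, 4, 5, -100, -100, -100, -100, 10, 11]
```

The mistake step (tokens 7 and 8) remains in `input_ids`, so the recovery step is still conditioned on it, while its label positions are ignored in the same way as the prompt and tool output.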

Results: Failed Runs Alone Lift Base Model 20 Points

JetBrains fine-tuned Qwen2.5-Coder-32B-Instruct on the SWE-smith dataset and tested on SWE-bench Verified. Each experiment was repeated seven times.

  • Base model: 7.0% resolved.
  • Training on 5,000 unresolved trajectories only (with no masking): 27.7%.
  • Standard RFT on 5,000 resolved trajectories only: 30.9%.
  • SRFT on 5,000 unresolved trajectories with masked mistake steps: 29.7%.
  • SRFT on a mix of resolved and unresolved (masked): 32.2%, the best result.

SRFT outperformed standard RFT by 1.3 percentage points (32.2% vs. 30.9%), despite using the same total number of trajectories. Even purely failed runs, when processed with step masking, beat a naive mix of resolved and unresolved trajectories (29.7% vs. 28.5%).
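
The gaps between these configurations can be checked with a few lines of arithmetic. The scores below are the ones reported in this article; the dictionary keys and the `gain` helper are illustrative names only.

```python
# Resolved rates on SWE-bench Verified, in percent, as reported in the article.
resolved_pct = {
    "base": 7.0,
    "unresolved_unmasked": 27.7,
    "rft_resolved": 30.9,
    "srft_unresolved_masked": 29.7,
    "srft_mixed_masked": 32.2,
    "naive_mixed": 28.5,
}


def gain(a: str, b: str) -> float:
    """Difference in percentage points between configuration a and configuration b."""
    return round(resolved_pct[a] - resolved_pct[b], 1)


print(gain("srft_mixed_masked", "rft_resolved"))                # 1.3
print(gain("srft_unresolved_masked", "naive_mixed"))            # 1.2
print(gain("srft_unresolved_masked", "unresolved_unmasked"))    # 2.0
print(gain("unresolved_unmasked", "base"))                      # 20.7
```

Read this way, masking alone is worth 2.0 points on failed-only data, and failed runs without any masking already account for the roughly 20-point lift over the base model.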

The critic strictness matters. Too lenient, and harmful steps leak into training. Too strict, and you lose too many useful steps. Tuning that balance is straightforward with a confidence threshold.
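
One way to picture that threshold, as a sketch only: the article does not describe how the confidence is produced, so the code below assumes the critic emits a per-step score in [0, 1] for "this step is a mistake". The function name `steps_to_mask` and the example scores are hypothetical.

```python
from typing import List, Sequence


def steps_to_mask(mistake_confidence: Sequence[float], threshold: float) -> List[int]:
    """Return indices of steps whose assumed mistake score reaches the threshold."""
    return [i for i, score in enumerate(mistake_confidence) if score >= threshold]


# Hypothetical critic scores for a five-step failed trajectory.
confidences = [0.05, 0.40, 0.92, 0.10, 0.65]

for threshold in (0.3, 0.6, 0.9):
    print(threshold, steps_to_mask(confidences, threshold))
# 0.3 [1, 2, 4]
# 0.6 [2, 4]
# 0.9 [2]
```

A low threshold corresponds to a strict critic that masks many steps and risks discarding useful ones; a high threshold corresponds to a lenient critic that lets more questionable steps into the loss.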

SRFT is a practical method that turns cheap critic calls into a consistent performance boost. If you're generating agent trajectories, stop throwing away the failures. Mask the mistakes and let the model learn from the good steps they contain. The paper has been accepted to the DL4C workshop, co-located with ICML in South Korea this July.


Source: Step Rejection Fine-Tuning: Squeezing More Signal from Noisy Agent Trajectories
Domain: blog.jetbrains.com
