
PROPEL doubles the share of useful training tasks by predicting solver pass rate in a single forward pass

Evaluating a single software-engineering task candidate can take tens of minutes; PROPEL replaces the expensive solver rollouts with a lightweight probe, raising the share of learnable-frontier tasks from 10.1% to 20.0% for a 3B coding solver.


Scoring a single software-engineering task candidate can take tens of minutes of solver rollouts, which makes straightforward reinforcement learning on task generators an expensive dead end. The PROPEL paper, posted on arXiv, addresses this by amortizing the solver cost into a lightweight activation probe that predicts pass rate from a frozen generator's hidden states.

Solver-in-the-Loop Is Intractable at Scale

As reasoning models improve, fixed task distributions saturate. Naively generating synthetic tasks produces ones that are trivial or impossible, wasting compute. The obvious fix - train the task generator with RL to maximize validity and learnability - requires repeated solver rollouts per candidate. For SWE tasks, a single rollout can take tens of minutes; direct optimization is infeasible for any practical budget.

PROPEL breaks this cycle with a two-phase approach. First, build a one-time labeled corpus: generate tasks, run the target solver, record outcomes. Train a lightweight activation probe on top of a frozen generator reference model to predict solver pass rate from the task representation. Second, during generator optimization, replace the expensive solver with the probe - a single forward pass. No rollouts required.
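The two phases can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the hidden states are random vectors standing in for frozen-generator activations, the labels come from a synthetic pass-rate function instead of a real solver, the probe is a plain ridge regression, and the names fit_ridge_probe, predict_pass_rate, frontier_mask and the target band are hypothetical.

```python
import numpy as np

def synthetic_rate(hidden, direction):
    """Stand-in for the true solver pass rate of each task (toy assumption)."""
    return 1.0 / (1.0 + np.exp(-(hidden @ direction) / 2.0))

def fit_ridge_probe(hidden, pass_rate, l2=1.0):
    """Closed-form ridge regression from hidden states to observed pass rate."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])  # bias is regularized too, for simplicity
    return np.linalg.solve(A, X.T @ pass_rate)

def predict_pass_rate(weights, hidden):
    """Probe prediction: one matrix product, clipped to a valid rate."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    return np.clip(X @ weights, 0.0, 1.0)

def frontier_mask(pred, target=0.5, width=0.2):
    """Keep tasks whose predicted pass rate is near the targeted solve rate."""
    return np.abs(pred - target) <= width

rng = np.random.default_rng(0)
dim, rollouts = 16, 8
direction = rng.normal(size=dim)

# Phase 1: one-time labeled corpus. The binomial draw simulates running
# the solver `rollouts` times per task and recording the pass fraction.
hidden_train = rng.normal(size=(500, dim))
observed = rng.binomial(rollouts, synthetic_rate(hidden_train, direction)) / rollouts
weights = fit_ridge_probe(hidden_train, observed)

# Phase 2: score fresh candidates with the probe only, no solver calls.
hidden_new = rng.normal(size=(200, dim))
pred = predict_pass_rate(weights, hidden_new)
keep = frontier_mask(pred)
print(f"{keep.sum()} of {keep.size} candidates predicted near the target rate")
```

In a generator-optimization loop, the probe's prediction would stand in for the solver as the reward signal; the mask above only shows the simpler filtering use of the same score.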

Doubling the Share of Learnable Frontier Tasks

The numbers are concrete and consistent across domains. For coding with a Qwen2.5-3B-Instruct solver, the share of tasks landing at the targeted solve rate jumps from 10.1% to 20.0%. With a Qwen2.5-7B-Instruct solver, from 5.3% to 12.6% - a 2.4x relative improvement. For SWE, using Qwen3.5-27B on repositories not seen during probe or generator training, the share rises from 9.8% to 19.6%.
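The relative gains follow directly from the reported percentages. The short check below uses only the figures quoted above; the helper name relative_gain is ours.

```python
# Reported share of tasks at the targeted solve rate: (before %, after %).
results = {
    "code, Qwen2.5-3B-Instruct solver": (10.1, 20.0),
    "code, Qwen2.5-7B-Instruct solver": (5.3, 12.6),
    "SWE, Qwen3.5-27B solver": (9.8, 19.6),
}

def relative_gain(before, after):
    """Ratio of the share after PROPEL to the share before."""
    return after / before

for name, (before, after) in results.items():
    gain = relative_gain(before, after)
    points = after - before
    print(f"{name}: {before}% -> {after}% ({gain:.2f}x, +{points:.1f} points)")
```

All three ratios come out at roughly 2x or above, which is where the "doubling" framing comes from.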

These aren't cherry-picked highlight numbers; the paper reports consistent gains across math, code, and SWE at multiple model scales. The probe-based proxy introduces noise, but the variance is low enough to drive the generator toward the learnable frontier reliably.

What This Enables for Training Pipelines

PROPEL solves the bottleneck that keeps task generation hand-tuned or sampled from static distributions. With the amortized probe, you can continuously adapt your task generator as the solver improves, without a multi-million-dollar rollout budget. The method is model-agnostic - it works with any transformer-based generator and any solver that outputs a binary pass/fail.

The team released the code and probe training recipes; expect to see this kind of amortized generator optimization become standard in RL training loops for agentic models. When a single forward pass replaces tens of minutes of compute, the economics of curriculum generation change overnight.


Source: Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
Domain: arxiv.org
