QPILOTS Steers Flow Policies to 90% Success Without Retraining

A new inference-time method steers frozen flow and diffusion policies by computing critic gradients on estimates of the clean action, reaching 90% average success across 50 offline-to-online RL tasks and holding up in an evaluation on a frozen VLA model.


90% average success across 50 offline-to-online RL tasks: that's what QPILOTS delivers, and it does so without retouching the original policy. The method leaves frozen flow-matching or diffusion policies intact and steers the denoising process at inference time. No distillation, no multi-step backprop through unstable gradients.

Critic Gradients on Clean Action Projections

The core problem with steering diffusion policies is that intermediate noisy actions produce unreliable critic predictions. QPILOTS side-steps this by first projecting the noisy intermediate state to an estimate of the final clean action, then computing the critic gradient on that estimate. This lets the method exploit the critic's action gradient without the numerical instability of backpropagating through the full multi-step denoising chain.
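To make the project-then-differentiate idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the velocity field, the quadratic critic, the guidance weight, and every function name below are assumptions chosen for illustration. The clean-action estimate is a single Euler jump from the current time to the end of the flow, and the critic gradient is evaluated on that estimate rather than on the noisy action.

```python
# Toy sketch of critic steering on a projected clean action.
# All names, the velocity field, and the critic are hypothetical stand-ins.

def velocity(a, t, mu=(0.0, 0.0)):
    """Stand-in for the frozen policy's velocity field: drift toward mu.
    (t is unused in this toy field; a real flow policy would condition on it.)"""
    return [m - x for x, m in zip(a, mu)]

def project_to_clean(a, t, v):
    """Single-point estimate of the final action: one Euler jump from t to 1."""
    return [x + (1.0 - t) * vx for x, vx in zip(a, v)]

def q_value(a, best=(1.0, 0.5)):
    """Toy critic: higher is better, maximal at `best`."""
    return -sum((x - b) ** 2 for x, b in zip(a, best))

def q_grad(a, best=(1.0, 0.5)):
    """Analytic action gradient of the toy critic."""
    return [-2.0 * (x - b) for x, b in zip(a, best)]

def sample(a0, steps=10, guidance=0.0):
    """Euler integration of the frozen field, nudged by the critic gradient
    computed on the projected clean action (guidance=0 disables steering)."""
    a = list(a0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(a, t)
        a_clean = project_to_clean(a, t, v)   # estimate of the final action
        g = q_grad(a_clean)                   # critic gradient on the estimate
        a = [x + dt * (vx + guidance * gx) for x, vx, gx in zip(a, v, g)]
    return a

plain = sample([-1.0, 1.0], guidance=0.0)
steered = sample([-1.0, 1.0], guidance=0.5)
print("unsteered Q:", q_value(plain))
print("steered Q:  ", q_value(steered))
```

The policy function itself is never modified; only the sampling loop changes. In this toy setup the steered sample ends at a higher critic value than the unsteered one, which is the behavior the method is after, though the exact update rule QPILOTS uses may differ from this additive guidance term.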

Two variants handle that projection differently. QPILOTS-U uses a fast single-point approximation that trades some accuracy for speed. QPILOTS-M draws differentiable posterior samples via a learned auxiliary network, giving higher fidelity at the cost of a small extra model. Both keep the original policy frozen.
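The difference between a single-point projection and a sampled one can be illustrated with a small sketch. Everything here is an assumption: the source does not describe the form of the posterior or the auxiliary network, so the code stands in a Gaussian around the point estimate with a fixed spread `std` (a hypothetical placeholder for whatever a learned network would output) and a toy non-quadratic critic. Because the samples are written as `a_clean + std * noise`, the averaged critic gradient is also the gradient with respect to `a_clean`, which is one way samples can remain differentiable.

```python
import random

# Hypothetical contrast between a point estimate and posterior samples.
# The Gaussian posterior and the fixed `std` are illustrative assumptions.

def q_grad(a):
    """Action gradient of a toy non-quadratic critic Q(a) = -a**4."""
    return -4.0 * a ** 3

def point_gradient(a_clean):
    """Single-point variant: evaluate the critic gradient once."""
    return q_grad(a_clean)

def sampled_gradient(a_clean, std, n_samples=20000, seed=0):
    """Sampled variant: average the critic gradient over reparameterized
    draws a = a_clean + std * eps, with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = a_clean + std * rng.gauss(0.0, 1.0)
        total += q_grad(a)
    return total / n_samples

g_point = point_gradient(0.5)
g_sampled = sampled_gradient(0.5, std=0.5)
print("point gradient:  ", g_point)
print("sampled gradient:", g_sampled)
```

With a curved critic the two gradients disagree, because the point estimate ignores uncertainty about where the clean action will land. That is the kind of fidelity gap a sampled projection is meant to close, at the cost of extra critic evaluations and, in the real method, an extra model.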

Benchmarks Across 50 Tasks and a Frozen VLA

On a standard offline-to-online RL benchmark suite, QPILOTS achieves the best aggregate performance across 50 tasks, with a 90% average success rate. The comparison covers methods that discard gradient information, distill into one-step actors, or repeatedly fine-tune the denoising policy. QPILOTS outperforms all of them in aggregate without any policy modification.

The same approach also steers a large, frozen, pretrained Vision-Language-Action (VLA) foundation model. Across six manipulation tasks in simulation, QPILOTS matches or outperforms prior inference-time steering approaches. For anyone running a large VLA model that's too expensive to fine-tune, this is a practical way to inject task-specific Q-values at test time.

QPILOTS turns the critic from a training-only signal into a runtime controller for diffusion-based policies, opening up test-time adaptation without touching the base model weights.

Source: QPILOTS: Efficient Test-Time Q-Steering for Flow Policies
Domain: arxiv.org
