SCORE improves average success rate from 37.8% to 89.9% across eight dexterous multi-fingered manipulation tasks, using only simulation for RL, with no additional real-world training required. The best baseline managed 59.5%. That’s a 52.1 percentage point jump over the base policy, and SCORE gets there in 36.8% fewer steps.
Most real-world robot policies are imprecise, slow, and brittle. Simulation-based RL is cheap but dangerous: unconstrained optimization exploits contact and dynamics mismatches, producing behaviors that fail on hardware. Standard regularization clamps the policy to an imperfect prior, capping improvement.
Flow Steering to the Rescue
The WEIRD Lab team at UW behind SCORE (short for Support-Constrained Off-Domain REinforcement) solves this by constraining the RL policy to stay within the support of a generative policy pretrained on real data. They implement this constraint via flow steering, which restricts the RL agent to actions the base policy can already produce. The result: the policy improves without drifting into untransferable contact exploits or unsafe gaits.
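To make the support constraint concrete, here is a minimal sketch of one way steering can work. It assumes the RL agent picks the input noise of a frozen flow policy instead of emitting actions directly. That reading, the toy velocity field, and the names `decode_action`, `clip_to_ball`, and `steered_action` are all illustrative assumptions, not SCORE's actual implementation.

```python
import numpy as np

N_STEPS = 10   # hypothetical number of Euler integration steps
RADIUS = 2.0   # hypothetical bound on the steering noise

def decode_action(obs, z, n_steps=N_STEPS):
    """Frozen toy flow policy: Euler-integrate dx/dt = mu(obs) - x starting from noise z."""
    mu = np.tanh(obs)                  # stand-in for the action favored by real data
    x = np.array(z, dtype=float)
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x = x + dt * (mu - x)          # velocity field pulls samples toward mu
    return x

def clip_to_ball(z, radius=RADIUS):
    """Keep the steering noise inside a ball of the given radius."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def steered_action(obs, raw_z):
    """The RL output is a noise vector, not an action; the frozen flow decodes it."""
    return decode_action(obs, clip_to_ball(np.asarray(raw_z, dtype=float)))

rng = np.random.default_rng(0)
obs = np.array([0.3, -0.8])
center = decode_action(obs, np.zeros(2))    # action decoded from zero noise
shrink = (1.0 - 1.0 / N_STEPS) ** N_STEPS   # factor by which this toy flow contracts noise

# Even wildly large RL outputs decode to actions near what the base policy produces.
max_dev = max(
    np.linalg.norm(steered_action(obs, rng.normal(scale=50.0, size=2)) - center)
    for _ in range(1000)
)
print(max_dev <= shrink * RADIUS + 1e-9)
```

In this toy setup the agent can move the action anywhere inside the region the frozen policy covers, but no further, which is the property the support constraint is after.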
SCORE learns from sparse rewards, avoids distillation, and leaves the base policy untouched. That’s minimal engineering overhead for a massive gain.
Real-to-Sim-to-Real, Not Sim-to-Real
The architecture flips the usual sim-to-real pipeline. Instead of training in simulation and hoping it transfers, SCORE takes a real-world policy, moves it to simulation for RL under a support constraint, and deploys the improved policy back to hardware. The constraint is the linchpin: it prevents the simulation training from chasing unrealistic contact regimes that would fail on the actual robot.
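The loop described above can be sketched in a few lines. This is a toy under stated assumptions: a one-dimensional action, a hypothetical frozen `base_policy`, a sparse `sim_reward`, and a one-shot random search standing in for the RL algorithm. None of these names or constants come from the paper.

```python
import numpy as np

SHRINK = 0.35   # hypothetical: how strongly the frozen base policy contracts noise
RADIUS = 2.0    # hypothetical bound on steering noise (the support constraint)

def base_policy(z, mu=0.0):
    """Frozen stand-in for the real-data policy: maps noise z to a 1-D action."""
    return mu + SHRINK * z

def sim_reward(action, target=0.3, tol=0.05):
    """Sparse simulated reward: 1.0 on success, 0.0 otherwise."""
    return float(abs(action - target) < tol)

def improve_in_sim(n_samples=500, seed=0):
    """Search over steering noise only; the base policy is never updated."""
    rng = np.random.default_rng(seed)
    z = np.clip(rng.normal(size=n_samples), -RADIUS, RADIUS)
    rewards = np.array([sim_reward(base_policy(zi)) for zi in z])
    if rewards.sum() == 0:
        return 0.0                           # no success found: fall back to zero noise
    return float(z[rewards == 1.0].mean())   # steer toward noise that succeeded

# "Real" step: the base policy samples its own noise and succeeds only occasionally.
eval_noise = np.random.default_rng(1).normal(size=2000)
base_rate = float(np.mean([sim_reward(base_policy(zi)) for zi in eval_noise]))

# "Sim" step: find a steering noise using sparse rewards only.
z_star = improve_in_sim()

# Deployment step: run the same frozen base policy with the steered noise.
steered_success = sim_reward(base_policy(z_star))
print(base_rate, z_star, steered_success)
```

Because the search only ever chooses noise within `RADIUS`, the deployed action is one the frozen policy could already have produced. The simulator is used to find which of those actions succeed, not to invent new ones.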
The evaluation covers eight manipulation tasks, from peg insertion to cloth folding, on a dexterous multi-fingered hand. Ablations show that removing the support constraint collapses success rates, confirming that unconstrained simulation RL is worse than doing nothing.
This is the first demonstration that simulation can substantially improve real-world manipulation policies when the optimization space is properly bounded. Expect this real-to-sim-to-real paradigm to show up in every robotics lab that’s tired of brittle, hand-tuned policies.
Source: Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience
Domain: arxiv.org