A vision-language model (VLM) reward driven by a single global prompt is nearly flat for much of a long-horizon robot trajectory. That makes early progress invisible to the agent, and it is the gap RMTL sets out to close.
The Flat Reward Problem
Using pretrained VLMs as zero-shot reward generators saves you from hand-crafting dense reward functions or collecting human demonstrations. The tradeoff is that a single text prompt, such as "pick and place the object", is applied across the whole episode. For short tasks that works. For long-horizon manipulation with randomized initial conditions, the VLM reward signal is essentially flat until the agent gets close to the goal. The agent cannot tell whether it is making any progress during the first 80% of the trajectory. RL with flat rewards is basically random search.
Micro-Task Decomposition with Multi-View VLMs
RMTL cuts the problem into a small set of language-described micro-tasks. For the Fetch environment the authors use three short stage-specific prompts, one per sub-stage of the manipulation. At each step the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task. Averaging across multiple camera views reduces the impact of occlusions that affect only one view. That gives the agent a meaningful signal to follow at every timestep, not just near the final goal.
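The per-step reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt strings, the `multi_view_reward` name, and the `score_fn` callable (standing in for whatever VLM image-text scorer is used) are all assumptions made for this example.

```python
from typing import Callable, Sequence, TypeVar

View = TypeVar("View")  # whatever a camera frame is in your setup

# Hypothetical prompts, one per sub-stage; not the paper's wording.
MICRO_TASK_PROMPTS = (
    "the gripper is moving toward the object",
    "the gripper is holding the object",
    "the object is at the target location",
)


def multi_view_reward(
    views: Sequence[View],
    stage: int,
    score_fn: Callable[[View, str], float],
) -> float:
    """Average a VLM image-text score over all camera views.

    `score_fn(view, prompt)` is an assumed stand-in for a VLM scorer;
    `stage` indexes the currently active micro-task prompt.
    """
    if not views:
        raise ValueError("at least one camera view is required")
    prompt = MICRO_TASK_PROMPTS[stage]
    scores = [score_fn(view, prompt) for view in views]
    # A plain mean: one occluded view lowers the reward only partially.
    return sum(scores) / len(scores)
```

In use, `views` would hold one frame per camera at the current timestep, and `stage` would come from whatever selects the active micro-task. Passing the scorer in as a callable keeps the sketch independent of any particular VLM library.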
Reverse Curriculum and Hierarchical Manager
A reverse curriculum starts the agent on easier initial conditions and gradually exposes it to harder ones, which avoids the usual cold-start failure. The low-level PPO worker is first trained with a fixed distance-based rule that selects the active micro-task. Once that works, the rule is replaced with a learned hierarchical manager that decides which micro-task prompt to use at each step. The result is a fully learned hierarchical policy without any hand-coded phase switching. No additional prompt tuning is needed for the VLM.
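To make the first training phase concrete, here is a rough sketch of a distance-based stage selector and a simple reverse curriculum. Everything specific in it is an assumption for illustration: the stage names, the `near` threshold, the use of start-to-goal distance as the measure of difficulty, and the promotion rule based on success rate. The paper's actual rule and schedule may differ.

```python
import math
from typing import Sequence

# Hypothetical stage indices for a three-stage Fetch-style task.
REACH, GRASP, PLACE = 0, 1, 2


def select_micro_task(
    gripper_pos: Sequence[float],
    object_pos: Sequence[float],
    goal_pos: Sequence[float],
    near: float = 0.05,  # assumed threshold, in the environment's units
) -> int:
    """Fixed distance-based rule that picks the active micro-task."""
    if math.dist(gripper_pos, object_pos) > near:
        return REACH  # gripper has not reached the object yet
    if math.dist(object_pos, goal_pos) > near:
        return GRASP  # object is held but still far from the goal
    return PLACE      # object is close to the goal


class ReverseCurriculum:
    """Widen the allowed start-to-goal distance as the agent improves."""

    def __init__(self, max_distance: float, n_levels: int = 5,
                 promote_at: float = 0.8) -> None:
        self.max_distance = max_distance
        self.n_levels = n_levels
        self.promote_at = promote_at  # assumed success-rate threshold
        self.level = 1                # start with the easiest level

    def start_distance_bound(self) -> float:
        """Largest initial distance allowed at the current level."""
        return self.max_distance * self.level / self.n_levels

    def update(self, success_rate: float) -> None:
        """Move to a harder level once the current one is mastered."""
        if success_rate >= self.promote_at and self.level < self.n_levels:
            self.level += 1
```

A training loop under these assumptions would sample each episode's initial state within `start_distance_bound()`, call `select_micro_task` at every step to choose the reward prompt, and call `update` with the recent success rate after each evaluation round. In the second phase described above, the learned manager would take over the role of `select_micro_task`.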
Experiments on the Fetch manipulation benchmark show that RMTL provides more informative reward signals and enables faster learning than the single-prompt baseline. The improvement comes directly from decomposing the VLM reward, not from a bigger model or more data.
RMTL's architecture points toward a future where language-guided RL works for real-world tasks where you cannot hand-design a reward but a few natural-language sub-goals are easy to specify.
Source: RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards
Domain: arxiv.org