16 out of 24 embodied VLM benchmarks — that’s how many state-of-the-art scores Embodied-R1.5 snatches from models like Gemini-Robotics-ER-1.5 and GPT-5.4. And it does it with only 8 billion parameters.
15B Tokens of Embodied Data, One Architecture
The team behind Embodied-R1.5 built three automated data construction pipelines to cover the capabilities that matter: embodied cognition, task planning, correction, and pointing. Together, those pipelines churned out over 15 billion tokens of training data. To handle the resulting heterogeneity across tasks, they designed a multi-task balanced RL recipe rather than a standard RL setup.
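The summary above doesn't spell out how the balanced RL recipe works internally, so the snippet below is only a generic illustration of one way to keep heterogeneous tasks from drowning each other out: standardizing rewards per task before they reach the policy update. The function name `balance_rewards`, the task labels, and the reward scales are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def balance_rewards(samples, eps=1e-6):
    """Standardize raw rewards within each task.

    samples: list of (task_name, raw_reward) pairs.
    Returns a list of standardized rewards in the same order.
    NOTE: an assumed, generic balancing scheme, not the authors' recipe.
    """
    # Group raw rewards by task so each task gets its own statistics.
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)

    # Per-task mean and population standard deviation.
    stats = {t: (mean(rs), pstdev(rs)) for t, rs in by_task.items()}

    # Shift and scale each reward by its own task's statistics.
    return [(r - stats[t][0]) / (stats[t][1] + eps) for t, r in samples]


# Hypothetical batch: pointing rewards live in [0, 1], planning rewards in [10, 30].
batch = [("pointing", 0.0), ("pointing", 1.0), ("planning", 10.0), ("planning", 30.0)]
balanced = balance_rewards(batch)
```

After standardization, both tasks contribute rewards on the same scale, so a task with numerically large rewards no longer dominates the gradient just because of its units.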
Planner-Grounder-Corrector: Closed-Loop Autonomy
Single models usually choke on long-horizon tasks because they can’t self-correct. Embodied-R1.5 wraps its reasoning loop into a Planner-Grounder-Corrector (PGC) framework. The same model plans, grounds its actions, detects failure, and replans — no separate modules, no external verifier. In zero-shot real-robot experiments, it handled instruction following, affordance grounding, articulated object manipulation, and complex long-horizon tasks without task-specific tuning.
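To make the plan, ground, check, replan cycle concrete, here is a minimal sketch of a closed loop in that shape. It is a toy under stated assumptions, not the authors' implementation: `run_pgc`, `StepResult`, the four callables, and the simulated failure are all hypothetical. In the real system the same model would play all of the planner, grounder, and corrector roles; here they are plain Python functions so the control flow can run on its own.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StepResult:
    step: str
    target: Tuple[int, int]
    ok: bool


def run_pgc(task, plan, ground, execute, check, max_replans=3):
    """Closed loop: plan, ground each step, execute, check, replan on failure."""
    history = []
    replans = 0
    steps = plan(task, history)
    while steps:
        step = steps.pop(0)
        target = ground(step)                # e.g. a pixel or point to act on
        observation = execute(step, target)  # robot or simulator call
        ok = check(step, observation)        # failure detection
        history.append(StepResult(step, target, ok))
        if not ok:
            if replans >= max_replans:
                return False, history
            replans += 1
            steps = plan(task, history)      # replan with the failure in context
    return True, history


def demo():
    """Toy run where the first grasp attempt fails and the loop recovers."""
    all_steps = ["open drawer", "pick up cup", "place cup in drawer"]
    attempts = {}

    def plan(task, history):
        # Keep only the steps that have not succeeded yet.
        done = {r.step for r in history if r.ok}
        return [s for s in all_steps if s not in done]

    def ground(step):
        return (len(step), 0)  # placeholder coordinate, not real grounding

    def execute(step, target):
        attempts[step] = attempts.get(step, 0) + 1
        # Simulate a slipped grasp on the first "pick up cup" attempt.
        return not (step == "pick up cup" and attempts[step] == 1)

    def check(step, observation):
        return observation

    return run_pgc("put the cup in the drawer", plan, ground, execute, check)


success, history = demo()
```

In this toy run the loop records one failed grasp, replans from the remaining steps, and then finishes the task. The `max_replans` cap is a design choice of the sketch that keeps a persistently failing step from looping forever.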
VLA Fine-Tuning with Pocket Change
Because the base model internalizes embodied reasoning, you can turn Embodied-R1.5 into a Vision-Language-Action policy with a surprisingly small amount of data. The resulting VLA outperforms leading models like $\pi_{0.5}$ across four popular manipulation benchmark suites. No need to train from scratch or collect millions of demonstrations.
Open Source, Not a Press Release
The authors released model weights, the full training dataset, training code, and a new evaluation framework called EmbodiedEvalKit designed specifically for embodied tasks. If you’re building a robot that needs to reason, plan, and recover from mistakes, you now have a concrete baseline to beat — and the data to beat it with.
Source: Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Domain: arxiv.org