Guava's Universal Harness Unlocks Embodied Manipulation in Compact 4B Models

A systematic design exploration reveals three key ingredients for embodied tool use; a distilled 4B model trained on fewer than 2K simulated trajectories performs comparably to frontier proprietary systems on...

Tags: guava, embodied manipulation, robotics, large language models, distillation, simulation

A 4B open-source model, trained on fewer than 2,000 simulated trajectories, performs comparably to frontier proprietary systems on real-world manipulation tasks thanks to Guava, a harness framework built on three design principles. The Guava team systematically explored agent workflows, action spaces, and observation spaces to find what actually makes an embodied agent work.

Three Ingredients That Matter

Guava identifies three key ingredients: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. Drop any one of them and performance suffers. These are not theoretical preferences; they come out of a design-space search that treats the harness as a first-class component, separate from the reasoning model itself.

Semantic actions are key. Instead of controlling joints or end-effector poses directly, Guava operates on high-level commands like "pick" or "place." That abstraction lets even small models reason about goals without drowning in low-level geometry. The iterative loop ensures the model can correct mistakes by re-observing the scene after each action.
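The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Guava's implementation: `ToyScene`, `Observation`, `toy_policy`, and `run_episode` are hypothetical names, a rule-based function stands in for the reasoning model, and a text caption plus gripper state stands in for real multimodal observations. The first grasp is made to fail on purpose, so the example shows how re-observing after each action lets the agent retry.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (semantic verb, target), e.g. ("pick", "cup")


@dataclass
class Observation:
    caption: str                 # stand-in for a camera image (assumption)
    held_object: Optional[str]   # stand-in for gripper state
    on_table: List[str]
    in_bin: List[str]


@dataclass
class ToyScene:
    """Hypothetical simulator that understands only 'pick' and 'place'."""
    on_table: List[str]
    in_bin: List[str] = field(default_factory=list)
    held: Optional[str] = None
    slip_once: bool = True  # the first grasp fails, to exercise recovery

    def observe(self) -> Observation:
        caption = f"table: {self.on_table}, bin: {self.in_bin}"
        return Observation(caption, self.held, list(self.on_table), list(self.in_bin))

    def execute(self, action: Action) -> None:
        verb, target = action
        if verb == "pick" and target in self.on_table and self.held is None:
            if self.slip_once:
                self.slip_once = False  # simulated grasp failure: nothing changes
                return
            self.on_table.remove(target)
            self.held = target
        elif verb == "place" and target == "bin" and self.held is not None:
            self.in_bin.append(self.held)
            self.held = None


def toy_policy(goal_object: str, obs: Observation) -> Optional[Action]:
    """Rule-based stand-in for the reasoning model; None means 'done'."""
    if goal_object in obs.in_bin:
        return None
    if obs.held_object == goal_object:
        return ("place", "bin")
    return ("pick", goal_object)


def run_episode(
    scene: ToyScene,
    policy: Callable[[str, Observation], Optional[Action]],
    goal_object: str,
    max_steps: int = 10,
) -> List[Action]:
    history: List[Action] = []
    for _ in range(max_steps):
        obs = scene.observe()              # perceive
        action = policy(goal_object, obs)  # reason
        if action is None:
            break
        scene.execute(action)              # act, then loop back and re-observe
        history.append(action)
    return history


scene = ToyScene(on_table=["cup", "block"])
history = run_episode(scene, toy_policy, "cup")
print(history)
```

Because the policy decides from a fresh observation at every step, the failed grasp needs no special error handling: the cup is still on the table, so the policy simply issues "pick" again.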

Distillation Beats End-to-End Training

Many embodied approaches try to learn a monolithic vision-language-action policy end to end. Guava takes the opposite route: build a harness that externalizes perception, planning, and control, then distill the reasoning into a 4B model using fewer than 2,000 simulated trajectories. No real-world demonstrations are needed.
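The article does not describe the distillation pipeline itself, so the following is only a guess at one common shape such a step could take: flattening each simulated trajectory into per-step prompt/target pairs for supervised fine-tuning. The function name `trajectory_to_examples`, the record fields, the prompt layout, and the choice to keep only successful rollouts are all assumptions made for illustration.

```python
from typing import Dict, List


def trajectory_to_examples(
    instruction: str,
    steps: List[Dict[str, str]],
    success: bool,
) -> List[Dict[str, str]]:
    """Turn one simulated trajectory into per-step prompt/target pairs.

    Each step is a dict with 'observation' and 'action' keys (hypothetical schema).
    """
    if not success:
        return []  # assumption: failed rollouts are filtered out
    examples: List[Dict[str, str]] = []
    previous: List[str] = []
    for step in steps:
        prompt = (
            f"Instruction: {instruction}\n"
            f"Previous actions: {', '.join(previous) or 'none'}\n"
            f"Observation: {step['observation']}"
        )
        # The target is the semantic action taken at this step.
        examples.append({"prompt": prompt, "target": step["action"]})
        previous.append(step["action"])
    return examples


steps = [
    {"observation": "cup on table, gripper empty", "action": "pick(cup)"},
    {"observation": "cup in gripper", "action": "place(bin)"},
]
examples = trajectory_to_examples("Put the cup in the bin", steps, success=True)
rejected = trajectory_to_examples("Put the cup in the bin", steps, success=False)
```

Each example pairs what the agent knew at a step with the action it took, so a small model can be trained to imitate the step-by-step decisions rather than a whole plan at once.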

Results in simulation and on physical robots show Guava's 4B model performing comparably to frontier proprietary systems (e.g., GPT-4V with action wrappers). It generalizes to unseen objects, novel language instructions, and long-horizon tasks that require multiple steps and recovery from failure. The team reports strong emergent capabilities, which suggests that the harness design, rather than model size alone, drives performance.

What This Unlocks

Guava suggests that a well-designed harness can be model-agnostic and scalable. Compact open-source models can now tackle embodied manipulation without expensive real-world data collection or massive compute. Expect more teams to adopt this harness approach, because the bottleneck isn't the brain; it's the body's interface.


Source: Guava: An Effective and Universal Harness for Embodied Manipulation
Domain: arxiv.org
