AEM Pretraining Gives Robot Manipulation a Temporal Memory Edge

AEM, the Action-Effect Memory pretraining framework, learns compact temporal representations from vision-action history that consistently beat single-frame pretraining and direct frame stacking on both simulated and real robot manipulation benchmarks.

Why Single-Frame Encoding Fails in Manipulation

Most robotic pretraining methods encode a single image and call it a day. That works when the current frame tells you everything, but manipulation lives in partial observability: you can't see the object behind the gripper or the compliance in the joint. AEM targets exactly this. It models manipulation as an action-driven interaction process, interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories. The result is a model of action-conditioned state evolution that captures the temporal structure single-frame encoding misses.
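To make the interleave-and-mask idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the feature shapes, the zero mask value, and the choice to end the sequence on a vision token (T frames, T-1 actions) are all assumptions made for illustration.

```python
import numpy as np

def interleave_history(vision_feats, action_feats):
    """Build the token sequence v_1, a_1, v_2, ..., a_{T-1}, v_T.

    vision_feats: (T, D) array, one feature vector per frame (assumed shape).
    action_feats: (T-1, D) array, the action taken between consecutive frames.
    """
    num_frames, dim = vision_feats.shape
    assert action_feats.shape == (num_frames - 1, dim)
    tokens = np.empty((2 * num_frames - 1, dim), dtype=vision_feats.dtype)
    tokens[0::2] = vision_feats   # even slots hold vision tokens
    tokens[1::2] = action_feats   # odd slots hold action tokens
    return tokens

def mask_tokens(tokens, mask_ratio, rng, mask_value=0.0):
    """Hide a random subset of tokens; return the corrupted copy and the mask."""
    num_tokens = tokens.shape[0]
    num_masked = int(mask_ratio * num_tokens)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    is_masked = np.zeros(num_tokens, dtype=bool)
    is_masked[masked_idx] = True
    corrupted = tokens.copy()
    corrupted[is_masked] = mask_value
    return corrupted, is_masked

def masked_reconstruction_loss(pred, target, is_masked):
    """Mean squared error computed only on the hidden tokens."""
    diff = pred[is_masked] - target[is_masked]
    return float(np.mean(diff ** 2))
```

In a real pipeline, a learned encoder and decoder would sit between `mask_tokens` and `masked_reconstruction_loss`. The sketch only shows the data layout and the training signal: the model sees an incomplete history and is scored on what it fills back in.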

AEM’s Mamba-Encoded Temporal Bottleneck

Here's where the architecture gets clever. AEM compresses the entire vision-action history into a single-vector temporal bottleneck using Mamba encoding on the final vision token. That single vector serves as global context for both decoding and downstream control. There is no large transformer attention matrix with its quadratic cost and no growing stack of frames, just one compact state. The Mamba-encoded output keeps inference efficient while preserving the temporal signal. Ablation studies confirm that history-aware pretraining outperforms both single-frame pretraining and frame stacking, and does so at lower latency.
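The bottleneck idea can be sketched with a plain linear recurrence. This is a deliberately simplified stand-in, not Mamba itself: Mamba is a selective state-space model whose parameters depend on the input, while the `decay` and `w_in` below are fixed, hypothetical parameters. What the sketch does show is the property the article highlights: the history is folded into one fixed-size vector, read out at the last token, at a cost linear in sequence length.

```python
import numpy as np

def recurrent_bottleneck(tokens, decay, w_in):
    """Fold a token sequence into one state vector.

    Update rule (a fixed diagonal recurrence, assumed for illustration):
        h_t = decay * h_{t-1} + tokens[t] @ w_in

    tokens: (L, D) interleaved vision-action tokens.
    decay:  (H,) per-channel decay factors in [0, 1).
    w_in:   (D, H) input projection.
    Returns the (H,) state after the final token.
    """
    state = np.zeros(w_in.shape[1])
    for token in tokens:
        state = decay * state + token @ w_in
    return state

# The output size depends only on H, not on how long the history is.
short_state = recurrent_bottleneck(np.ones((3, 2)), np.full(2, 0.5), np.eye(2))
long_state = recurrent_bottleneck(np.ones((50, 2)), np.full(2, 0.5), np.eye(2))
```

A frame-stacking baseline, by contrast, hands the policy an input that grows with every extra frame. Here the downstream controller always receives a vector of the same size.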

Real-World Gains Over Diffusion and Flow Policies

AEM doesn't propose a new policy—it's a pretraining plug-in. The researchers evaluated it with two popular policy architectures: Diffusion Policy and Flow Policy. In simulation and real-world settings, AEM lifted performance across clean scenes, cluttered and random scenes, and non-Markovian tasks where the current observation alone is ambiguous. Baselines included single-frame pretraining and direct frame stacking; AEM beat both, and it did so with lower computational cost. For anyone building real-time robot controllers, that latency reduction is as valuable as the accuracy gain.

AEM suggests that the next big leap in robot learning won't come from bigger models or more data—it will come from better use of the temporal history we already collect.

Source: Action-Effect Memory Pretraining for Robot Manipulation
Domain: arxiv.org
