
GUI Agents Need Crop-Based Memory: Full Screenshots Increase Grounding Errors

A new failure-mode study reveals that naive visual memory hurts GUI agents, while action-grounded cropping reaches a 33.3% higher task success rate than full-image memory on OSWorld.


Full-screenshot memory in GUI agents doesn't just fail to help; it actively increases grounding errors. A new study on OSWorld shows that switching from whole-screen memory to action-grounded image crops yields a 33.3% higher task success rate than the full-image baseline.

Four Failure Modes in the Perception-Reasoning-Action Pipeline

The authors build a taxonomy of four GUI agent failures: cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error. Each maps to a distinct stage in the perception-reasoning-action loop. I've seen plenty of papers claim memory helps, but this one actually measures where it hurts.

Why Full-Image Memory Makes Action-Level Failures Worse

Prepending full screenshots to the agent's memory pushes the failure distribution in two directions. State-level failures drop, but action-level failures, specifically hidden operation blindness and grounding error, get worse. When memory stores everything the agent saw, the model struggles to pick out what matters for the next action.
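To make the state-level versus action-level split concrete, here is a minimal Python sketch of how one might tally labeled failures and compare two runs. It is not the authors' analysis code. The names FailureMode, ACTION_LEVEL, level_of, and failure_shift are hypothetical, and treating cognitive failure and visual state misunderstanding as the state-level group is an assumption drawn from the summary above, not something the source confirms.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """The four failure modes named in the study's taxonomy."""
    COGNITIVE_FAILURE = "cognitive failure"
    VISUAL_STATE_MISUNDERSTANDING = "visual state misunderstanding"
    HIDDEN_OPERATION_BLINDNESS = "hidden operation blindness"
    GROUNDING_ERROR = "grounding error"


# The article names these two as action-level failures.
# ASSUMPTION: the remaining two modes are treated as state-level.
ACTION_LEVEL = {
    FailureMode.HIDDEN_OPERATION_BLINDNESS,
    FailureMode.GROUNDING_ERROR,
}


def level_of(mode: FailureMode) -> str:
    """Map a failure mode to 'action' or 'state'."""
    return "action" if mode in ACTION_LEVEL else "state"


def failure_shift(baseline: list[FailureMode],
                  with_memory: list[FailureMode]) -> dict[str, int]:
    """Change in failure counts per level (with_memory minus baseline)."""
    before = Counter(level_of(m) for m in baseline)
    after = Counter(level_of(m) for m in with_memory)
    return {lvl: after[lvl] - before[lvl] for lvl in ("state", "action")}
```

A negative "state" value alongside a positive "action" value would reproduce the pattern the study describes for full-image memory. The inputs are whatever per-episode failure labels you have annotated yourself; nothing here supplies the paper's numbers.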

AGMem: Cropping to What Matters

Action-Grounded Visual Memory (AGMem) replaces full screenshots with image crops that capture only the local GUI region tied to a successful action or recovery. This forces the agent to focus on the relevant spatial context. On OSWorld, AGMem achieved a 33.3% higher task success rate than the full-image baseline. The result is clean: remove visual noise, remove failure amplification.
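The idea of cropping to the action footprint can be sketched in a few lines of Python. The summary does not say how AGMem picks its crop regions, so everything below is an assumption for illustration: a fixed-size window centered on the action's screen coordinate and clamped to the screen edges. The names CropBox and action_crop_box and the default half-size of 128 pixels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropBox:
    """Pixel rectangle; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def action_crop_box(x: int, y: int, screen_w: int, screen_h: int,
                    half: int = 128) -> CropBox:
    """Fixed-size window around an action point, shifted to stay on screen.

    ASSUMPTION: a square window of side 2 * half is a stand-in for
    whatever region selection the paper actually uses.
    """
    if not (0 <= x < screen_w and 0 <= y < screen_h):
        raise ValueError("action point lies outside the screen")
    # The window cannot be larger than the screen itself.
    width = min(2 * half, screen_w)
    height = min(2 * half, screen_h)
    # Center on the action point, then clamp so the window stays in bounds.
    left = min(max(x - half, 0), screen_w - width)
    top = min(max(y - half, 0), screen_h - height)
    return CropBox(left, top, left + width, top + height)
```

The (left, top, right, bottom) ordering matches the box convention used by common image libraries such as Pillow's Image.crop, so the result could be passed along to cut the region out of a stored screenshot. Shifting the window at the edges, rather than shrinking it, keeps every memory entry the same size, which is a design choice of this sketch and not a claim about AGMem.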

This work gives practitioners a concrete rule of thumb: when building experiential memory for GUI agents, crop aggressively to the action footprint. Expect to see subsequent papers treat full-screenshot storage as a known anti-pattern rather than a default choice.


Source: Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents
Domain: arxiv.org
