Full-screenshot memory in GUI agents doesn't just fail to help; it actively increases grounding errors. A new study on OSWorld reports that replacing whole-screen memory with action-grounded image crops yields a 33.3% higher task success rate than the full-image baseline.
Four Failure Modes in the Perception-Reasoning-Action Pipeline
The authors build a taxonomy of four GUI agent failures: cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error. Each maps to a distinct stage in the perception-reasoning-action loop. I've seen plenty of papers claim memory helps, but this one actually measures where it hurts.
Why Full-Image Memory Makes Action-Level Failures Worse
Prepending full screenshots to the agent's memory pushes the failure distribution in two opposite directions. State-level failures drop, but action-level failures (specifically hidden operation blindness and grounding error) get worse. When memory stores everything the agent saw, the model can't distinguish what matters for the next action.
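One way to check for this split in your own agent logs is to tally labeled failures by level before and after adding memory. The sketch below is illustrative only: the enum names, the `level_shift` helper, and the grouping of cognitive failure and visual state misunderstanding as "state-level" are assumptions inferred from the summary above, not code or definitions taken from the paper.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class FailureMode(Enum):
    # The four failure modes named in the article.
    COGNITIVE = "cognitive failure"
    VISUAL_STATE = "visual state misunderstanding"
    HIDDEN_OPERATION = "hidden operation blindness"
    GROUNDING = "grounding error"


# The article names these two as action-level failures.
# Treating the remaining two as state-level is an assumption.
ACTION_LEVEL = {FailureMode.HIDDEN_OPERATION, FailureMode.GROUNDING}


def level(mode: FailureMode) -> str:
    """Map a failure mode to 'action' or 'state'."""
    return "action" if mode in ACTION_LEVEL else "state"


def level_counts(failures: Iterable[FailureMode]) -> Counter:
    """Count failures per level."""
    return Counter(level(m) for m in failures)


def level_shift(before: Iterable[FailureMode],
                after: Iterable[FailureMode]) -> Dict[str, int]:
    """Change in failure count per level (after minus before)."""
    b, a = level_counts(before), level_counts(after)
    return {lvl: a[lvl] - b[lvl] for lvl in ("state", "action")}
```

A negative "state" value together with a positive "action" value would match the pattern described above: memory fixes state-level mistakes while amplifying action-level ones.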
AGMem: Cropping to What Matters
Action-Grounded Visual Memory (AGMem) replaces full screenshots with image crops that capture only the local GUI region tied to a successful action or recovery. This forces the agent to focus on the relevant spatial context. On OSWorld, AGMem achieved a 33.3% higher task success rate than the full-image baseline. The result is clean: remove visual noise, remove failure amplification.
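To make the cropping idea concrete, here is a minimal sketch of storing a local patch around an action point instead of the whole screen. It is not AGMem's implementation: the fixed square window, the `half` size of 128 pixels, the function names, and the `MemoryEntry` record are all hypothetical choices for illustration, and the screenshot is modeled as a plain list of pixel rows to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def action_crop_box(x: int, y: int, screen_w: int, screen_h: int,
                    half: int = 128) -> Box:
    """Box of up to 2*half x 2*half pixels around (x, y), shifted to stay on screen."""
    if not (0 <= x < screen_w and 0 <= y < screen_h):
        raise ValueError("action point lies outside the screen")
    w = min(2 * half, screen_w)
    h = min(2 * half, screen_h)
    # Centre on the action point, then clamp so the box never leaves the screen.
    left = min(max(x - half, 0), screen_w - w)
    top = min(max(y - half, 0), screen_h - h)
    return (left, top, left + w, top + h)


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the box out of a screenshot stored as rows of pixel values."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


@dataclass
class MemoryEntry:
    """Hypothetical memory record: the action plus its local visual context only."""
    action: str
    box: Box
    patch: List[List[int]]


def remember(action: str, x: int, y: int,
             pixels: List[List[int]], half: int = 128) -> MemoryEntry:
    """Build a memory entry from a screenshot and the point where the action landed."""
    box = action_crop_box(x, y, len(pixels[0]), len(pixels), half)
    return MemoryEntry(action=action, box=box, patch=crop(pixels, box))
```

Clamping the box rather than truncating it keeps every stored patch the same size, which is one simple way to handle clicks near a screen edge; a real system might instead size the crop to the bounding box of the target widget.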
This work gives practitioners a concrete rule of thumb: when building experiential memory for GUI agents, crop aggressively to the action footprint. Expect to see subsequent papers treat full-screenshot storage as a known anti-pattern rather than a default choice.
Source: Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents
Domain: arxiv.org