GUI Agents Need Crop-Based Memory: Full Screenshots Increase Grounding Errors

A new failure-mode study finds that naive visual memory hurts GUI agents, while action-grounded cropping raises task success by 33.3% on OSWorld.

gui agents, visual memory, os world, action grounded visual memory, failure mode analysis, artificial intelligence

Full-screenshot memory in GUI agents doesn't just fail to help; it actively increases grounding errors. A new study on OSWorld shows that switching from whole-screen memory to action-grounded image crops lifts task success rates by 33.3%.

Four Failure Modes in the Perception-Reasoning-Action Pipeline

The authors build a taxonomy of four GUI agent failures: cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error. Each maps to a distinct stage in the perception-reasoning-action loop. I've seen plenty of papers claim memory helps, but this one actually measures where it hurts.

Why Full-Image Memory Makes Action-Level Failures Worse

Prepending full screenshots to the agent's memory pushes the failure distribution in two directions. State-level failures drop, but action-level failures, specifically hidden operation blindness and grounding error, get worse. When memory stores everything the agent saw, the model struggles to pick out what matters for the next action.
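One way to check for this kind of shift in your own runs is to compare the share of action-level failures with and without memory. The sketch below is only an illustration: the label strings, the function name `action_level_share`, and the toy lists are assumptions made up for this example, not data or code from the paper. Only the grouping of hidden operation blindness and grounding error as action-level comes from the study.

```python
from collections import Counter

# Action-level failure modes as named in the article; the exact string
# spellings are hypothetical labels chosen for this sketch.
ACTION_LEVEL = frozenset({"hidden_operation_blindness", "grounding_error"})


def action_level_share(failure_labels):
    """Return the fraction of failures that are action-level (0.0 if there are none)."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    action = sum(n for label, n in counts.items() if label in ACTION_LEVEL)
    return action / total


# Toy label lists, invented purely to exercise the function.
no_memory = [
    "cognitive_failure",
    "visual_state_misunderstanding",
    "grounding_error",
    "cognitive_failure",
]
full_image = [
    "grounding_error",
    "hidden_operation_blindness",
    "grounding_error",
    "cognitive_failure",
]

print(action_level_share(no_memory), action_level_share(full_image))
```

A rising share under full-image memory would be consistent with the pattern the authors describe, though a share alone says nothing about absolute failure counts.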

AGMem: Cropping to What Matters

Action-Grounded Visual Memory (AGMem) replaces full screenshots with image crops that capture only the local GUI region tied to a successful action or recovery. This forces the agent to focus on the relevant spatial context. On OSWorld, AGMem achieved a 33.3% higher task success rate than the full-image baseline. The result is clean: remove visual noise, remove failure amplification.
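To make the cropping idea concrete, here is a minimal sketch of computing a crop rectangle around an action's screen coordinate. It is not the authors' implementation: the function name `crop_box`, the fixed square window, and the default `half=96` pixels are all assumptions chosen for illustration, and the article does not say how AGMem sizes its crops.

```python
def crop_box(x, y, screen_w, screen_h, half=96):
    """Return (left, top, right, bottom) around (x, y), shifted to stay on screen.

    `half` is a hypothetical half-width in pixels; the window shrinks only
    when the screen itself is smaller than the requested crop.
    """
    w = min(2 * half, screen_w)
    h = min(2 * half, screen_h)
    # Centre on the action point, then clamp so the box never leaves the screen.
    left = min(max(x - w // 2, 0), screen_w - w)
    top = min(max(y - h // 2, 0), screen_h - h)
    return (left, top, left + w, top + h)


# A click in the middle of a 1920x1080 screen, and one near the bottom-right corner.
print(crop_box(960, 540, 1920, 1080))    # (864, 444, 1056, 636)
print(crop_box(1915, 1075, 1920, 1080))  # (1728, 888, 1920, 1080)
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it could be passed straight to a screenshot loaded with that library. Shifting the box instead of truncating it keeps every stored crop the same size, which is a design choice of this sketch rather than something the paper specifies.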

This work gives practitioners a concrete rule of thumb: when building experiential memory for GUI agents, crop aggressively to the action footprint. Expect to see subsequent papers treat full-screenshot storage as a known anti-pattern rather than a default choice.


Source: Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents
Domain: arxiv.org
