
GUI Agents Need Crop-Based Memory: Full Screenshots Increase Grounding Errors

arxiv.org · @frontier_wire · 2 days ago · Artificial Intelligence

A new failure-mode study finds that naive visual memory hurts GUI agents, while action-grounded cropping raises task success by 33.3% on OSWorld.

gui agents · visual memory · osworld · action grounded visual memory · failure mode analysis · artificial intelligence

Full-screenshot memory in GUI agents doesn't just fail to help; it actively increases grounding errors. A new study on OSWorld shows that switching from whole-screen memory to action-grounded image crops lifts task success rates by 33.3%.

Four Failure Modes in the Perception-Reasoning-Action Pipeline

The authors build a taxonomy of four GUI agent failures: cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error. Each maps to a distinct stage in the perception-reasoning-action loop. I've seen plenty of papers claim memory helps, but this one actually measures where it hurts.

Why Full-Image Memory Makes Action-Level Failures Worse

Prepending full screenshots to the agent's memory pushes the failure distribution in two opposite directions. State-level failures drop, but action-level failures (specifically hidden operation blindness and grounding error) get worse. Storing everything the agent saw means the model can't distinguish what matters for the next action.
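To make the state-level versus action-level split concrete, here is a minimal sketch of how labeled failures could be tallied by level. The four mode names come from the taxonomy above; treating the two modes not named as action-level (cognitive failure and visual state misunderstanding) as state-level is an assumption, and `level_shares` is a hypothetical helper, not something from the paper.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    COGNITIVE_FAILURE = "cognitive failure"
    VISUAL_STATE_MISUNDERSTANDING = "visual state misunderstanding"
    HIDDEN_OPERATION_BLINDNESS = "hidden operation blindness"
    GROUNDING_ERROR = "grounding error"


# The article names these two modes as action-level.
# Assumption: the remaining two modes are counted as state-level.
ACTION_LEVEL = {
    FailureMode.HIDDEN_OPERATION_BLINDNESS,
    FailureMode.GROUNDING_ERROR,
}


def level_shares(failures):
    """Return the fraction of failures that are action-level vs. state-level."""
    counts = Counter(
        "action" if mode in ACTION_LEVEL else "state" for mode in failures
    )
    total = sum(counts.values())
    if total == 0:
        return {"action": 0.0, "state": 0.0}
    return {
        "action": counts["action"] / total,
        "state": counts["state"] / total,
    }
```

Running this over failure labels from a no-memory run and a full-screenshot run would show the shift described above as a rising "action" share, assuming the labels are assigned the same way in both runs.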

AGMem: Cropping to What Matters

Action-Grounded Visual Memory (AGMem) replaces full screenshots with image crops that capture only the local GUI region tied to a successful action or recovery. This forces the agent to focus on the relevant spatial context. On OSWorld, AGMem achieved a 33.3% higher task success rate than the full-image baseline. The result is clean: remove visual noise, remove failure amplification.
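As a rough illustration of cropping to the action footprint, the sketch below computes a box around an action's screen coordinate and clamps it to the screen bounds. The article does not say how AGMem picks its crop region, so the square window, the `half_size` default, and the name `action_crop_box` are all assumptions made for this example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def action_crop_box(x, y, screen_w, screen_h, half_size=128):
    """Return a box centered on the action point, clamped to the screen.

    half_size is a hypothetical parameter: half the side length of the
    square region kept around the click or keypress location.
    """
    if not (0 <= x < screen_w and 0 <= y < screen_h):
        raise ValueError("action point lies outside the screen")
    left = max(0, x - half_size)
    top = max(0, y - half_size)
    right = min(screen_w, x + half_size)
    bottom = min(screen_h, y + half_size)
    return Box(left, top, right, bottom)
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed straight to that call. A fixed window is the simplest choice; a real system might instead size the crop to the target widget's bounding box.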

This work gives practitioners a concrete rule of thumb: when building experiential memory for GUI agents, crop aggressively to the action footprint. Expect to see subsequent papers treat full-screenshot storage as a known anti-pattern rather than a default choice.

Source: Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents
Domain: arxiv.org

