PhoneHarness Benchmark Pushes Phone Agents Beyond Tap-and-Swipe GUI Control

A 75% pass rate on verifiable mobile workflows sounds decent until you realise that most phone-agent benchmarks never check whether the intended side effect happened. PhoneHarness, a mixed-action benchmark introduced in a paper on arXiv (2606.14832), changes that by forcing agents to combine GUI taps, CLI commands, and host-side tools, then auditing whether the observable side effects actually occur.

Why GUI-Only Benchmarks Miss the Point

Current mobile-agent literature mostly evaluates agents as blind GUI controllers: observe a screen, emit taps and swipes, score by target app state. That misses the broader reality of real phone-use tasks. You often need to decide when to use an app GUI, when to run a shell command, or when to call a structured API. And you need to leave evidence that the action actually took effect, not just that the screen looked right.

PhoneHarness runs on a real device with a device-side agent loop. It combines deterministic action routing with bounded GUI delegation and auditable execution traces. The harness makes mixed workflows executable; the benchmark measures whether agents can use that harness reliably and safely.
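One way to picture "bounded GUI delegation" is a GUI sub-loop that gets a fixed step budget and must hand control back when the budget runs out. The sketch below is only an illustration under that assumption; `delegate_gui`, `next_action`, `goal_reached`, and `max_steps` are hypothetical names, not taken from the paper.

```python
from typing import Callable, List


def delegate_gui(
    next_action: Callable[[List[str]], str],
    goal_reached: Callable[[List[str]], bool],
    max_steps: int = 5,
) -> dict:
    """Run a GUI sub-loop for at most max_steps actions (hypothetical sketch).

    next_action sees the trace so far and returns the next GUI action.
    goal_reached inspects the trace and reports whether the subtask is done.
    The full trace is returned either way so the run stays auditable.
    """
    trace: List[str] = []
    for _ in range(max_steps):
        trace.append(next_action(trace))
        if goal_reached(trace):
            return {"done": True, "steps": len(trace), "trace": trace}
    # Budget exhausted: give control back to the caller instead of looping on.
    return {"done": False, "steps": len(trace), "trace": trace}
```

A caller that receives `done: False` can fall back to another action surface or abort, rather than letting the GUI loop tap indefinitely.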

PhoneHarness Routes Actions Across Three Surfaces

PhoneHarness exposes three action surfaces: GUI (tap, swipe, type), CLI (adb shell commands, device-side scripts), and host-side tools (file operations, network calls, API invocations). The harness decides which surface to route an action to based on the agent's intent, while keeping a full audit log. An agent can no longer fake a text message by changing what the screen shows: the harness checks that the SMS actually left the device.
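The routing idea can be sketched as a small dispatcher that sends each action to exactly one surface handler and records every attempt. This is a minimal illustration, not the paper's implementation: it assumes the agent's intent arrives as an explicit `surface` label, and `Action`, `Router`, and `make_stub_router` are hypothetical names with stub handlers standing in for real device calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str                      # assumed labels: "gui", "cli", or "tool"
    name: str                         # e.g. "tap", "shell", "write_file"
    args: dict = field(default_factory=dict)


class Router:
    """Maps each action to one surface handler and logs every attempt."""

    def __init__(self, handlers: Dict[str, Callable[[Action], str]]):
        self.handlers = handlers
        self.audit_log: List[dict] = []

    def _record(self, action: Action, status: str, result) -> None:
        self.audit_log.append({
            "surface": action.surface,
            "name": action.name,
            "args": action.args,
            "status": status,
            "result": result,
        })

    def dispatch(self, action: Action) -> str:
        handler = self.handlers.get(action.surface)
        if handler is None:
            # Rejected actions are logged too, so the trace has no gaps.
            self._record(action, "rejected", None)
            raise ValueError(f"unknown surface: {action.surface}")
        result = handler(action)
        self._record(action, "ok", result)
        return result


def make_stub_router() -> Router:
    # Stub handlers only echo the action; a real harness would touch the device.
    return Router({
        "gui": lambda a: f"gui:{a.name}",
        "cli": lambda a: f"cli:{a.name}",
        "tool": lambda a: f"tool:{a.name}",
    })
```

Because every dispatch appends to `audit_log`, the log can be replayed afterwards to see which surface handled each step and which requests were refused.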

On the annotated evaluation split, PhoneHarness achieved a 75.0% pass rate. That beats the strongest non-PhoneHarness settings by 12.9 percentage points. The gain comes not from better visual perception but from the ability to mix GUI, CLI, and tool actions and to verify execution outcomes.

The 12.9 Point Gap Comes From Verifiable Execution

That gap matters because it isolates the real bottleneck in phone automation today: action-surface routing and execution verification, not GUI control alone. Agents that only predict screen actions hit a ceiling. Agents that can decide when to drop into a CLI or call a tool and then check that the side effect happened keep climbing.
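The difference between scoring the screen and scoring the side effect can be shown with a toy example. The sketch below is an assumption-laden illustration, not the benchmark's actual checker: `Outcome`, `screen_check`, `side_effect_check`, and `demo` are hypothetical names, and a file on disk stands in for whatever side effect a real task would produce.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Outcome:
    screen_text: str   # what the final screen appears to show
    output_path: str   # where the task was supposed to write a file


def screen_check(outcome: Outcome) -> bool:
    """Screen-style scoring: pass if the screen looks right."""
    return "Saved" in outcome.screen_text


def side_effect_check(outcome: Outcome, expected: str) -> bool:
    """Outcome-style scoring: pass only if the file exists with the expected content."""
    if not os.path.isfile(outcome.output_path):
        return False
    with open(outcome.output_path, encoding="utf-8") as fh:
        return fh.read() == expected


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        # Both runs end on the same screen, but only one wrote the file.
        faked = Outcome("Saved note.txt", os.path.join(tmp, "faked.txt"))
        real = Outcome("Saved note.txt", os.path.join(tmp, "real.txt"))
        with open(real.output_path, "w", encoding="utf-8") as fh:
            fh.write("buy milk")
        return {
            "faked_screen": screen_check(faked),
            "faked_effect": side_effect_check(faked, "buy milk"),
            "real_screen": screen_check(real),
            "real_effect": side_effect_check(real, "buy milk"),
        }
```

In this toy setup both runs pass the screen check, while only the run that actually wrote the file passes the side-effect check.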

PhoneHarness and PhoneHarness Bench are distinct but interdependent: the first is the execution harness, the second is the benchmark that scores agents running on it. Expect future mobile agents to be judged not by the screens they tap but by the files they create, the commands they run, and the apps they configure, the same way we evaluate any real automation system.

Source: PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions
Domain: arxiv.org
