Verifier-Driven Research: Glite ARF Slashes RMSE 35.9% With 12 Parallel Agents at $450

273 tracked tasks, 146 experiment runs, 129 feature sets, 12 parallel agents orchestrated from a single laptop, and a total third-party cost of $498: that is the tally from Glite ARF's winning run at the BEA 2026 vocabulary-difficulty shared task. The framework cut RMSE relative to the baseline by 29.9% in the closed track and 35.9% in the open track across Spanish, German, and Mandarin. It also caught four target-leaking feature sets that would have produced a misleadingly low RMSE of 0.609; with those sets removed, the corrected figure was a still-strong 0.802.

The Three-Role Stack That Makes Parallel Agents Scale

Glite ARF is an open-source Python framework that operationalizes what its authors call "verifier-driven research." Instead of asking LLM agents to follow a prose research protocol (which they inevitably violate at low rates that compound into broken artefacts), the framework codifies the rules as deterministic Python verifier scripts. These scripts enforce task isolation, immutability of completed work, a corrections overlay, and a materialized project overview, and they fail loudly rather than quietly.
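To make the idea of a deterministic verifier concrete, here is a minimal sketch of an immutability check for completed work. It is not Glite ARF's actual code: the function names (`snapshot`, `verify_immutable`, `enforce`) and the idea of a stored hash manifest per task directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(task_dir: Path) -> dict[str, str]:
    """Map each file under task_dir (relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(task_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(task_dir.rglob("*"))
        if p.is_file()
    }


def verify_immutable(task_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare a completed task directory against the manifest recorded at completion."""
    current = snapshot(task_dir)
    errors = []
    for name, digest in manifest.items():
        if name not in current:
            errors.append(f"missing: {name}")
        elif current[name] != digest:
            errors.append(f"modified: {name}")
    # Files that appeared after completion also count as violations.
    for name in current:
        if name not in manifest:
            errors.append(f"unexpected: {name}")
    return errors


def enforce(task_dir: Path, manifest: dict[str, str]) -> None:
    """Fail loudly: abort with a non-zero exit status if anything changed."""
    errors = verify_immutable(task_dir, manifest)
    if errors:
        raise SystemExit("immutability check failed:\n" + "\n".join(errors))
```

The point of the design is that the check is a pure function of the files on disk, so the same violation is reported no matter which agent caused it or how the agent describes its own work.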

The stack has three roles: a human researcher chooses hypotheses, coding agents (Claude Code and Codex CLI) implement individual tasks within a fixed structure, and the verifier scripts enforce the boundaries. This architectural constraint means you can throw multiple agents at a research repository (up to twelve in this campaign) without sacrificing reproducibility or auditability.
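A boundary of this kind can be checked mechanically. The sketch below illustrates one possible task-isolation rule: an agent working on a task may only touch files inside that task's own directory. The `tasks/<task_id>/` layout, the task IDs, and the `outside_task` function are hypothetical and are not taken from the framework.

```python
from pathlib import PurePosixPath


def outside_task(changed_paths, task_id, tasks_root="tasks"):
    """Return the changed paths that fall outside tasks_root/task_id."""
    allowed = PurePosixPath(tasks_root) / task_id
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Reject absolute paths and ".." segments before the containment test.
        escapes = path.is_absolute() or ".." in path.parts
        inside = allowed in path.parents
        if escapes or not inside:
            violations.append(raw)
    return violations


changed = [
    "tasks/t0042/run.py",
    "tasks/t0007/results.json",  # belongs to another task
    "README.md",                 # outside the tasks tree entirely
]
violations = outside_task(changed, "t0042")
```

Here `violations` contains the two paths that do not belong to task `t0042`. With twelve agents working at once, a rule like this is what would keep one agent's edits from silently altering another agent's results.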

Real Dollars, Real Provenance

Total LLM API spend ran about $450; the full third-party bill (including some model training on rented A100s) hit $498. The framework's structural machinery adds roughly 1% wall-clock overhead, measured across three campaigns in three domains. Per-fold provenance let the team spot and strip four target-leaking feature sets that would have made their results look implausibly good, a concrete example of why deterministic verification beats agentic compliance.
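The article does not describe how the leak was detected, so the following is only a sketch of what a provenance-based check could look like. It assumes each feature set records, per cross-validation fold, which row IDs it was fitted on, and it flags any feature set whose fitting rows overlap that fold's held-out rows. The data structures and the `leaking_feature_sets` function are illustrative assumptions.

```python
def leaking_feature_sets(provenance, held_out):
    """Flag feature sets fitted on rows that a fold is supposed to hold out.

    provenance: {feature_set_name: {fold_id: set of row IDs used for fitting}}
    held_out:   {fold_id: set of row IDs held out for evaluation in that fold}
    Returns {feature_set_name: {fold_id: sorted overlapping row IDs}}.
    """
    flagged = {}
    for name, per_fold in provenance.items():
        for fold_id, fit_rows in per_fold.items():
            overlap = fit_rows & held_out[fold_id]
            if overlap:
                flagged.setdefault(name, {})[fold_id] = sorted(overlap)
    return flagged


held_out = {0: {1, 2}, 1: {3, 4}}
provenance = {
    "clean": {0: {3, 4}, 1: {1, 2}},        # fitted only on training rows
    "leaky": {0: {1, 2, 3, 4}, 1: {1, 2}},  # fold 0 saw its own held-out rows
}
flagged = leaking_feature_sets(provenance, held_out)
```

In this toy data only the `leaky` feature set is flagged, and only for fold 0. A check of this shape depends on provenance being recorded for every fold, which is the property the article credits for catching the four leaking sets.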

Glite ARF ships with a public demo project and its own paper. If you're running empirical research with LLM coding agents and want to avoid the death-by-a-thousand-low-rate-violations spiral, this framework gives you the fail-fast enforcement that prose instructions never will.

Source: Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents
Domain: arxiv.org
