DocArena-79K turns 8,336 raw documents spanning 16 domains and 49 languages into a training environment for document search agents, and no human touched a single label.
Zero-Annotation Pipeline with Vision Backbone
Most search agent training relies on hand-curated (question, answer, evidence) tuples. DocArena removes that bottleneck. The pipeline uses a multimodal large language model (MLLM) for visual perception to structure and index raw documents, then profiles how information is distributed across pages to build reasoning-intensive QA pairs. A cascaded quality assurance step, also powered by an MLLM, filters the output. Result: 79,000 QA pairs with zero human annotation.
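The stages described above (perceive, generate cross-page QA candidates, filter in a cascade) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `QAPair`, `build_qa_pairs`, the `toy_*` stand-ins, and the two filter checks are all hypothetical names, and the MLLM calls are replaced by trivial string functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    evidence_pages: tuple[int, ...]  # indices of the pages the answer relies on


def build_qa_pairs(
    raw_pages: list[str],
    perceive: Callable[[str], str],
    generate: Callable[[dict[int, str]], list[QAPair]],
    filters: list[Callable[[QAPair, dict[int, str]], bool]],
) -> list[QAPair]:
    # Stage 1: turn each raw page into indexed, structured text.
    indexed = {i: perceive(page) for i, page in enumerate(raw_pages)}
    # Stage 2: propose candidate QA pairs from the indexed pages.
    candidates = generate(indexed)
    # Stage 3: cascade. A candidate must pass every check in order;
    # all() stops at the first failing check.
    return [qa for qa in candidates if all(f(qa, indexed) for f in filters)]


# --- Hypothetical stand-ins for the MLLM-backed stages ---
def toy_perceive(raw_page: str) -> str:
    return " ".join(raw_page.split())  # normalize whitespace only


def toy_generate(indexed: dict[int, str]) -> list[QAPair]:
    return [
        QAPair("Which vendor, and what total?", "Acme, 40 USD", (0, 1)),
        QAPair("What is the total?", "40 USD", (1,)),
        QAPair("Who signed the invoice?", "Bob", (0, 1)),
    ]


def spans_multiple_pages(qa: QAPair, indexed: dict[int, str]) -> bool:
    return len(qa.evidence_pages) >= 2


def answer_is_grounded(qa: QAPair, indexed: dict[int, str]) -> bool:
    evidence = " ".join(indexed[i] for i in qa.evidence_pages)
    return all(part.strip() in evidence for part in qa.answer.split(","))


kept = build_qa_pairs(
    ["Vendor:   Acme", "Total: 40 USD"],
    toy_perceive,
    toy_generate,
    [spans_multiple_pages, answer_is_grounded],
)
```

In this toy run only the first candidate survives: the second uses a single page and the third has an answer that appears nowhere in its evidence. Ordering cheap checks before expensive ones is the usual reason to cascade filters, though the checks the paper actually uses are not specified here.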
Decoupled Architecture for Multimodal Search
The paper's Doc-Search agent separates visual perception from the policy model. A text-based LLM acts as the reasoning backbone for retrieval and QA, while a separate vision module handles document layout and imagery. That decoupling lets any text-only LLM step into the role of the search agent without needing multimodal weights itself.
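One way to picture the decoupling is an agent whose policy only ever sees and emits text, while a separate perception function converts pages into text up front. The sketch below assumes a simple `SEARCH:` / `ANSWER:` action format and a keyword-overlap retriever; `DocSearchAgent`, `toy_perception`, and `toy_policy` are hypothetical names chosen for illustration, not the paper's API.

```python
from typing import Callable, Optional

Perception = Callable[[bytes], str]  # stands in for the vision module
Policy = Callable[[str], str]        # any text-in, text-out LLM


class DocSearchAgent:
    def __init__(self, policy: Policy, perception: Perception, pages: list[bytes]):
        self.policy = policy
        # Perception runs once, outside the policy; the policy never sees pixels.
        self.texts = [perception(page) for page in pages]

    def search(self, query: str) -> str:
        # Assumed retriever: pick the page sharing the most query terms.
        terms = query.lower().split()
        scores = [sum(t in text.lower() for t in terms) for text in self.texts]
        return self.texts[scores.index(max(scores))]

    def run(self, question: str, max_steps: int = 4) -> Optional[str]:
        context = f"Question: {question}"
        for _ in range(max_steps):
            action = self.policy(context)
            if action.startswith("ANSWER:"):
                return action[len("ANSWER:"):].strip()
            if action.startswith("SEARCH:"):
                hit = self.search(action[len("SEARCH:"):])
                context += f"\nObservation: {hit}"
        return None  # step budget exhausted without an answer


# --- Hypothetical stand-ins so the sketch runs without any model ---
def toy_perception(page: bytes) -> str:
    return page.decode("utf-8")


def toy_policy(context: str) -> str:
    if "Observation:" not in context:
        return "SEARCH: invoice total"
    return "ANSWER: " + context.rsplit("Observation: ", 1)[1]


agent = DocSearchAgent(
    toy_policy, toy_perception, [b"Vendor: Acme", b"Invoice total: 40 USD"]
)
answer = agent.run("What is the invoice total?")
```

Because `policy` is just a text-to-text callable, swapping in a different text-only model means passing a different function; the perception side is untouched.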
Beats Prior Methods Across 13 Benchmarks
Under a unified evaluation framework where only the policy model varies, agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. The experiments cover six multimodal document scenarios (think scanned forms, invoices, academic PDFs) and seven text-based QA benchmarks. The paper also reports that the constructed training environment makes agent search behavior controllable, which makes it useful not just for raising performance but for studying how search strategies develop.
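The idea of holding everything fixed except the policy can be illustrated with a small harness. This is a hedged sketch only: the function names, the toy benchmark, and the choice of exact-match scoring are assumptions for illustration, and the paper's actual metrics and benchmarks are not reproduced here.

```python
from typing import Callable

Policy = Callable[[str, str], str]  # (question, retrieved evidence) -> answer


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(
    policies: dict[str, Policy],
    benchmark: list[tuple[str, str]],
    retrieve: Callable[[str], str],
) -> dict[str, float]:
    # Same benchmark, same retriever, same metric for every policy:
    # the policy is the only thing that changes between rows.
    scores = {}
    for name, policy in policies.items():
        correct = sum(
            exact_match(policy(question, retrieve(question)), reference)
            for question, reference in benchmark
        )
        scores[name] = correct / len(benchmark)
    return scores


# --- Toy data and policies, purely illustrative ---
toy_corpus = {
    "Who is the vendor?": "Acme is the vendor.",
    "What is the total?": "The total is 40 USD.",
}
toy_benchmark = [("Who is the vendor?", "Acme"), ("What is the total?", "40 USD")]

scores = evaluate(
    {
        "first_word": lambda q, evidence: evidence.split()[0],
        "always_unknown": lambda q, evidence: "unknown",
    },
    toy_benchmark,
    toy_corpus.__getitem__,
)
```

Keeping the retriever and scorer outside the loop over policies is what makes the comparison attributable to the policy alone.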
DocArena opens a path to building search agents from any document collection without manual curation, turning MLLM perception into a data flywheel.
Source: DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents
Domain: arxiv.org