DocArena-79K: 8,336 Documents, 49 Languages, Zero Human Annotation for Search Agent Training

A fully automated pipeline uses multimodal LLMs to structure raw documents and generate reasoning-intensive QA pairs, beating prior methods on six multimodal search scenarios.


DocArena-79K packages 8,336 raw documents spanning 16 domains and 49 languages as a training environment for document search agents, and no human touched a single label.

Zero-Annotation Pipeline with Vision Backbone

Most search agent training relies on hand-curated (question, answer, evidence) tuples. DocArena removes that bottleneck. The pipeline uses a multimodal large language model (MLLM) for visual perception to structure and index raw documents, then profiles how information is distributed across pages to build reasoning-intensive QA pairs. A cascaded quality assurance step, also powered by an MLLM, filters the output. Result: 79,000 QA pairs with zero human annotation.
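To make the cascaded filtering idea concrete, here is a minimal sketch, not the paper's implementation. The `QAPair` fields and the three checks are hypothetical stand-ins; in the real pipeline the later stages would be MLLM judgments rather than string heuristics. The point is only the shape: cheap checks run first, and a candidate is dropped at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QAPair:
    """Hypothetical record for one generated QA candidate."""
    question: str
    answer: str
    evidence_pages: Tuple[int, ...]  # page indices the answer depends on


# A check returns True when the candidate passes that stage.
Check = Callable[[QAPair], bool]


def has_nonempty_fields(qa: QAPair) -> bool:
    # Stage 1 (cheap): reject blank questions or answers.
    return bool(qa.question.strip()) and bool(qa.answer.strip())


def is_cross_page(qa: QAPair) -> bool:
    # Stage 2 (assumed criterion): evidence must span at least two distinct pages.
    return len(set(qa.evidence_pages)) >= 2


def answer_not_leaked(qa: QAPair) -> bool:
    # Stage 3 (toy heuristic): the answer must not appear verbatim in the question.
    return qa.answer.lower() not in qa.question.lower()


DEFAULT_CHECKS: List[Check] = [has_nonempty_fields, is_cross_page, answer_not_leaked]


def cascade_filter(candidates: Iterable[QAPair], checks: List[Check]) -> List[QAPair]:
    """Keep only candidates that pass every check, evaluated in order."""
    kept = []
    for qa in candidates:
        # all() short-circuits, so later checks run only on survivors.
        if all(check(qa) for check in checks):
            kept.append(qa)
    return kept
```

Ordering the checks from cheapest to most expensive means a model-backed judge, if one were plugged in as the last stage, would only be called on candidates that already cleared the trivial tests.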

Decoupled Architecture for Multimodal Search

The paper's Doc-Search agent separates visual perception from the policy model. A text-based LLM acts as the reasoning backbone for retrieval and QA, while a separate vision module handles document layout and imagery. That decoupling lets any text-only LLM step into the role of the search agent without needing multimodal weights itself.
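The decoupling can be pictured as two interfaces that only exchange text. The sketch below is an illustration under assumptions, not the Doc-Search code: `Perception`, `StubPerception`, `run_agent`, and `keyword_policy` are all hypothetical names, and the stub simply looks up canned page descriptions where a real vision module would read the page image.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class Perception(Protocol):
    """Anything that can turn a page into a text description."""

    def describe(self, page_id: str) -> str: ...


@dataclass
class StubPerception:
    """Stand-in for a vision module: maps page ids to canned descriptions."""
    descriptions: Dict[str, str]

    def describe(self, page_id: str) -> str:
        return self.descriptions[page_id]


# A policy sees only text: (question, text observations) -> answer.
Policy = Callable[[str, List[str]], str]


def run_agent(question: str, page_ids: List[str],
              perception: Perception, policy: Policy) -> str:
    # The perception module converts pages to text; the policy never sees pixels.
    observations = [perception.describe(pid) for pid in page_ids]
    return policy(question, observations)


def keyword_policy(question: str, observations: List[str]) -> str:
    # Toy text-only policy: return the observation sharing the most words
    # with the question. A text-only LLM would slot in here instead.
    q_words = set(question.lower().split())
    return max(observations, key=lambda o: len(q_words & set(o.lower().split())))
```

Because `run_agent` depends only on the `Policy` signature, swapping in a different text-only model means passing a different callable; the perception side stays untouched.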

Beats Prior Methods Across 13 Benchmarks

Under a unified evaluation framework where only the policy model varies, agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. The experiments cover six multimodal document scenarios (think scanned forms, invoices, academic PDFs) and seven text-based QA benchmarks, 13 in total. The paper also reports that the constructed training environment makes agent search behaviors controllable, which makes it useful not just for performance but for understanding how search strategies develop.
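A "only the policy varies" comparison boils down to holding the dataset and the metric fixed while looping over policies. The following is a rough sketch of that setup, not the paper's evaluation harness: the exact-match metric, the `evaluate` function, and the toy policies are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A policy maps a question to an answer string.
Policy = Callable[[str], str]


def exact_match(pred: str, gold: str) -> bool:
    # Case- and whitespace-insensitive string equality (assumed metric).
    return pred.strip().lower() == gold.strip().lower()


def evaluate(policies: Dict[str, Policy],
             dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score every policy on the same (question, gold answer) pairs."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = {}
    for name, policy in policies.items():
        # Same data and same metric for every policy; only the policy changes.
        correct = sum(exact_match(policy(q), gold) for q, gold in dataset)
        scores[name] = correct / len(dataset)
    return scores
```

Keeping everything but the policy constant is what lets a score difference be attributed to the policy, and hence to the data it was trained on, rather than to the retriever or the metric.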

DocArena opens a path to building search agents from any document collection without manual curation, turning MLLM perception into a data flywheel.

Source: DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents
Domain: arxiv.org
