STRUCTSURVEY Boosts Survey Paper ROUGE by +2.9 by Shifting Reasoning to Retrieval

ROUGE-1 recall rises by 2.9 and ROUGE-2 by 1.0 on average, with no loss in precision, when embedding-only survey retrieval is replaced with explicit graph-based structural reasoning. That's the headline result of a new arXiv paper that upends the standard approach to automated survey generation.

Why Graph-Based Retrieval Beats Embedding-Only Searches

Existing LLM-based survey generators retrieve unstructured text and then force the model to infer conceptual relationships, methodological connections, and taxonomic hierarchies on the fly during generation. That's backward: you're asking a decoder to do structural reasoning without any structural scaffolding. The authors built STRUCTSURVEY, a hierarchical multi-agent framework that reverses the order: it dynamically constructs a graph of entities, relations, and topical taxonomies during retrieval. The reasoning happens there, not inside the LLM's context window.

The benchmark is new too: a reference-grounded set of ACL survey papers designed for reproducible long-form scientific summarization. Against vanilla dense-retrieval baselines, the improvement holds across both ROUGE metrics and a more qualitative LLM-as-a-Judge evaluation. The judges scored STRUCTSURVEY higher on logical structure, depth, and synthesis, which suggests the output reads more like a human-organized survey than a stream-of-consciousness dump.
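For readers unfamiliar with the metric, here is a minimal sketch of how ROUGE-N recall and precision can be computed from clipped n-gram overlap. It assumes plain whitespace tokenization and lowercasing; the function names `ngrams` and `rouge_n` are illustrative, and this is not the paper's evaluation script, which may apply stemming or other preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision) based on clipped n-gram overlap.

    Assumption: whitespace tokenization and lowercasing only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


reference = "graph retrieval improves survey structure"
candidate = "graph retrieval improves recall"
print(rouge_n(candidate, reference, n=1))  # (0.6, 0.75)
print(rouge_n(candidate, reference, n=2))  # recall 0.5, precision 2/3
```

Recall divides the overlap by the reference length and precision divides it by the candidate length, which is why a system can raise one without necessarily lowering the other.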

How the Multi-Agent Framework Works

The core idea is agentic retrieval. Instead of one monolithic retriever, multiple agents specialize: one extracts named entities, another maps out relation triples, a third builds the taxonomy tree. These agents coordinate to produce a structured knowledge graph that the final summarization agent consumes. The shift from “reasoning at generation time” to “reasoning at retrieval time” is what lets the system cite and organize more faithfully, rather than hallucinating connections that the raw text never supported.
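The division of labor described above can be sketched as three small functions feeding one graph object. This is a toy illustration under heavy assumptions: the article does not describe the agents' internals, so each "agent" here is a hypothetical rule-based stand-in (vocabulary matching, cue phrases, a lookup table) where the real system presumably calls an LLM. All names (`KnowledgeGraph`, `entity_agent`, `relation_agent`, `taxonomy_agent`, `build_graph`) are invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)   # (head, relation, tail) triples
    taxonomy: dict = field(default_factory=dict)  # topic -> sorted list of entities


def mentions(term, text):
    """True if `term` appears in `text` as a whole word or phrase."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def entity_agent(chunks, vocabulary):
    """Hypothetical stand-in: match a fixed vocabulary instead of calling an LLM."""
    return {t for chunk in chunks for t in vocabulary if mentions(t, chunk.lower())}


def relation_agent(chunks, entities, cues=("extends", "uses")):
    """Hypothetical stand-in: emit a triple when 'head cue tail' appears verbatim."""
    triples = set()
    for chunk in chunks:
        text = chunk.lower()
        for head in entities:
            for tail in entities:
                if head == tail:
                    continue
                for cue in cues:
                    if mentions(f"{head} {cue} {tail}", text):
                        triples.add((head, cue, tail))
    return triples


def taxonomy_agent(entities, topic_of):
    """Hypothetical stand-in: group entities by a supplied topic lookup."""
    tree = {}
    for entity in sorted(entities):
        tree.setdefault(topic_of.get(entity, "other"), []).append(entity)
    return tree


def build_graph(chunks, vocabulary, topic_of):
    """Run the three agents in sequence and assemble the graph at retrieval time."""
    entities = entity_agent(chunks, vocabulary)
    relations = relation_agent(chunks, entities)
    return KnowledgeGraph(entities, relations, taxonomy_agent(entities, topic_of))


chunks = [
    "RAG uses dense retrieval to ground generation.",
    "GraphRAG extends RAG with a knowledge graph.",
]
vocabulary = {"rag", "graphrag", "dense retrieval"}
topic_of = {"rag": "generation", "graphrag": "generation", "dense retrieval": "retrievers"}
graph = build_graph(chunks, vocabulary, topic_of)
print(sorted(graph.relations))
print(graph.taxonomy)
```

The point of the sketch is the ordering: the graph is complete before any summarization step runs, so a downstream generator would receive entities, triples, and a topic tree rather than raw chunks.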

Crucially, precision didn't drop. Recall and precision usually trade off against each other, but here recall rises while precision holds, because the graph acts as a filter, not a bottleneck. Every retrieved chunk is grounded in the source papers via the graph edges.
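One way to picture "graph as filter" is a rule that keeps a retrieved chunk only if it supports at least one edge in the graph. The sketch below assumes a deliberately simple support test (the chunk mentions both endpoints of a triple); the article does not specify the actual grounding criterion, and `filter_chunks` is a hypothetical name.

```python
def filter_chunks(chunks, triples):
    """Keep chunks that mention both endpoints of at least one graph edge.

    Assumption: simple lowercase substring matching stands in for whatever
    grounding check the real system performs.
    """
    kept = []
    for chunk in chunks:
        text = chunk.lower()
        if any(head in text and tail in text for head, _, tail in triples):
            kept.append(chunk)
    return kept


triples = {("graphrag", "extends", "rag")}
chunks = [
    "GraphRAG extends RAG with a knowledge graph.",
    "Survey writing is slow.",
]
print(filter_chunks(chunks, triples))  # ['GraphRAG extends RAG with a knowledge graph.']
```

Under this rule, unsupported chunks are dropped before generation, while every chunk tied to an edge is retained.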

What This Means for Scientific Literature Mining

Survey papers are the lifeblood of fast-moving fields like AI, but keeping them current is a human bottleneck. STRUCTSURVEY won't replace the domain expert who decides what's important, but it demonstrates that you can automate the structural organization that typically separates a good survey from a mediocre, keyword-stuffed one. The next step is combining this graph retrieval with multi-document citation tracking and live preprint feeds. Imagine a survey that updates itself weekly, with new papers slotted into the right sub-topic without regenerating the whole thing.

Source: STRUCTSURVEY: Structured Agentic Retrieval for Automated Survey Paper Generation
Domain: arxiv.org
