Best AI Agent Scores 0.337 F1 on Scientific Conclusion Synthesis

The best AI agent tested on synthesizing scientific conclusions from multiple sources reaches a factual F1 of only 0.337 under a clean-room evaluation designed to strip out data leakage. That's the headline from SciConBench, a new benchmark posted on arXiv (2606.11337) that pits 8 frontier models and deep research agents against 9,110 expert-written questions and conclusions drawn from systematic reviews.

Clean-Room Evaluation Exposes Leakage Inflation

SciConBench's authors built an automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To stop models from gaming the test by reproducing previously seen answers, they also released SciConHarness, a controlled web-interaction harness designed to cut off that exposure. The result: every model's performance dropped under clean-room conditions compared to standard evaluation, which suggests that scores from unconstrained evaluation are inflated by data leakage rather than reflecting true synthesis skill.
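The precision-and-recall scoring over atomic facts can be sketched roughly as follows. This is a minimal illustration, not the SciConBench pipeline: it assumes the conclusions have already been split into atomic facts, and it matches facts by normalized exact string equality as a stand-in for whatever matching the authors' automated pipeline performs. The function names and the example facts are hypothetical.

```python
def normalize(fact: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(fact.lower().split())


def factual_prf(predicted: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over two lists of atomic facts.

    Assumption: two facts match only if they are identical after normalization.
    A real pipeline would need a semantic matcher here.
    """
    pred = {normalize(f) for f in predicted}
    ref = {normalize(f) for f in reference}
    matched = pred & ref

    # Precision: share of the agent's facts that appear in the reference conclusion.
    precision = len(matched) / len(pred) if pred else 0.0
    # Recall: share of the reference facts that the agent recovered.
    recall = len(matched) / len(ref) if ref else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one of three predicted facts matches one of four reference facts.
predicted = [
    "Drug A reduces mortality",
    "Drug A increases nausea",
    "Evidence quality is high",
]
reference = [
    "drug A reduces mortality",
    "Drug A has no effect on hospital stay",
    "Evidence quality is low",
    "More trials are needed",
]
precision, recall, f1 = factual_prf(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching keeps the sketch self-contained, but it would undercount paraphrased facts, which is one reason a real evaluation needs something stronger.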

Best Agent Hits a Wall at 0.337 F1

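Under constrained web interaction, the top-performing agent managed only 0.337 factual F1. Because F1 is the harmonic mean of factual precision and recall, a score of 0.337 means at least one of the two is at or below 0.337: either the best system misses roughly two-thirds or more of the atomic facts it should capture, or roughly two-thirds or more of the facts it states are not supported by the reference conclusion, or both. The abstract doesn't name the winning model, but the drop from unconstrained to clean-room scores showed up across all 8 tested systems.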
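To see which precision and recall pairs are compatible with an F1 of 0.337, the harmonic-mean formula F1 = 2PR / (P + R) can be solved for recall. The sketch below is only arithmetic on the reported score: the precision values are hypothetical, since the abstract as summarized here does not give the precision/recall split, and the function name is made up for illustration.

```python
def recall_given_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P.

    Rearranging gives R = F1 * P / (2P - F1), which only has a valid
    solution when 2P > F1.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no recall value yields this F1 at the given precision")
    recall = f1 * precision / denominator
    if recall > 1:
        raise ValueError("this F1 would require recall above 1 at the given precision")
    return recall


REPORTED_F1 = 0.337  # figure from the article; the precision values below are hypothetical

for precision in (1.0, 0.5, 0.337):
    recall = recall_given_f1(REPORTED_F1, precision)
    print(f"precision={precision:.3f} -> recall={recall:.3f}")
```

Even in the most generous case, where every stated fact is correct (precision of 1.0), the recall implied by an F1 of 0.337 is about 0.20.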

Consumer-Facing Tools Also Fall Short

The team also audited public tools like Google AI Overview and OpenEvidence. Even when ground-truth answers were readily available in the retrieved evidence, these systems frequently generated incomplete conclusions, and sometimes outright contradictory ones. For anyone using AI to inform health or policy decisions, this is a flashing red light.

Reliable synthesis of scientific conclusions is not just unsolved; the current generation of agents can't even pass a rigorous bar set by human-curated systematic reviews. The authors are right: clean-room evaluation has to become standard before we trust any agent to summarize science for us.

Source: Can AI Agents Synthesize Scientific Conclusions?
Domain: arxiv.org
