OpenAI's GeneBench-Pro Puts AI Biology Agents Through 129 Judgment-Heavy Trials

OpenAI released GeneBench-Pro today, a benchmark of 129 synthetic problems that measure whether AI agents can make the judgment calls real computational biologists face every day. It's not another recall test. GeneBench-Pro targets what the team calls “research taste” — the chain of decisions about whether a pattern is signal or noise, which question the data can support, and when an initial plan needs revising.

What “Research Taste” Actually Means

Most biology benchmarks test recall or step-following. GeneBench-Pro drops an agent into a realistic, messy dataset with a prompt and a target estimand tied to a downstream decision. The agent must explore the data, pick an analytical approach, iterate, and deliver a final answer. No single correct path exists: agents may, for example, choose different defensible cutoffs. Because the benchmark is built synthetically, the full causal structure is known, so answers reached through reasonable subjective choices are still accepted, while fundamentally wrong analyses fail.

Each problem is audited for information leakage and unintended shortcuts. Eighty-two of the 129 problems were sent to external domain experts (graduate students, postdocs, industry scientists, and professors), who reviewed realism, answer identifiability, and appropriateness of methods.

Why Synthetic Data Beats Historical Benchmarks

Historical benchmarks built from real datasets carry hidden artifacts: arbitrary author preferences baked into the “correct” answer, or answers so numerically insensitive that analytic errors go undetected. GeneBench-Pro sidesteps both by simulating the data-generating process. OpenAI's team tunes complexity per problem and verifies through ablation studies that plausible but incorrect analyses fail. That gives confidence that a passing score means the agent actually followed a sound analytic pathway, not that it exploited a shortcut or made a lucky guess.
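
To make the idea concrete, here is a toy sketch of that kind of check. It is not OpenAI's actual ablation procedure: the stratum counts, the 0.02 tolerance, and every function name are invented for illustration. Because the data are constructed, the true risk difference (0.10 in each stratum) is known in advance, so a naive pooled comparison that ignores the confounder can be shown to fail while a standardized estimate passes.

```python
# Toy synthetic cohort with one binary confounder (e.g. disease severity).
# Each arm is stored as (number of patients, number with clinical benefit).
# All counts are invented for illustration.
STRATA = [
    {"treated": (200, 120), "control": (800, 400)},  # low severity
    {"treated": (800, 240), "control": (200, 40)},   # high severity
]

TRUE_RD = 0.10    # known by construction: +0.10 in every stratum
TOLERANCE = 0.02  # hypothetical acceptance band around the truth


def naive_rd(strata):
    """Pooled risk difference that ignores the confounder."""
    t_n = sum(s["treated"][0] for s in strata)
    t_e = sum(s["treated"][1] for s in strata)
    c_n = sum(s["control"][0] for s in strata)
    c_e = sum(s["control"][1] for s in strata)
    return t_e / t_n - c_e / c_n


def standardized_rd(strata):
    """Marginal risk difference, weighting each stratum by its share of the cohort."""
    total = sum(s["treated"][0] + s["control"][0] for s in strata)
    rd = 0.0
    for s in strata:
        size = s["treated"][0] + s["control"][0]
        stratum_rd = (s["treated"][1] / s["treated"][0]
                      - s["control"][1] / s["control"][0])
        rd += (size / total) * stratum_rd
    return rd


def passes(estimate, truth=TRUE_RD, tol=TOLERANCE):
    """Accept any estimate that lands within the tolerance of the known truth."""
    return abs(estimate - truth) <= tol


if __name__ == "__main__":
    for name, fn in [("naive", naive_rd), ("standardized", standardized_rd)]:
        est = fn(STRATA)
        print(f"{name}: estimate={est:.3f} accepted={passes(est)}")
```

In this toy cohort the naive estimate comes out at about -0.08, with the wrong sign, because sicker patients were more likely to be treated. The standardized estimate recovers 0.10. A tolerance band of this kind is one way a grader could accept different defensible analyses while rejecting a plausible but wrong one.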

How the Benchmark Works and What It Tests

Agents get an isolated workspace with a short prompt, data files, and a standard bioinformatics stack (Python, PLINK 2.0, scientific libraries). One sample problem: from a molecular tumor board registry, estimate the marginal effect of a TXR1-directed inhibitor versus non-TXR1 therapy on week-16 clinical benefit, then compute net clinical utility = benefit risk difference − 0.35 × toxicity risk, and finally choose a therapy class code based on positive net utility. The agent must handle ambiguity, revise assumptions, and decide if the data can support the question — exactly the judgment calls that constrain AI performance in real research.
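
The arithmetic in the last two steps of that sample problem can be sketched as follows. This is a minimal illustration that assumes the benefit risk difference and the toxicity risk have already been estimated. The class code strings and function names are hypothetical; only the 0.35 weight comes from the problem statement.

```python
from dataclasses import dataclass

TOXICITY_WEIGHT = 0.35  # weight quoted in the sample problem


@dataclass(frozen=True)
class Decision:
    net_utility: float
    therapy_class: str


def net_clinical_utility(benefit_rd: float, toxicity_risk: float,
                         weight: float = TOXICITY_WEIGHT) -> float:
    """Net clinical utility = benefit risk difference - weight * toxicity risk."""
    return benefit_rd - weight * toxicity_risk


def choose_therapy(benefit_rd: float, toxicity_risk: float,
                   treat_code: str = "TXR1_INHIBITOR",  # hypothetical code
                   fallback_code: str = "NON_TXR1"      # hypothetical code
                   ) -> Decision:
    """Pick the TXR1-directed class only when net utility is strictly positive."""
    utility = net_clinical_utility(benefit_rd, toxicity_risk)
    code = treat_code if utility > 0 else fallback_code
    return Decision(net_utility=utility, therapy_class=code)


if __name__ == "__main__":
    # Illustrative inputs, not values from the benchmark.
    print(choose_therapy(benefit_rd=0.12, toxicity_risk=0.20))
    print(choose_therapy(benefit_rd=0.05, toxicity_risk=0.20))
```

With these illustrative inputs, the first call has a net utility of about 0.05 and selects the TXR1-directed class, while the second has a net utility of about -0.02 and falls back to the other class. The hard part of the benchmark problem is estimating the inputs correctly from messy registry data; the final formula is the easy step.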

If GeneBench-Pro gains traction, expect it to become the filter that decides which agents get trusted with real experimental design and which stay in the sandbox.

Source: Introducing GeneBench-Pro
Domain: openai.com
