OpenAI put 19,020 rubric criteria on the table with LifeSciBench, a benchmark that asks AI not to recall facts but to behave like a Ph.D.-level collaborator. Spread over 750 tasks, that works out to roughly 25 granular scoring criteria per task, drawing on 453 expert reviewers, with at least 90% agreement required before a task was accepted.
Most biology benchmarks test narrow skills: answer a multiple-choice question, predict a binding affinity, recall a pathway. LifeSciBench throws that out. Its 750 tasks span seven workflows (evidence handling, analysis, design, reasoning, validation, translation, and communication) across seven biological domains. Some 79% of tasks require multiple reasoning steps, averaging four steps per task. And 53% of tasks demand that the model interpret figures, PDFs, tables, sequence files, or chemical structures, not just read prompt text.
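The headline figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above. The variable names are illustrative, and because the percentages are rounded, the derived task counts are approximate.

```python
# Figures quoted in the article.
TOTAL_CRITERIA = 19_020
TOTAL_TASKS = 750
MULTI_STEP_PCT = 79      # percent of tasks needing multiple reasoning steps
FILE_BASED_PCT = 53      # percent of tasks that require interpreting files

# Average rubric criteria per task: a little over the quoted "25".
criteria_per_task = TOTAL_CRITERIA / TOTAL_TASKS

# Approximate task counts implied by the rounded percentages.
multi_step_tasks = MULTI_STEP_PCT * TOTAL_TASKS / 100
file_based_tasks = FILE_BASED_PCT * TOTAL_TASKS / 100

print(f"criteria per task: {criteria_per_task:.2f}")
print(f"multi-step tasks:  ~{multi_step_tasks:.0f}")
print(f"file-based tasks:  ~{file_based_tasks:.0f}")
```

So "25 criteria per task" is a rounded-down average, not a fixed rubric length.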
Why 19,020 Rubric Criteria Matter
A single rubric breakdown might check a conclusion, then check whether the model flagged an assay limitation, then verify that it mentioned a biological nuance that could tank the approach. That's the difference between a correct answer and a useful one. OpenAI's scientists designed each task as a request you'd give a knowledgeable colleague - something like "critique our FDA meeting package for a micro-dystrophin gene therapy." No clean reference answer. The rubric scores the reasoning, the caveats, the operational soundness.
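To make the idea of criterion-level grading concrete, here is a minimal sketch of how a rubric like that could be turned into a score. The `Criterion` structure, the point weights, and the earned-over-possible formula are all assumptions made for illustration. The article does not describe LifeSciBench's actual rubric schema or grading formula.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line item (hypothetical schema, not LifeSciBench's)."""
    description: str
    points: int   # assumed weight of this criterion
    met: bool     # grader's judgment for a given model response


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the fraction of available points the response earned."""
    total = sum(c.points for c in criteria)
    if total == 0:
        raise ValueError("rubric has no scorable points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


# A response that reaches the right conclusion but misses a caveat.
example = [
    Criterion("States the correct conclusion", points=3, met=True),
    Criterion("Flags the assay limitation", points=2, met=False),
    Criterion("Mentions the relevant biological nuance", points=2, met=True),
]

print(f"rubric score: {rubric_score(example):.2f}")
```

Under these assumed weights, a correct conclusion alone earns less than half the points. That is the sense in which such a rubric rewards a useful answer over a merely correct one.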
173 scientists with Ph.D.-level training and industry biotech experience authored the tasks. Each task went through at least two rounds of expert review and averaged six self-directed automated review cycles. That's a lot of vetting for a dataset that explicitly targets the fuzzy, high-stakes judgments in drug discovery.
What LifeSciBench Actually Tests
LifeSciBench measures whether an AI can handle incomplete evidence, reconcile conflicting results, design tough experiments, and decide what to do next under uncertainty. One example: preparing for a Type B FDA meeting on an AAV9-based micro-dystrophin gene therapy, then delivering a hard-nosed critique of the package's support for a surrogate endpoint. The model must interpret pre- and post-treatment biopsy data, Western blot images, and regulatory context - then produce a response that a real scientist would find useful.
This is not a benchmark you can game with retrieval-augmented generation alone. The model has to reason over supporting files, not just text. It has to exercise scientific judgment. OpenAI is betting that the kind of rigor needed to score well on LifeSciBench correlates with real-world research competence.
The Data Pipeline Behind the Scores
Every accepted task required at least 90% inter-reviewer agreement, anchored in either a verifiable answer or strong expert consensus. That filter is there to screen out ambiguous tasks. The resulting 1,062 task artifacts - figures, tables, PDFs, structure files, web references - come with rubrics that make the evaluation reproducible and transparent. If you're building an agentic AI for drug discovery, this is the eval you should be optimizing for, not MMLU-biology.
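The acceptance rule can be sketched as a simple threshold check. The article gives the 90% bar but not how agreement is computed, so this sketch assumes it is the share of reviewers who gave the most common verdict. The function names and verdict labels are hypothetical.

```python
from collections import Counter


def agreement_fraction(verdicts: list[str]) -> float:
    """Share of reviewers who gave the most common verdict (assumed metric)."""
    if not verdicts:
        raise ValueError("at least one reviewer verdict is required")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def accept_task(verdicts: list[str], threshold: float = 0.90) -> bool:
    """Keep a task only if reviewer agreement meets the threshold."""
    return agreement_fraction(verdicts) >= threshold


# Nine of ten reviewers agree: exactly at the 90% bar, so the task is kept.
print(accept_task(["pass"] * 9 + ["fail"]))
# Eight of ten agree: below the bar, so the task is dropped.
print(accept_task(["pass"] * 8 + ["fail"] * 2))
```

A real pipeline would probably measure agreement per criterion or against a reference answer. The point here is only that the filter is a hard cutoff, not a soft weighting.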
LifeSciBench is open, and the paper details the taxonomy, construction process, and grading methodology. The next step is seeing how today's frontier models perform against rubrics that average about 25 criteria each. I'd bet most fall short on the caveats and biological nuance that the reviewers demand. That's exactly the point - a benchmark that forces AI to earn its place at the lab bench.
Source: Introducing LifeSciBench
Domain: openai.com