23.2M Biomedical Abstracts Now Structured via LLM Pipeline

Structured PubMed labels 17.2 million formerly unstructured abstracts with a unified five-section schema, enabling PubMed-wide text mining at scale.

structured pubmed, pubmed, biomedical literature, llm pipeline, text mining, large language models

23.2 million biomedical abstracts — 17.2 million of which were previously unstructured — now sit in a unified five-section schema thanks to Structured PubMed.

That’s every research-article abstract in PubMed, not a sample.

17.2 Million Abstracts Labeled by an LLM, Not Human Annotators

The authors built two subsets: 5.9 million author-structured abstracts parsed directly from official PubMed XML files, and 17.2 million originally unstructured abstracts run through a verbatim-extraction large language model (LLM) pipeline. The LLM doesn’t rewrite; it assigns each sentence, unchanged, to a section heading drawn from a standard five-section schema (Introduction, Methods, Results, Discussion, Conclusion or equivalent). The result is a flat, machine-readable corpus at a scale that manual annotation could not realistically reach.
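To make "verbatim extraction" concrete, here is a minimal sketch of the kind of check a consumer could run on a labeled abstract. It confirms that every label comes from the schema and that every labeled sentence appears, in order and unmodified, in the original text. The section names, the (label, sentence) pair format, and the function names are assumptions for illustration, not the dataset's actual layout or the authors' code.

```python
import re

# Hypothetical label set; the real schema's names may differ.
SECTIONS = ("INTRODUCTION", "METHODS", "RESULTS", "DISCUSSION", "CONCLUSION")


def normalize(text: str) -> str:
    """Collapse whitespace so spacing differences do not count as rewrites."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(abstract: str, labeled: list[tuple[str, str]]) -> bool:
    """Return True if all labels are valid and all sentences occur in order in the abstract."""
    haystack = normalize(abstract)
    pos = 0
    for label, sentence in labeled:
        if label not in SECTIONS:
            return False
        needle = normalize(sentence)
        idx = haystack.find(needle, pos)  # search only after the previous match
        if idx == -1:
            return False
        pos = idx + len(needle)
    return True
```

A labeling that paraphrases a sentence, reorders sentences, or invents a section name fails this check, which is the property that separates labeling from rewriting.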

Every record retains its original PubMed identifier, publication type, and date. That means downstream tools can join this structured data back to any metadata or full-text resource without friction.
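As a rough illustration of that join, the sketch below merges structured records with an external metadata table keyed on the PubMed identifier. The field names ("pmid", "sentences", "journal") and the list-of-dicts layout are hypothetical; the released files may use different names and formats.

```python
def join_on_pmid(structured: list[dict], metadata: list[dict]) -> list[dict]:
    """Attach metadata fields to each structured record that shares a PMID."""
    # Index the metadata once so each lookup is constant time.
    by_pmid = {row["pmid"]: row for row in metadata}
    joined = []
    for rec in structured:
        extra = by_pmid.get(rec["pmid"])
        if extra is not None:  # inner join: skip records with no metadata match
            joined.append({**extra, **rec})
    return joined


# Hypothetical usage with toy records.
structured = [{"pmid": "111", "sentences": [("RESULTS", "Recall improved.")]}]
metadata = [{"pmid": "111", "journal": "Example Journal"}]
rows = join_on_pmid(structured, metadata)
```

Because the identifier is carried through unchanged, the same pattern applies to any other resource indexed by PMID.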

Why a Unified Schema Matters for Biomedical Text Mining

Information retrieval and knowledge synthesis on PubMed have always been bottlenecked by the mess of unstructured abstracts. Sentence-classification models needed hand-labeled training data for each new task. Text-segmentation architectures had no consistent ground truth at scale. Section-specific extraction — pulling only the Methods sentence about sample size, for example — required bespoke heuristics.

Structured PubMed eliminates that grunt work. You can train a sentence classifier on labeled sentences drawn from 23.2 million abstracts, benchmark your segmentation model against the 5.9 million author-structured abstracts (the gold-standard subset), or extract every mention of a compound from the Results sections alone. The dataset is already parsed and harmonized; the only step left is building your application on top.
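The Results-only extraction case might look something like the sketch below. It assumes each record carries a "pmid" and a list of (label, sentence) pairs with a "RESULTS" label; those names are illustrative guesses, not the dataset's documented fields, and a simple whole-word match stands in for a real entity recognizer.

```python
import re
from typing import Iterable, Iterator


def results_mentions(records: Iterable[dict], term: str) -> Iterator[tuple[str, str]]:
    """Yield (pmid, sentence) for Results sentences that mention `term` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    for rec in records:
        for label, sentence in rec["sentences"]:
            # Only the section label decides whether a sentence is considered.
            if label == "RESULTS" and pattern.search(sentence):
                yield rec["pmid"], sentence
```

With section labels already attached, the filter is a single comparison per sentence; no per-task heuristics for locating the Results section are needed.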

What This Unlocks Next

Expect to see off-the-shelf models fine-tuned on this corpus for literature search, systematic review automation, and clinical decision support. The authors provide the dataset under open terms, so any lab or company with a GPU can start tomorrow. The era of having to restructure PubMed before running your experiment is over.


Source: A PubMed-Scale Dataset of Structured Biomedical Abstracts
Domain: arxiv.org
