Self-Evolving Data Prep System DataEvolver Beats Manual Curation by 10%

A new system called DataEvolver builds and refines autonomous data preparation pipelines, yielding an average improvement in downstream LLM performance across seven benchmarks without human intervention.

DataEvolver delivers a 10% average gain across seven LLM benchmarks, measured against training on the original unfiltered data, by automating the entire data preparation pipeline, a task that normally requires months of manual curation.

Most automatic data preparation methods today rely on fixed pipelines or hand-written instructions that break when data distributions shift. The authors of the DataEvolver paper took a different bet: let the system build its own pipeline from scratch, then iterate on it.

How DataEvolver Bypasses Manual Curation

DataEvolver operates at two levels. At the operator level, it starts with a minimal set of data transformations and incrementally expands them into a logical plan, resolving dependency conflicts as it goes. At the pipeline level, it converts that logical plan into executable code, then runs a feedback loop that measures the distribution gap between the prepared data and a small set of high-quality examples.
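
The operator-level step can be pictured as a dependency closure followed by a topological sort. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the operator names, the OPERATOR_DEPS catalog, and both helper functions are hypothetical, and Python's standard graphlib module stands in for whatever conflict resolution DataEvolver actually performs.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical operator catalog: each operator maps to the operators
# that must run before it. None of these names come from the paper.
OPERATOR_DEPS = {
    "normalize_unicode": set(),
    "strip_html": set(),
    "deduplicate": {"normalize_unicode"},
    "language_filter": {"strip_html", "normalize_unicode"},
    "quality_filter": {"language_filter", "deduplicate"},
}


def expand_operators(seed, deps):
    """Return the seed operators plus everything they transitively depend on."""
    selected = set()
    stack = list(seed)
    while stack:
        op = stack.pop()
        if op in selected:
            continue
        if op not in deps:
            raise KeyError(f"unknown operator: {op}")
        selected.add(op)
        stack.extend(deps[op])
    return selected


def build_logical_plan(seed, deps):
    """Order the expanded operator set so every dependency runs first."""
    selected = expand_operators(seed, deps)
    graph = {op: deps[op] for op in selected}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


plan = build_logical_plan({"quality_filter"}, OPERATOR_DEPS)
print(plan)
```

Starting from a single seed operator, the plan grows to include all five operators, with quality_filter scheduled last because everything else feeds into it. A catalog containing a cycle raises ValueError instead of producing a plan.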

That feedback loop is the key. Rather than guessing which transformations work, DataEvolver directly minimizes the statistical distance between its output and the curated gold standard. It keeps iterating until the gap shrinks below a threshold.
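
A toy version of that loop is sketched below. Everything in it is an assumption made for illustration: the article does not say which statistical distance the authors use, so total variation distance over a word-count histogram stands in for it, and the pipeline is reduced to a single hypothetical knob (min_words) that a simple hill-climbing step adjusts.

```python
from collections import Counter


def length_histogram(docs, bin_width=5):
    """Normalized histogram of document lengths in words, bucketed by bin_width."""
    counts = Counter(len(doc.split()) // bin_width for doc in docs)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def run_pipeline(raw_docs, min_words):
    """Hypothetical one-operator pipeline: drop documents shorter than min_words."""
    return [doc for doc in raw_docs if len(doc.split()) >= min_words]


def evolve(raw_docs, gold_docs, threshold=0.1, step=5, max_iters=20):
    """Raise min_words while doing so shrinks the gap to the gold examples."""
    gold_hist = length_histogram(gold_docs)
    min_words = 0
    best_gap = total_variation(length_histogram(raw_docs), gold_hist)
    for _ in range(max_iters):
        if best_gap <= threshold:
            break  # close enough to the gold distribution
        candidate = min_words + step
        prepared = run_pipeline(raw_docs, candidate)
        if not prepared:
            break  # the filter removed everything; keep the previous setting
        gap = total_variation(length_histogram(prepared), gold_hist)
        if gap >= best_gap:
            break  # no improvement; stop refining
        min_words, best_gap = candidate, gap
    return min_words, best_gap


# Made-up data: half the raw documents are two-word junk.
raw_docs = ["spam spam"] * 4 + [" ".join(["word"] * 12)] * 4
gold_docs = [" ".join(["gold"] * 12)] * 3
min_words, gap = evolve(raw_docs, gold_docs)
print(min_words, gap)
```

In this toy run the initial gap is 0.5, and one refinement step that filters out the short documents brings it to zero, so the loop stops. A real system would compare richer features than document length and would search over whole pipelines rather than one parameter.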

Operator-Level and Pipeline-Level Self-Evolution

The multi-level mechanism solves two problems simultaneously. First, it ensures the pipeline is actually executable — no missing imports, no circular dependencies, no dead ends. Second, it optimizes for effectiveness: the prepared data drives better downstream LLM performance.
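
One way to picture the executability check is a static pass over the generated pipeline code before it is run. The sketch below is a hypothetical helper, not something the article describes: check_executable is an invented name, and it assumes the generated code is Python. It only verifies that the source parses and that every absolute import resolves to an installed module, which is a narrower guarantee than actually running the pipeline.

```python
import ast
import importlib.util


def check_executable(source):
    """Return a list of problems found in generated source; empty means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules = [node.module]
        else:
            continue
        for name in modules:
            top_level = name.split(".")[0]
            # find_spec returns None when a top-level module cannot be located.
            if importlib.util.find_spec(top_level) is None:
                problems.append(f"module not installed: {top_level}")
    return problems


print(check_executable("import json\nprint(json.dumps({}))"))
print(check_executable("import no_such_module_abc123"))
```

A pipeline that fails this check can be sent back for regeneration without spending compute on a run that was never going to start.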

You don't need a data engineering team to redesign the pipeline for every new dataset. DataEvolver adapts automatically to the distribution of the raw input, whether it's web crawl dumps, domain-specific corpora, or synthetic data.

What the Benchmarks Show

The team tested DataEvolver on seven benchmarks covering code generation, mathematics, and general reasoning. Training on data prepared by DataEvolver produced an average 10% improvement in downstream LLM accuracy compared to training on the original unfiltered data.
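
The article does not say whether the 10% figure is an absolute gain in accuracy points or a relative gain over the baseline, and the two can differ noticeably. The sketch below computes both averages so the distinction is concrete. The benchmark names and scores are made-up placeholders, not results from the paper, and average_gains is a hypothetical helper.

```python
from statistics import mean


def average_gains(baseline, treated):
    """Return (mean absolute gain, mean relative gain) across shared benchmarks."""
    absolute = [treated[name] - baseline[name] for name in baseline]
    relative = [(treated[name] - baseline[name]) / baseline[name] for name in baseline]
    return mean(absolute), mean(relative)


# Placeholder accuracies in percent; NOT taken from the paper.
baseline = {"bench_a": 40.0, "bench_b": 50.0}
treated = {"bench_a": 44.0, "bench_b": 60.0}

abs_gain, rel_gain = average_gains(baseline, treated)
print(f"absolute: {abs_gain:.1f} points, relative: {rel_gain:.0%}")
```

With these placeholder scores the average absolute gain is 7 points while the average relative gain is about 15%, which is why the baseline and the kind of percentage matter when reading a headline number.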

A 10% average gain is substantial for a change that touches only the training data and leaves the model and training recipe alone. And the gain comes without any additional human annotation budget.

DataEvolver points toward a future where data and models co-evolve: the model's own outputs could feed back into the data preparation pipeline, closing the loop between training data quality and model capability.


Source: DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving
Domain: arxiv.org
