
LasRepair++ improves data repair F1-score by 18.1% through LLM-SLM collaboration

A collaborative framework pairs a large language model as instructor with a small model as corrector, raising F1-score by 18.1% on average over the strongest baseline on real-world datasets.

large language models, small language models, data repair, data cleaning, lasrepair, arxiv

LasRepair++ delivers an 18.1% average F1-score improvement over the strongest baseline on real-world data repair benchmarks. That is not incremental tuning; it is a different approach to the problem.

Why Direct LLM Repair Falls Short

Two things break when you throw a raw dataset at an LLM and ask it to fix errors. First, the model sees the corrupted data in its original, low-quality context, which poisons the learning signal. Second, the model's outputs are uncertain, yet most methods treat every repair as equally reliable, which injects noise into the final dataset. The two issues compound: bad context leads to bad repairs, and those bad repairs are then used as ground truth for the next round.

LasRepair's Two-Model Architecture

LasRepair splits the workload deliberately. A large language model acts as an instructor, selecting a global repair context from the dataset that is cleaner and more representative. That context gets handed to a small language model, which acts as the efficient corrector. The small model does the actual row-level repairs using the curated context, not the raw mess. This separation lets the LLM do the expensive reasoning once, while the SLM executes cheaply at scale.
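The division of labour can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: both models are replaced by simple stand-in functions, and the names `select_context`, `repair_row`, and `repair_table`, the frequency-based scoring, and the zip/city data are all assumptions made for the example. What it shows is the shape of the pipeline: the context is chosen once, then reused for every row.

```python
from collections import Counter
from typing import Optional

Row = dict[str, Optional[str]]


def select_context(rows: list[Row], k: int) -> list[Row]:
    """Stand-in for the LLM instructor: keep the k most typical complete rows."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # Count how often each value occurs in its column among complete rows.
    counts = {col: Counter(r[col] for r in complete) for col in rows[0]}

    def typicality(row: Row) -> int:
        return sum(counts[col][row[col]] for col in row)

    return sorted(complete, key=typicality, reverse=True)[:k]


def repair_row(row: Row, context: list[Row], key: str) -> Row:
    """Stand-in for the SLM corrector: fill missing cells from matching context rows."""
    fixed = dict(row)
    matches = [c for c in context if c[key] == row[key]]
    for col, value in row.items():
        if value is None and matches:
            fixed[col] = Counter(m[col] for m in matches).most_common(1)[0][0]
    return fixed


def repair_table(rows: list[Row], k: int, key: str) -> list[Row]:
    context = select_context(rows, k)  # expensive step, done once
    return [repair_row(r, context, key) for r in rows]  # cheap step, per row


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": None},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": None},
]
repaired = repair_table(rows, k=3, key="zip")
```

In a real system the first function would be a single LLM call and the second a fine-tuned small model, so the cost of the large model does not grow with the number of rows.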

EM and Confidence Weighting Push Further

The authors extend LasRepair in two concrete steps. LasRepair+ formulates data repair as an Expectation-Maximisation procedure, alternating between updating the corrector parameters (E-step) and refining the repair context (M-step). The paper gives a theoretical proof that this EM-style loop converges. LasRepair++ goes one step further by introducing column-calibrated model confidence. Unreliable repaired rows are down-weighted when the corrector is updated, so low-confidence fixes don't corrupt the learning signal. Together with context selection and the EM loop, that weighting is the mechanism behind the 18.1% gain.
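To make the down-weighting idea concrete, here is a minimal sketch under stated assumptions. The article does not say how confidence is calibrated per column, so the rule below (raw confidence multiplied by a per-column accuracy estimate) is an assumption, as are the names `calibrate` and `weighted_update` and the toy lookup-table corrector. It shows only one update of the corrector, not the full alternating loop.

```python
from collections import defaultdict


def calibrate(raw_conf: float, column_accuracy: float) -> float:
    """Assumed calibration rule: scale raw confidence by a per-column accuracy estimate."""
    return raw_conf * column_accuracy


def weighted_update(repairs, column_accuracy):
    """Refit a toy corrector (a lookup table) from proposed repairs.

    repairs: iterable of (column, key, proposed_value, raw_confidence).
    Each proposal votes with its calibrated confidence rather than a flat weight of 1.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for column, key, value, raw_conf in repairs:
        votes[(column, key)][value] += calibrate(raw_conf, column_accuracy[column])
    # For each (column, key), keep the value with the largest total weight.
    return {ck: max(vals, key=vals.get) for ck, vals in votes.items()}


repairs = [
    ("city", "10001", "New York", 0.9),  # one confident repair
    ("city", "10001", "Newark", 0.4),    # two low-confidence repairs
    ("city", "10001", "Newark", 0.4),
]
corrector = weighted_update(repairs, column_accuracy={"city": 0.8})
```

With unweighted voting the two low-confidence "Newark" repairs would outvote the single confident one. With confidence weighting the confident repair wins, which is the behaviour the paragraph above describes.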

The results on real-world datasets back this up: the combination of context selection, EM iteration, and confidence weighting beats the strongest baseline by 18.1% F1 on average. The pattern of LLM as instructor and SLM as scalable corrector is general enough to apply beyond data cleaning, to any task where you need accuracy on a budget.


Source: Collaborative Large and Small Language Models for Accurate and Scalable Data Repair
Domain: arxiv.org
