TabClean Compiles LLM Reasoning Into Reusable Data Cleaning Programs

arxiv.org · @bold_bear · 2 hours ago · Machine Learning

By synthesizing deterministic Python transformations from a small annotated set, TabClean cuts recurring LLM calls while improving F1 over baselines on five of six benchmarks.

Tags: tabclean, llm, data cleaning, python, tabular data, machine learning

Production tables are rarely clean: they contain missing values, typos, inconsistent formats, violated dependencies, unit mismatches, and ambiguous categories. LLM-based cleaners fix some of these problems, but they pay for inference on every row, every cell, and every batch. TabClean cuts that cost by compiling LLM reasoning into reusable, deterministic Python programs.

TabClean needs only a small annotated development set. It profiles the dirty table, diagnoses what needs repair, then synthesizes executable Python transformations. The key abstraction is an evidence-backed guarded repair clause: a transformation fires only when its dirty pattern, target-negative condition, evidence support, and scope constraints are all satisfied.

How TabClean Avoids Per-Row LLM Calls

Existing LLM cleaners call the model repeatedly on rows or individual cells, so cost scales linearly with table size and with every re-run on new batches. TabClean replaces that repeated inference with a compiled program. Once the best candidate program is validated with cell-level feedback and committed, it runs deterministically on schema-compatible batches. That means zero extra LLM calls for recurring cleaning jobs.
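The validate-then-reuse step can be sketched in a few lines of Python. This is an illustration under assumptions, not TabClean's code: the helper names (`cell_f1`, `select_program`, `run_on_batch`), the representation of a table as a list of dicts, and the use of cell-level F1 as the selection score are all hypothetical. Candidate programs are plain functions, standing in for whatever the LLM synthesized.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]
Table = List[Row]
Program = Callable[[Table], Table]  # hypothetical: a compiled cleaning program


def cell_f1(dirty: Table, repaired: Table, clean: Table) -> float:
    """Cell-level F1: precision over changed cells, recall over erroneous cells."""
    changed = correct = errors = 0
    for d, r, c in zip(dirty, repaired, clean):
        for col in c:
            if d[col] != c[col]:
                errors += 1          # cell was actually dirty
            if r[col] != d[col]:
                changed += 1         # program touched this cell
                if r[col] == c[col]:
                    correct += 1     # and repaired it correctly
    if correct == 0:
        return 0.0
    precision, recall = correct / changed, correct / errors
    return 2 * precision * recall / (precision + recall)


def select_program(candidates: Sequence[Program], dirty_dev: Table, clean_dev: Table) -> Program:
    """Keep the candidate that scores best on the annotated development set."""
    return max(candidates, key=lambda p: cell_f1(dirty_dev, p(dirty_dev), clean_dev))


def run_on_batch(program: Program, batch: Table, schema: Sequence[str]) -> Table:
    """Apply the committed program to a new batch; no LLM call is involved."""
    for row in batch:
        if set(row) != set(schema):
            raise ValueError("batch is not schema-compatible")
    return program(batch)


# Two toy candidates (assumed to have been synthesized earlier).
def strip_only(table: Table) -> Table:
    return [{**r, "city": r["city"].strip()} for r in table]


def strip_title(table: Table) -> Table:
    return [{**r, "city": r["city"].strip().title()} for r in table]


dirty_dev = [{"city": " paris"}, {"city": "Rome"}, {"city": "berlin "}]
clean_dev = [{"city": "Paris"}, {"city": "Rome"}, {"city": "Berlin"}]

best = select_program([strip_only, strip_title], dirty_dev, clean_dev)
cleaned = run_on_batch(best, [{"city": "  madrid"}], schema=["city"])
print(cleaned)  # [{'city': 'Madrid'}]
```

The point of the sketch is the cost structure: the model is consulted only while candidates are being produced and scored, and every later batch goes through `run_on_batch` alone.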

The evaluation covers six benchmarks. TabClean achieves high precision across the board and improves F1 over representative rule-based, learning-based, and direct LLM baselines on five of the six datasets. No retraining, no expert-written constraints, just one small annotation pass and a compiled program.

Guarded Repair Clauses Keep Programs Safe

A raw LLM-generated transformation could introduce errors. TabClean wraps each transformation in a guarded clause that checks four conditions before firing: the dirty pattern must match, the target must not be a known negative example, the evidence support from table profiling must be sufficient, and scope constraints (e.g., column type, value range) must hold. If any guard fails, the transformation doesn't fire. This design keeps the program safe without needing a human in the loop for every edge case.
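To make the four guards concrete, here is a minimal sketch of what such a clause might look like. The class name `GuardedRepair`, its fields, and the way each guard is evaluated are assumptions made for illustration; the source does not give the actual representation. In particular, the negative check is modeled as a set of repair outputs known to be wrong, and evidence support as a frequency count of the target value taken from column profiling.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Mapping

Row = Dict[str, str]


@dataclass(frozen=True)
class GuardedRepair:
    """Hypothetical guarded repair clause: the transform fires only if all guards pass."""
    column: str
    dirty_pattern: re.Pattern           # guard 1: what a dirty value looks like
    negatives: FrozenSet[str]           # guard 2: repair outputs known to be wrong
    min_support: int                    # guard 3: required evidence from profiling
    in_scope: Callable[[Row], bool]     # guard 4: scope constraint on the row
    transform: Callable[[str], str]     # the repair itself

    def apply(self, row: Row, support: Mapping[str, int]) -> Row:
        value = row[self.column]
        if self.dirty_pattern.fullmatch(value) is None:
            return row                  # dirty pattern does not match
        if not self.in_scope(row):
            return row                  # outside the clause's scope
        target = self.transform(value)
        if target in self.negatives:
            return row                  # target is a known negative example
        if support.get(target, 0) < self.min_support:
            return row                  # too little evidence for this target
        return {**row, self.column: target}


# Example: upper-case two-letter state codes, but only for US rows and only
# when the resulting code is already well attested in the column.
clause = GuardedRepair(
    column="state",
    dirty_pattern=re.compile(r"[a-z]{2}"),
    negatives=frozenset({"XX"}),
    min_support=5,
    in_scope=lambda row: row.get("country") == "US",
    transform=str.upper,
)
support = {"CA": 40, "NY": 25, "XX": 10}  # assumed value counts from profiling

print(clause.apply({"state": "ca", "country": "US"}, support))
# {'state': 'CA', 'country': 'US'}
```

With the same clause, `{"state": "zz", "country": "US"}` is returned unchanged because `ZZ` has no support, and `{"state": "ca", "country": "FR"}` is returned unchanged because it is out of scope. Failing closed in this way trades some recall for precision.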

Benchmark Results and Bottom Line

TabClean doesn't just cut costs; it competes head-to-head. Against constraint-based systems that need experts, learning-based systems that need labels and retraining, and naive LLM cleaners that pay per-row, TabClean wins on F1 in five out of six benchmarks while substantially reducing recurring runtime and API cost. The tradeoff is a small annotation effort up front, but after that the cleaning pipeline is fast, deterministic, and reusable.

What this enables: clean tabular data at scale without a per-row LLM budget. For any team running repeated data pipelines on schema-stable tables, TabClean turns a recurring LLM bill into a one-time synthesis cost, after which the compiled program runs with no further LLM calls.


Source: TabClean: Reusable LLM-Synthesized Programs for Tabular Data Cleaning
Domain: arxiv.org
