EcoTable cuts LLM invocation costs by 5x while improving accuracy by more than 30% on table integration tasks, measured across four real-world benchmarks with more than 200 queries.
Why ETL Fails for Ad-Hoc Questions
Data lakes are dumping grounds for CSV and Parquet files without a unified schema. Traditional ETL forces a data engineer to predefine a target schema and build a complex pipeline - but that rigid schema rarely matches the exact join structure needed for unexpected analytical queries. The paper's authors argue this is a fundamental mismatch: you either over-invest in schema design or leave analysts stranded.
EcoTable flips the model. Instead of pre-integrating everything, it lets users ask natural language questions and dynamically assembles only the tables and joins required to answer them.
Graph Search Beats Burning Tokens on Every Candidate
The core trick is a lightweight graph where nodes are tables and edges carry weights predicted by a small deep learning model. Each weight estimates how likely it is that a pair of tables will need to be joined for the incoming query. Rather than enumerating all possible join paths with expensive LLM calls, EcoTable formulates the problem as a Steiner tree search on this weighted graph: find a low-cost tree that connects the tables the query needs, pulling in intermediate tables only where they are required to link them.
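To make the Steiner tree idea concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes join likelihoods are probabilities in (0, 1] and converts them to edge costs with -log(p), which is an illustrative choice, and it uses the classic shortest-path heuristic (a minimum spanning tree over the terminals' pairwise shortest paths) without the final pruning pass a full implementation would add. The table names and scores are made up.

```python
import heapq
import math
from itertools import combinations


def build_graph(join_scores):
    """Turn {(table_a, table_b): probability} into an undirected cost graph.

    Assumption: cost = -log(p), so likely joins are cheap to traverse.
    """
    graph = {}
    for (a, b), p in join_scores.items():
        cost = -math.log(p)
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def dijkstra(graph, source):
    """Return shortest-path costs and predecessors from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def approx_steiner_tree(graph, terminals):
    """Approximate a Steiner tree; returns a set of frozenset({u, v}) edges."""
    terminals = list(terminals)
    paths = {t: dijkstra(graph, t) for t in terminals}
    # Pairwise shortest-path costs between terminals, cheapest first.
    closure = sorted(
        (paths[a][0][b], a, b)
        for a, b in combinations(terminals, 2)
        if b in paths[a][0]
    )
    parent = {t: t for t in terminals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    for _, a, b in closure:  # Kruskal over the terminal closure
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        parent[ra] = rb
        # Expand the a -> b shortest path back into real graph edges.
        prev = paths[a][1]
        node = b
        while node != a:
            edges.add(frozenset((node, prev[node])))
            node = prev[node]
    return edges


# Hypothetical scores from the small model for one query.
join_scores = {
    ("orders", "customers"): 0.9,
    ("customers", "regions"): 0.8,
    ("orders", "regions"): 0.1,
    ("orders", "products"): 0.7,
}
graph = build_graph(join_scores)
tree = approx_steiner_tree(graph, ["orders", "regions"])
```

In this toy case the query needs only orders and regions, but the direct join is unlikely, so the search routes through customers as a bridging table and ignores products entirely.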
Three components handle the heavy lifting: (1) a table identification layer uses two-stage schema linking to pick out the tables relevant to the query, (2) a graph-based validation layer finds significant join paths - including bridging tables and necessary transformations - via Steiner tree search, and (3) a table transformation layer hands only those joins to an LLM to generate the transformation code. The cheap DNN pre-filter keeps the LLM from burning tokens on dead-end joins.
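The cost argument can be illustrated with a toy sketch. This is not EcoTable's pipeline: it swaps the Steiner tree search for a plain score threshold, and `toy_score`, `toy_llm`, the table names, and the 0.5 cutoff are all hypothetical stand-ins. The point is only that the expensive call runs on the joins the cheap scorer lets through, not on every candidate pair.

```python
from itertools import combinations


def prefiltered_transform(tables, score_join, llm_generate, threshold=0.5):
    """Call llm_generate only for table pairs the cheap scorer rates >= threshold."""
    results = {}
    for a, b in combinations(tables, 2):
        if score_join(a, b) >= threshold:          # cheap model call
            results[(a, b)] = llm_generate(a, b)   # expensive LLM call
    return results


# Hypothetical scores standing in for the small model's predictions.
SCORES = {("orders", "customers"): 0.9, ("customers", "regions"): 0.8}


def toy_score(a, b):
    """Look up a made-up join likelihood; unknown pairs get a low default."""
    return SCORES.get((a, b), SCORES.get((b, a), 0.05))


llm_calls = []


def toy_llm(a, b):
    """Stand-in for an LLM call; records the request and returns a placeholder."""
    llm_calls.append((a, b))
    return f"-- join {a} with {b}"


tables = ["orders", "customers", "regions", "products"]
out = prefiltered_transform(tables, toy_score, toy_llm)
candidate_pairs = len(list(combinations(tables, 2)))
```

With four tables there are six candidate pairs, but the stand-in LLM is invoked for only the two pairs that pass the filter.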
Real Queries, Real Savings
The team built four benchmark datasets from real-world data lake scenarios, totaling over 200 queries. Against state-of-the-art baselines, EcoTable delivers a 30%+ accuracy lift and cuts the total LLM cost by 5x. That cost reduction is the kind of number that makes a CTO pay attention: it means you can run more queries for the same budget, or route integration logic through a much cheaper model tier.
Next step: extending the framework to handle streaming data and incremental updates, so the graph doesn't need a full rebuild every time a new file lands in the lake.
Source: EcoTable: Cost-effective Table Integration in Data Lakes for Natural Language Queries
Domain: arxiv.org