EcoTable Cuts LLM Costs 5x While Boosting Table Integration Accuracy 30%

A new framework uses a graph-based search and a lightweight deep learning model to pick the right join paths, slashing expensive LLM calls while delivering better results on 200+ benchmark queries.

Tags: ecotable, llms, data lakes, etl, table integration, natural language queries

EcoTable cuts LLM invocation costs by a factor of five while boosting accuracy by more than 30% on table integration tasks, measured across four real-world benchmarks with more than 200 queries.

Why ETL Fails for Ad-Hoc Questions

Data lakes are dumping grounds for CSV and Parquet files without a unified schema. Traditional ETL forces a data engineer to predefine a target schema and build a complex pipeline - but that rigid schema rarely matches the exact join structure needed for unexpected analytical queries. The paper's authors argue this is a fundamental mismatch: you either over-invest in schema design or leave analysts stranded.

EcoTable flips the model. Instead of pre-integrating everything, it lets users ask natural language questions and dynamically assembles only the tables and joins required to answer them.

Graph Search Beats Burning Tokens on Every Candidate

The core trick is a lightweight graph where nodes are tables and edges carry weights predicted by a small deep learning model. Each weight estimates how likely it is that a pair of tables will need to be joined for the incoming query. Rather than enumerating all possible join paths with expensive LLM calls, EcoTable formulates the problem as a Steiner tree search on this weighted graph: find a low-cost tree that connects all the tables the query needs, pulling in intermediate tables where necessary.
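To make the Steiner tree idea concrete, here is a minimal sketch in Python. It is not EcoTable's implementation: the table names and join likelihoods are invented, the conversion of a likelihood p into an edge cost of -log(p) is an assumption (it makes the cheapest tree the one with the highest product of likelihoods), and the search is the classic greedy shortest-path heuristic rather than an exact solver.

```python
import heapq
import math


def build_graph(edge_probs):
    """Build an undirected adjacency map from {(table_a, table_b): likelihood}.

    Assumption: cost = -log(likelihood), so likely joins are cheap edges.
    """
    graph = {}
    for (a, b), prob in edge_probs.items():
        cost = -math.log(prob)
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def nearest_terminal_path(graph, tree_nodes, remaining):
    """Multi-source Dijkstra from the current tree to the closest remaining terminal."""
    dist = {node: 0.0 for node in tree_nodes}
    prev = {}
    heap = [(0.0, node) for node in sorted(tree_nodes)]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        if node in remaining:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]  # from a tree node to the new terminal
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return None


def steiner_join_tree(edge_probs, terminals):
    """Greedy approximation: attach the nearest required table until all are connected."""
    graph = build_graph(edge_probs)
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    remaining = set(terminals[1:])
    tree_edges = set()
    while remaining:
        path = nearest_terminal_path(graph, tree_nodes, remaining)
        if path is None:
            raise ValueError("required tables are not connected in the join graph")
        for a, b in zip(path, path[1:]):
            tree_edges.add(tuple(sorted((a, b))))
        tree_nodes.update(path)
        remaining -= set(path)
    return tree_edges


# Hypothetical join likelihoods, standing in for the model's predictions.
edge_probs = {
    ("orders", "customers"): 0.9,
    ("orders", "order_items"): 0.9,
    ("order_items", "products"): 0.8,
    ("customers", "products"): 0.05,
    ("products", "suppliers"): 0.7,
}
tree = steiner_join_tree(edge_probs, ["customers", "products"])
print(sorted(tree))
```

In this toy lake the query needs only customers and products, but the direct edge between them is unlikely, so the search routes through orders and order_items as bridging tables and ignores suppliers entirely.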

Three components handle the heavy lifting: (1) a table identification layer uses two-stage schema linking to pull the tables relevant to the query, (2) a graph-based validation layer finds significant join paths, including bridging tables and necessary transformations, via Steiner tree search, and (3) a table transformation layer hands only those joins to an LLM to generate the transformation code. The cheap deep learning pre-filter keeps the LLM from burning tokens on dead-end joins.
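The way the three layers keep LLM calls down can be sketched as a small pipeline. Everything here is illustrative: the catalog, the scores, and the function names are hypothetical, keyword matching stands in for two-stage schema linking, a simple score threshold stands in for the Steiner tree search, and the LLM is a stub that only counts how often it is called.

```python
from itertools import combinations


def identify_tables(query, catalog):
    """Layer 1 stand-in: keep tables whose name or a column appears in the query."""
    words = {word.strip(".,?").lower() for word in query.split()}
    return [name for name, columns in catalog.items()
            if words & ({name} | set(columns))]


def validate_joins(tables, join_scores, threshold=0.5):
    """Layer 2 stand-in: keep candidate pairs whose predicted score clears a threshold."""
    return [(a, b) for a, b in combinations(sorted(tables), 2)
            if join_scores.get((a, b), 0.0) >= threshold]


def transform(joins, llm):
    """Layer 3: call the LLM once per validated join, never per candidate pair."""
    return [llm(f"Write code to join {a} with {b}") for a, b in joins]


class CountingLLM:
    """Stub LLM that records how many times it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"-- generated for: {prompt}"


# Hypothetical data lake catalog and model scores.
catalog = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "region"],
    "products": ["product_id", "price"],
    "suppliers": ["supplier_id", "country"],
}
join_scores = {
    ("customers", "orders"): 0.9,
    ("orders", "products"): 0.7,
    ("customers", "products"): 0.1,
}

llm = CountingLLM()
tables = identify_tables("total price of orders per region", catalog)
joins = validate_joins(tables, join_scores)
snippets = transform(joins, llm)
print(tables, joins, llm.calls)
```

With four tables in the catalog there are six possible pairs, yet the stub LLM is invoked only twice, once for each join that survives the cheap filtering layers.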

Real Queries, Real Savings

The team built four benchmark datasets from real-world data lake scenarios, totaling over 200 queries. Against state-of-the-art baselines, EcoTable delivers a 30%+ accuracy lift and cuts the total LLM cost by 5x. That cost reduction is the kind of number that makes a CTO pay attention: it means you can run more queries for the same budget, or route integration logic through a much cheaper model tier.

Next step: extending the framework to handle streaming data and incremental updates, so the graph doesn't need a full rebuild every time a new file lands in the lake.


Source: EcoTable: Cost-effective Table Integration in Data Lakes for Natural Language Queries
Domain: arxiv.org
