Spider 2.0-Snow's largest database subset dumps 2.6 million tokens into the prompt—DBCC squeezes that down to 34,700 tokens without losing what the SQL generator actually needs.
The Real Bottleneck Isn't Reasoning—It's Database Representation
Recent Text-to-SQL systems already pair strong language models with clever prompting, yet execution accuracy on real-world benchmarks like Spider 2.0 and BIRD still lags far behind results on academic datasets. The authors of DBCC argue the bottleneck has shifted: reasoning is no longer the limiting factor. Enterprise databases carry repeated audit columns, tables that are near-copies of each other, opaque IDs whose meanings live only in documentation, and bloated data dictionaries full of query-irrelevant noise. Existing query-aware methods like schema linking and retrieval-based selection still feed redundant, verbose representations into the model.
DBCC: Offline Compression and Online Purification
DBCC reformulates the problem as database context compression—a query-agnostic transformation that rewrites schemas, semantic descriptions, and external documentation into a compact representation. The formal core is the SGCF (Support-Gain Component Factorization) principle, which unifies four operations under a single coverage objective: repeated column extraction, isomorphic table templating, semantic componentization, and evidence purification. DBCC runs as a database-side middleware: it performs structural and semantic compression offline, then a lightweight online step purifies evidence for the specific query. The whole pipeline is model-agnostic.
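To make the structural half of that list concrete, here is a minimal sketch of two of the four operations, repeated column extraction and isomorphic table templating, written as plain dictionary transformations. It is an illustration under stated assumptions, not DBCC's actual algorithm: the function names, the `min_tables` threshold, and the toy schema are all hypothetical, and the SGCF coverage objective is not modeled.

```python
from collections import defaultdict

# Toy schema (hypothetical): table name -> list of (column, type) pairs.
schema = {
    "events_2023": [("id", "INT"), ("payload", "TEXT"),
                    ("created_at", "TIMESTAMP"), ("updated_by", "TEXT")],
    "events_2024": [("id", "INT"), ("payload", "TEXT"),
                    ("created_at", "TIMESTAMP"), ("updated_by", "TEXT")],
    "users": [("id", "INT"), ("email", "TEXT"),
              ("created_at", "TIMESTAMP"), ("updated_by", "TEXT")],
}


def extract_repeated_columns(schema, min_tables=3):
    """Factor out columns that occur in at least `min_tables` tables.

    Returns (shared, residual): `shared` maps each repeated column to the
    tables that contain it, so the column is described once instead of once
    per table; `residual` keeps only the table-specific columns.
    """
    tables_by_column = defaultdict(list)
    for table, columns in schema.items():
        for column in columns:
            tables_by_column[column].append(table)
    shared = {col: tables for col, tables in tables_by_column.items()
              if len(tables) >= min_tables}
    residual = {table: [c for c in columns if c not in shared]
                for table, columns in schema.items()}
    return shared, residual


def template_isomorphic_tables(schema):
    """Group tables whose column lists are identical into one template."""
    groups = defaultdict(list)
    for table, columns in schema.items():
        groups[tuple(columns)].append(table)
    # Only signatures shared by more than one table are worth templating.
    return {sig: tables for sig, tables in groups.items() if len(tables) > 1}


shared, residual = extract_repeated_columns(schema)
templates = template_isomorphic_tables(schema)
```

Both steps depend only on the schema, never on a user question, which is what lets this kind of compression run offline once per database.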
Two Orders of Magnitude Reduction, Real Accuracy Gains
On Spider 2.0-Snow and BIRD, DBCC cuts input context by up to 99%—from 2.6M to 34.7K tokens on the largest subset. Schema-linking strict recall jumps from 0% to 56.5% under DeepSeek-V3.2 and 63.1% under Claude Opus 4.7. End-to-end execution accuracy consistently increases by 1.8–1.9% across three recent Text-to-SQL systems. That's not a revolution—it's a concrete, measurable fix for a problem the field had stopped blaming. The code is open-sourced at https://github.com/MrBlankness/SchemaCompression.
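As a quick sanity check on the headline figures, the short calculation below converts the two reported token counts into a percentage reduction and a compression factor. The variable names are illustrative, and the inputs are the rounded values quoted above (2.6M and 34.7K).

```python
tokens_before = 2_600_000  # largest subset, uncompressed (rounded, as reported)
tokens_after = 34_700      # same subset after compression (rounded, as reported)

reduction = 1 - tokens_after / tokens_before  # fraction of tokens removed
factor = tokens_before / tokens_after         # how many times smaller

print(f"{reduction:.1%} reduction, {factor:.0f}x smaller")
# prints "98.7% reduction, 75x smaller"
```

A factor of about 75 is a little under two full orders of magnitude, which matches the "up to 99%" phrasing.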
DBCC won't make Text-to-SQL perfect, but it removes the biggest source of noise nobody was treating as primary. Expect this compression-first approach to become standard preprocessing in enterprise SQL pipelines.
Source: Database Context Compression for Text-to-SQL on Real-World Large Databases
Domain: arxiv.org