
Baidu's CoeusBI Kills JOINs With LLM-Powered View Generation

CoeusBI replaces error-prone multi-table JOINs with LLM-generated single views, cutting SQL generation errors and schema linking costs across Baidu's production data platform.

Baidu, CoeusBI, large language models, business intelligence, SQL generation, hierarchical schema linking

JOIN operations are the silent killer of text-to-SQL accuracy in enterprise BI systems. Baidu's CoeusBI cuts the Gordian knot with an offline LLM agent that rewrites those messy multi-table queries into clean single-view queries, without any human touching the semantic layer.

The JOIN Tax That Keeps Getting Worse

Every production BI system I've seen eventually hits the wall: frequent JOINs degrade SQL generation accuracy, wide schemas make schema linking a nightmare, and dialect-specific queries plus multi-turn conversations blow up costs. Baidu's paper lays out exactly these three failure modes. CoeusBI tackles all three with a single architectural bet: automated view generation.

The offline View Generation Agent uses error feedback to autonomously convert complex JOIN queries into simple single-view queries. That eliminates the need for manual semantic modeling entirely. No more hand-crafting metric definitions or dimension tables. The agent learns from its own mistakes and produces views that are trivial for downstream models to query.
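To make the error-feedback idea concrete, here is a minimal sketch of such a loop using Python's built-in sqlite3 module. It is an illustration under stated assumptions, not Baidu's implementation: the tables, the `order_facts` view, and the `propose_view` function (a hard-coded stand-in for an LLM call) are all hypothetical.

```python
from __future__ import annotations

import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory warehouse with two tables that need a JOIN."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
        CREATE TABLE users (user_id INTEGER, region TEXT);
        INSERT INTO orders VALUES (1, 10, 20.0), (2, 11, 5.0), (3, 10, 7.5);
        INSERT INTO users VALUES (10, 'north'), (11, 'south');
        """
    )
    return conn


def propose_view(feedback: str | None) -> str:
    """Hypothetical stand-in for the LLM: returns view DDL, given the last error."""
    # The first attempt references a column that does not exist (u.area),
    # so the loop below has an error to feed back.
    column = "u.area" if feedback is None else "u.region"
    return (
        "CREATE VIEW order_facts AS "
        f"SELECT o.order_id, o.amount, {column} AS region "
        "FROM orders o JOIN users u ON o.user_id = u.user_id"
    )


def generate_view(conn: sqlite3.Connection, max_attempts: int = 3) -> list[str]:
    """Retry view creation, passing each database error back to the proposer."""
    feedback = None
    errors = []
    for _ in range(max_attempts):
        try:
            conn.execute(propose_view(feedback))
            # SQLite resolves column names lazily, so query the view to validate it.
            conn.execute("SELECT * FROM order_facts LIMIT 1").fetchall()
            return errors
        except sqlite3.Error as exc:
            feedback = str(exc)
            errors.append(feedback)
            conn.execute("DROP VIEW IF EXISTS order_facts")
    raise RuntimeError(f"no valid view after {max_attempts} attempts: {errors}")


if __name__ == "__main__":
    conn = build_warehouse()
    print("errors fed back:", generate_view(conn))
    # Downstream queries now hit one view and never spell out the JOIN.
    print(conn.execute(
        "SELECT region, SUM(amount) FROM order_facts GROUP BY region ORDER BY region"
    ).fetchall())
```

The point of the sketch is the shape of the loop: the database itself acts as the validator, and its error message is the only signal the proposer gets between attempts.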

Hierarchy to the Rescue

CoeusBI's Hierarchical Schema Linking module leverages vector retrieval over those generated views, not over the raw warehouse schema. That handles wide schemas efficiently because the search space shrinks from thousands of columns to dozens of views. The retrieval is fast and the accuracy stays high.
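A two-stage lookup of this kind can be sketched as follows: rank views first, then rank columns only inside the winning views. This is a toy illustration, not the paper's module. The view catalog is invented, and the bag-of-words `embed` function is a deliberately simple stand-in for a real embedding model and vector index.

```python
import math
from collections import Counter

# Hypothetical catalog of generated views with short descriptions.
VIEWS = {
    "order_facts": {
        "description": "orders with amount and customer region",
        "columns": {
            "order_id": "order identifier",
            "amount": "order revenue amount",
            "region": "customer region",
        },
    },
    "ad_clicks": {
        "description": "advertising clicks per campaign and day",
        "columns": {
            "campaign": "ad campaign name",
            "clicks": "number of ad clicks",
            "day": "calendar day",
        },
    },
}


def embed(text):
    """Toy embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_schema(question, top_views=1, top_columns=2):
    """Stage 1: pick the best views. Stage 2: pick columns within those views."""
    q = embed(question)
    best_views = sorted(
        VIEWS,
        key=lambda v: cosine(q, embed(VIEWS[v]["description"])),
        reverse=True,
    )[:top_views]
    linked = {}
    for view in best_views:
        cols = VIEWS[view]["columns"]
        linked[view] = sorted(
            cols,
            key=lambda c: cosine(q, embed(c + " " + cols[c])),
            reverse=True,
        )[:top_columns]
    return linked


if __name__ == "__main__":
    print(link_schema("total revenue amount by customer region"))
```

Because the second stage only scores columns of views that survived the first, the number of comparisons grows with the number of views plus the width of a few selected views, rather than with every column in the warehouse.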

A dynamic Routing Agent evaluates each dialogue context to decide whether to synthesize a new intermediate representation or patch an existing one. A deterministic, dialect-agnostic SQL compiler then turns that representation into executable SQL. No more writing separate compilers for Presto vs. Spark SQL vs. ClickHouse.
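The route-then-compile flow might look roughly like the sketch below. Everything in it is assumed for illustration: the `QueryIR` fields, the keyword-based `route` rule (the real Routing Agent is LLM-driven), and the compiler, which only varies identifier quoting (double quotes for Presto, backticks for Spark SQL and ClickHouse).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryIR:
    """Hypothetical intermediate representation of one aggregate query."""
    view: str
    dimensions: tuple
    measure: str          # column to SUM
    filters: tuple = ()   # (column, value) equality filters


# Identifier quote character per dialect.
QUOTES = {"presto": '"', "spark": "`", "clickhouse": "`"}


def route(previous, follow_up):
    """Toy routing rule: patch the previous IR when the follow-up only narrows it."""
    if previous is not None and follow_up.lower().startswith(("only", "and", "just")):
        return "patch"
    return "new"


def patch_filter(ir, column, value):
    """Return a copy of the IR with one extra filter, leaving the rest untouched."""
    return replace(ir, filters=ir.filters + ((column, value),))


def literal(value):
    """Render a string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def compile_sql(ir, dialect):
    """Deterministically render the IR as SQL for the requested dialect."""
    q = QUOTES[dialect]

    def ident(name):
        return f"{q}{name}{q}"

    dims = ", ".join(ident(d) for d in ir.dimensions)
    sql = f"SELECT {dims}, SUM({ident(ir.measure)}) AS total FROM {ident(ir.view)}"
    if ir.filters:
        conds = " AND ".join(f"{ident(c)} = {literal(v)}" for c, v in ir.filters)
        sql += f" WHERE {conds}"
    return sql + f" GROUP BY {dims}"


if __name__ == "__main__":
    ir = QueryIR("order_facts", ("region",), "amount")
    if route(ir, "only the north region") == "patch":
        ir = patch_filter(ir, "region", "north")
    for dialect in QUOTES:
        print(dialect, "->", compile_sql(ir, dialect))
```

In this arrangement a follow-up turn costs one small IR edit rather than a full regeneration, and the same IR yields SQL for every supported dialect without another model call.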

Production-Proven at Baidu Scale

Baidu deployed CoeusBI as a standalone service on its data platform, serving thousands of users daily across multiple business lines. The paper reports significant improvements in query accuracy, token efficiency, and user satisfaction compared to baseline methods. Those baselines include OpenAI's GPT-4 and other commercial LLM-based BI tools.

What matters most: CoeusBI does not require any manual configuration of metrics or dimensions. That's the unlock that lets it scale across Baidu's sprawling data ecosystem without a dedicated team of BI engineers maintaining the semantic layer every time a new table gets added.

What This Enables Next

CoeusBI proves that an offline view generation agent plus hierarchical retrieval can replace the hand-tuned semantic layer that has bottlenecked every text-to-SQL BI system since Looker. Expect other large-scale data platforms to copy this pattern, especially those supporting thousands of tables and daily schema changes that break manual configurations.


Source: CoeusBI: A Comprehensive Interactive Business Intelligence System Powered by LLMs at Baidu [Extended Version]
Domain: arxiv.org
