Stratified Greedy Algorithm Solves Three Hard Problems in Text-to-SQL Active Learning

A new algorithm with constant-factor approximation guarantees reduces annotation effort for text-to-SQL systems by handling unreliable annotators, diversity constraints, and an unknown covariance structure.

Tags: text-to-SQL, active learning, large language models, few-shot learning, robust active learning, stratified greedy algorithm

If you've ever built a text-to-SQL system, you know the bottleneck isn't the model; it's the annotated examples. A new arXiv paper (2606.10125) formalizes the active selection of those examples as a constrained experimental design problem and proposes a stratified greedy algorithm built for conditions that standard active learning assumes away: unreliable annotators, diversity requirements, and a misspecified model of the data.

Three Challenges That Break Standard Active Learning

Most active learning frameworks assume annotation quality is uniform and that you can freely pick any examples. In text-to-SQL, those assumptions collapse. The first problem is heteroscedasticity: annotation reliability varies per query because some SQL patterns are inherently harder for human labelers, so label noise is not constant across examples. Second, the chosen examples must be spread across semantic topics. The paper expresses this as a partition matroid constraint, which caps how many examples may come from each topic and so prevents clustering around popular query types. Third, you don't know the true covariance structure of the embedding space, so whatever kernel you select with is only a surrogate. Call it misspecification.
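To make the second constraint concrete, here is a minimal sketch of a partition matroid feasibility check. The topic labels, the `topic_of` mapping, and the per-topic `capacity` values are hypothetical; the article does not say how the paper assigns queries to topics or sets the caps.

```python
from collections import Counter


def is_feasible(selected, topic_of, capacity):
    """Return True if `selected` respects a partition matroid constraint.

    selected: iterable of example ids
    topic_of: dict mapping example id -> topic label (hypothetical clustering)
    capacity: dict mapping topic label -> max picks allowed from that topic
    """
    counts = Counter(topic_of[i] for i in selected)
    # A set is independent iff no topic exceeds its cap.
    return all(counts[t] <= capacity.get(t, 0) for t in counts)


# Toy data: four candidate queries in three made-up topics.
topic_of = {0: "joins", 1: "joins", 2: "aggregates", 3: "subqueries"}
capacity = {"joins": 1, "aggregates": 1, "subqueries": 1}

print(is_feasible([0, 2, 3], topic_of, capacity))  # True
print(is_feasible([0, 1], topic_of, capacity))     # False: two "joins" picks
```

The second call fails because both examples come from the same topic, which is exactly the clustering the constraint is meant to rule out.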

The paper explicitly models all three, dropping the i.i.d.-noise and unconstrained-selection assumptions that limit vanilla approaches.

Stratified Greedy with Provable Guarantees

The authors propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. They prove the objective remains submodular and approximately monotonic on the intrinsic low-dimensional manifold of semantic query embeddings, yielding a constant-factor approximation guarantee.
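The following is a rough sketch of what a greedy selector of this kind could look like, not the paper's algorithm. It assumes the objective is the standard Gaussian-process information gain with per-example noise variances, 0.5 * logdet(I + D^(-1/2) K_SS D^(-1/2)), and it treats "stratified" as meaning per-topic caps; both are assumptions, as are the RBF kernel, the noise values, and every function name below.

```python
import numpy as np


def info_gain(K, noise_var, S):
    """Assumed objective: 0.5 * logdet(I + D^-1/2 K_SS D^-1/2).

    K: (n, n) PSD kernel matrix over candidate examples
    noise_var: (n,) per-example annotation noise variances (heteroscedastic)
    S: list of selected indices
    """
    S = list(S)
    if not S:
        return 0.0
    d = 1.0 / np.sqrt(noise_var[S])
    M = np.eye(len(S)) + (d[:, None] * K[np.ix_(S, S)]) * d[None, :]
    return 0.5 * np.linalg.slogdet(M)[1]


def stratified_greedy(K, noise_var, topics, capacity, budget):
    """Greedily add the feasible example with the largest marginal gain."""
    noise_var = np.asarray(noise_var, dtype=float)
    selected, used, current = [], {}, 0.0
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for i in range(K.shape[0]):
            if i in selected:
                continue
            # Skip examples whose topic stratum is already full.
            if used.get(topics[i], 0) >= capacity.get(topics[i], 0):
                continue
            gain = info_gain(K, noise_var, selected + [i]) - current
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # every remaining stratum is at capacity
            break
        selected.append(best)
        used[topics[best]] = used.get(topics[best], 0) + 1
        current += best_gain
    return selected


# Toy setup: random embeddings, RBF kernel, random noise levels, 3 topics.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist)
noise_var = rng.uniform(0.05, 1.0, size=12)
topics = [i % 3 for i in range(12)]
capacity = {0: 2, 1: 2, 2: 2}

picked = stratified_greedy(K, noise_var, topics, capacity, budget=5)
print(picked, info_gain(K, noise_var, picked))
```

Under this assumed objective, noisier examples contribute less information, so the selector naturally prefers queries that annotators label reliably while the caps keep the picks spread across topics. For reference, plain greedy under a matroid constraint is classically known to achieve a 1/2 approximation for monotone submodular objectives; the constant the paper proves is not stated in this summary.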

More interesting is the spectral bound: when your surrogate kernel diverges from the real data-generating process, the approximation guarantee degrades gracefully rather than catastrophically. That means the method won't silently implode when the embedding space shifts — a practical necessity for production deployments.
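The article does not give the form of the bound, so the sketch below does not reproduce it. It only illustrates the quantities involved: the spectral-norm gap between a surrogate kernel and a reference kernel, and how much an information-gain objective moves for a fixed selection as that gap grows. The RBF kernels, the lengthscales, the objective, and the selected set `S` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, lengthscale):
    """RBF kernel matrix; stands in for both the 'true' and surrogate kernels."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def info_gain(K, noise_var, S):
    """Assumed objective: 0.5 * logdet(I + D^-1/2 K_SS D^-1/2)."""
    S = list(S)
    d = 1.0 / np.sqrt(noise_var[S])
    M = np.eye(len(S)) + (d[:, None] * K[np.ix_(S, S)]) * d[None, :]
    return 0.5 * np.linalg.slogdet(M)[1]


rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
noise_var = rng.uniform(0.1, 1.0, size=10)
K_true = rbf_kernel(X, lengthscale=1.0)  # hypothetical reference kernel
S = [0, 3, 7]                            # an arbitrary fixed selection

results = []
for ls in (1.0, 1.1, 1.5, 3.0):
    K_surr = rbf_kernel(X, lengthscale=ls)
    # Largest singular value of the difference = spectral-norm gap.
    gap = np.linalg.norm(K_true - K_surr, ord=2)
    diff = abs(info_gain(K_true, noise_var, S) - info_gain(K_surr, noise_var, S))
    results.append((ls, gap, diff))
    print(f"lengthscale={ls:.1f}  spectral gap={gap:.3f}  objective shift={diff:.3f}")
```

Tracking a gap like this against a trusted reference kernel is one way a team might monitor how far a deployed embedding has drifted from the one used at selection time.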

Empirical results show a significant reduction in labeling effort while maintaining high text-to-SQL retrieval accuracy. The abstract gives no specific percentage, but the theoretical scaffolding is strong enough that any engineer running a few-shot retriever pipeline should pay attention.

This work gives practitioners a principled way to spend their annotation budget, and I expect to see it folded into production few-shot retriever pipelines soon.


Source: Robust Active Learning for Few-Shot Example Selection in Text-to-SQL
Domain: arxiv.org
