Outcome Reward Models Beat Heuristic SQL Selection by 4.33% on BIRD

GradeSQL, a scalable Outcome Reward Model framework, outperforms execution-based Best-of-N and Majority Voting on BIRD and Spider, with gains that grow on complex queries.

Tags: outcome reward models, text to sql, gradesql, bird benchmark, spider benchmark, test time verification

Replacing heuristic SQL selection with a learned Outcome Reward Model yields gains of up to 4.33 percentage points on BIRD and 2.10 on Spider.

Why Heuristic Signals Fall Short

Best-of-N sampling and Majority Voting are the default test-time strategies for Text-to-SQL. They rely on execution success or output frequency—heuristics that treat all syntactically valid queries as equals. Semantic correctness? Ignored. The result: high pass rates on simple queries, weak discrimination on complex ones. The authors of GradeSQL saw this gap and decided to treat test-time verification as a learned scoring problem instead.
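To make the failure mode concrete, here is a minimal sketch of execution-based Majority Voting over SQLite. The `employees` table, the question, the candidate queries, and the helper names `execute` and `majority_vote` are all invented for illustration. This is not the paper's evaluation code.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Return a hashable, order-insensitive result, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def majority_vote(conn, candidates):
    """Pick the first candidate whose execution result is the most frequent one."""
    results = [execute(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None  # no candidate executed successfully
    winner, _ = counts.most_common(1)[0]
    return candidates[results.index(winner)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ana', 'eng', 120), ('Bo', 'eng', 100), ('Cy', 'ops', 90);
""")

# Hypothetical question: "How many engineers earn more than 100?"
candidates = [
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng'",                    # runs, wrong filter
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng' AND salary >= 100",  # runs, wrong operator
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng' AND salary > 100",   # runs, correct
    "SELECT COUNT(*) FROM employes",                                        # fails: unknown table
]
chosen = majority_vote(conn, candidates)
print(chosen)
```

In this toy setup, two wrong queries happen to agree on the same result, so the vote selects a wrong candidate even though the correct one is in the pool. Frequency and executability say nothing about whether a query answers the question.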

GradeSQL: Automating Verifier Training Without Manual Labels

Outcome Reward Models (ORMs) aren't new—they've been used for test-time scaling and alignment in general LLM tasks. But applying them to structured query generation required a way to generate training data at scale without human annotation. GradeSQL solves this with an automated pipeline: candidate SQL generation paired with execution-based labeling. The framework trains a task-specific ORM that scores each candidate query by its semantic alignment with the natural language question, not just whether it runs.
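The labeling step can be sketched as follows. The summary above only says that candidates are labeled by execution. Comparing each candidate's result set against a reference (gold) query's result is an assumption made here, as are the function names, the record layout, and the toy database.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def label_candidates(conn, question, gold_sql, candidates):
    """Label a candidate 1 if it runs and matches the gold result, else 0.

    Assumption: a gold query is available for each training question.
    """
    gold = run(conn, gold_sql)
    examples = []
    for sql in candidates:
        result = run(conn, sql)
        label = int(result is not None and result == gold)
        examples.append({"question": question, "sql": sql, "label": label})
    return examples

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ana', 'eng', 120), ('Bo', 'eng', 100), ('Cy', 'ops', 90);
""")

examples = label_candidates(
    conn,
    question="How many engineers earn more than 100?",
    gold_sql="SELECT COUNT(*) FROM employees WHERE dept = 'eng' AND salary > 100",
    candidates=[
        "SELECT COUNT(*) FROM employees WHERE dept = 'eng'",
        "SELECT COUNT(name) FROM employees WHERE salary > 100 AND dept = 'eng'",
        "SELECT COUNT(*) FROM employes",
    ],
)
labels = [ex["label"] for ex in examples]
print(labels)
```

Records of this shape (question, candidate SQL, binary label) are the kind of data a scoring model could be trained on without any human annotating individual queries.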

Results That Scale with Complexity

On BIRD, ORM-based selection beats execution-based Best-of-N by 4.33 absolute points; on Spider, the gap is 2.10 points. More importantly, the improvement grows as the candidate set expands and as query complexity increases. That's the opposite of heuristic methods, which plateau quickly. The team tested across multiple open-source LLM families—no proprietary models required. Code, datasets, and the trained models are public.
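At inference time, ORM-based selection reduces to scoring every candidate and keeping the highest-scoring one. The sketch below assumes a scoring callable `score_fn(question, sql)` that stands in for a trained ORM. The `toy_score` function and its numbers are made up purely so the example runs and do not reflect any real model output.

```python
from typing import Callable, Optional, Sequence

def select_with_orm(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> Optional[str]:
    """Return the candidate with the highest score, or None for an empty pool."""
    if not candidates:
        return None
    return max(candidates, key=lambda sql: score_fn(question, sql))

# Hypothetical stand-in for a trained ORM: fixed scores chosen for this example only.
FAKE_SCORES = {
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng'": 0.21,
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng' AND salary >= 100": 0.35,
    "SELECT COUNT(*) FROM employees WHERE dept = 'eng' AND salary > 100": 0.92,
}

def toy_score(question: str, sql: str) -> float:
    return FAKE_SCORES.get(sql, 0.0)

question = "How many engineers earn more than 100?"
candidates = list(FAKE_SCORES)
best = select_with_orm(question, candidates, toy_score)
print(best)
```

Because the selector only needs a score per candidate, a larger candidate pool can only help if the scorer ranks a correct query above the rest, which is the property the reported scaling results depend on.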

Test-time verification via ORMs isn't a theoretical toy. It's a drop-in replacement for your current Best-of-N sampler, and it gets better exactly where you need it most.

Source: Test-Time Verification for Text-to-SQL via Outcome Reward Models
Domain: arxiv.org
