Outcome Reward Models Outperform Heuristic SQL Selection by up to 4.33%

GradeSQL is a scalable Outcome Reward Model framework that beats execution-based Best-of-N and Majority Voting on BIRD and Spider, with gains that grow on complex queries.

outcome reward models, text to sql, gradesql, bird benchmark, spider benchmark, test time verification

Gains of up to +4.33% on BIRD and +2.10% on Spider are what you get when you replace heuristic SQL selection with a learned Outcome Reward Model.

Why Heuristic Signals Fall Short

Best-of-N sampling and Majority Voting are the default test-time strategies for Text-to-SQL. They rely on execution success or output frequency—heuristics that treat all syntactically valid queries as equals. Semantic correctness? Ignored. The result: high pass rates on simple queries, weak discrimination on complex ones. The authors of GradeSQL saw this gap and decided to treat test-time verification as a learned scoring problem instead.
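To make the limitation concrete, here is a minimal sketch of both heuristics against an in-memory SQLite database. It is an illustration under stated assumptions, not the paper's code: the table, the question, the candidate queries, and the helper names (`execute`, `execution_best_of_n`, `majority_vote`) are all hypothetical, and "execution-based Best-of-N" is read here in its simplest form, as "take the first candidate that runs".

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Return an order-insensitive, hashable result, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def execution_best_of_n(conn, candidates):
    """Assumed simplest reading: pick the first candidate that executes."""
    for sql in candidates:
        if execute(conn, sql) is not None:
            return sql
    return None


def majority_vote(conn, candidates):
    """Pick the first candidate whose result is the most frequent one."""
    results = [execute(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    top_result, _ = counts.most_common(1)[0]
    return next(sql for sql, r in zip(candidates, results) if r == top_result)


# Hypothetical toy database and question:
# "What is the total amount ordered by ann?"
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'ann', 40.0);
""")

candidates = [
    "SELECT SUM(amount) FROM orders",                         # runs, ignores the filter
    "SELECT SUM(amount) FROM orders WHERE customer = 'ann'",  # runs, answers the question
    "SELECT SUM(amount) FROM orders",                         # same wrong query again
    "SELECT SUM(amout) FROM orders WHERE customer = 'ann'",   # misspelled column, fails
]

print(execution_best_of_n(conn, candidates))
print(majority_vote(conn, candidates))
```

In this toy case both heuristics return the unfiltered query: it runs, and its result is the most common one, yet it does not answer the question. Neither signal looks at what the question actually asked.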

GradeSQL: Automating Verifier Training Without Manual Labels

Outcome Reward Models (ORMs) aren't new—they've been used for test-time scaling and alignment in general LLM tasks. But applying them to structured query generation required a way to generate training data at scale without human annotation. GradeSQL solves this with an automated pipeline: candidate SQL generation paired with execution-based labeling. The framework trains a task-specific ORM that scores each candidate query by its semantic alignment with the natural language question, not just whether it runs.
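The labeling step can be sketched roughly as follows. This is one plausible reading of "execution-based labeling", assuming each candidate is marked correct when its result matches the result of a reference query; the function names, the toy schema, and the output record format are illustrative assumptions, not GradeSQL's actual pipeline.

```python
import sqlite3


def run(conn, sql):
    """Execute a query; return rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def label_candidates(conn, question, reference_sql, candidates):
    """Build (question, sql, label) training records without human annotation.

    Assumption: label is 1 when the candidate's result equals the reference
    query's result, and 0 otherwise (including queries that fail to run).
    """
    reference = run(conn, reference_sql)
    examples = []
    for sql in candidates:
        result = run(conn, sql)
        label = int(result is not None and result == reference)
        examples.append({"question": question, "sql": sql, "label": label})
    return examples


# Hypothetical toy database, question, and sampled candidates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'ann', 40.0);
""")

question = "What is the total amount ordered by ann?"
reference_sql = "SELECT SUM(amount) FROM orders WHERE customer = 'ann'"
candidates = [
    "SELECT SUM(amount) FROM orders",                            # wrong result
    "SELECT SUM(amount) FROM orders WHERE customer IN ('ann')",  # different text, same result
    "SELECT SUM(amout) FROM orders WHERE customer = 'ann'",      # fails to execute
]

for example in label_candidates(conn, question, reference_sql, candidates):
    print(example["label"], example["sql"])
```

Records like these pair a question and a candidate query with a binary label, which is the shape of data a scoring model can be trained on.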

Results That Scale with Complexity

On BIRD, ORM-based selection beats execution-based Best-of-N by up to 4.33 absolute points; on Spider, the gap is up to 2.10 points. More importantly, the improvement grows as the candidate set expands and as query complexity increases. That's the opposite of heuristic methods, which plateau quickly. The team tested across multiple open-source LLM families, with no proprietary models required. Code, datasets, and the trained models are public.

Test-time verification via ORMs isn't a theoretical toy. It's a drop-in replacement for your current Best-of-N sampler, and it gets better exactly where you need it most.
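As a rough picture of what "drop-in" could mean in practice, the selection step reduces to taking the highest-scoring candidate. The sketch below assumes a scorer with the signature `score(question, sql) -> float`; `select_with_orm` and `toy_score` are hypothetical names, and `toy_score` is a hard-coded placeholder standing in for a trained ORM, not the GradeSQL model.

```python
from typing import Callable, Sequence


def select_with_orm(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate the scorer rates highest for this question."""
    if not candidates:
        raise ValueError("need at least one candidate query")
    return max(candidates, key=lambda sql: score(question, sql))


def toy_score(question: str, sql: str) -> float:
    """Placeholder scorer (assumption): a real ORM would be a trained model.

    It favors queries that filter on the customer named in the question.
    """
    return 0.9 if "WHERE customer = 'ann'" in sql else 0.2


question = "What is the total amount ordered by ann?"
candidates = [
    "SELECT SUM(amount) FROM orders",
    "SELECT SUM(amount) FROM orders WHERE customer = 'ann'",
    "SELECT SUM(amount) FROM orders",
]

print(select_with_orm(question, candidates, toy_score))
```

Because the selector only needs a list of candidates and a scoring callable, it can sit where a Best-of-N or Majority Voting step sits today; the quality of the choice then depends entirely on the scorer.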

Source: Test-Time Verification for Text-to-SQL via Outcome Reward Models
Domain: arxiv.org
