
Probabilistic Programs Teach LLMs True Uncertainty

Fine-tuning on 10,000 probabilistic program scenarios improves calibration on inductive reasoning beyond what post-hoc temperature scaling can achieve.

program based posterior training, large language models, inductive reasoning, probabilistic programming, calibration, fine tuning

Fine-tuning on probabilistic programs delivers calibration gains that post-hoc temperature scaling cannot match. That's the headline from a new approach called Program-based Posterior Training (PPT).

Why Inductive Reasoning Is a Different Beast

Standard LLM post-training targets deductive tasks like math and coding. Correctness is binary, verifiable. Real-world reasoning is inductive: agents draw uncertain beliefs from sparse, ambiguous observations. Curating labeled datasets for that is a nightmare, and the targets themselves are distributional—a single right answer doesn't capture the uncertainty.

PPT Generates Its Own Training Data

The authors built PPT to sidestep those curation problems. An LLM generates diverse open-world scenarios as probabilistic programs. Probabilistic inference over those programs then produces distributional target responses (soft labels), and the LLM is fine-tuned on those targets. They used 10,000 programmatically generated scenarios for fine-tuning.
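
To make the pipeline concrete, here is a minimal sketch of how a probabilistic program can yield a soft label. It is an illustration only: the coin scenario, the discrete prior, the enumeration-based inference, and every function name below are assumptions, not details taken from the paper, which this summary does not spell out.

```python
import math

# Hypothetical toy scenario: a coin whose bias is one of three values,
# each equally likely before any flips are observed.
PRIOR = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}


def posterior_over_bias(flips, prior=PRIOR):
    """Exact inference by enumeration: P(bias | flips) is proportional to
    P(flips | bias) * P(bias)."""
    heads = sum(flips)
    tails = len(flips) - heads
    unnormalized = {
        bias: p * (bias ** heads) * ((1 - bias) ** tails)
        for bias, p in prior.items()
    }
    z = sum(unnormalized.values())
    return {bias: w / z for bias, w in unnormalized.items()}


def soft_label(flips):
    """Distributional target: posterior predictive over the next flip."""
    posterior = posterior_over_bias(flips)
    p_heads = sum(bias * w for bias, w in posterior.items())
    return {"heads": p_heads, "tails": 1 - p_heads}


def soft_cross_entropy(target, predicted):
    """Loss a fine-tuning step could minimize against a soft label
    (assumed objective; the paper's loss is not stated here)."""
    return -sum(t * math.log(predicted[k]) for k, t in target.items() if t > 0)


# Three heads and one tail: the target is a probability, not a single answer.
target = soft_label([1, 1, 0, 1])
```

The point of the sketch is that the training target is the whole distribution returned by `soft_label`, so a model that answers "heads" with certainty is penalized even when heads is the more likely outcome.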

Evaluation hit three axes: held-out motifs, human-labeled judgments, and external benchmarks for estimation and calibration. PPT substantially improved estimation accuracy on the held-out inductive tasks, increased alignment with human judgments, and transferred to those external benchmarks.

Calibration That Sticks

The key result: gains in raw calibration are not subsumed by post-hoc temperature scaling. That means fine-tuning did more than a global rescaling of output logits could: the model's confidence improved in ways a single temperature parameter cannot reproduce. This is the kind of calibration that matters when an LLM needs to say "I'm 70% sure" and actually be right 70% of the time on inductive problems it's never seen.
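
For readers unfamiliar with the two ideas being compared, the sketch below shows temperature scaling and a binned expected calibration error (ECE) for the simplified binary case. The function names, the binary simplification, and the choice of ECE as the metric are assumptions for illustration; this summary does not say which calibration metric the authors report.

```python
import math


def temperature_scale(p, temperature):
    """Post-hoc temperature scaling of a binary probability:
    divide the logit by T. Assumes 0 < p < 1 and T > 0."""
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / temperature))


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: weighted average gap between mean stated confidence
    and observed frequency of being right within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_conf - frequency)
    return ece


# Ten answers, seven of them correct.
outcomes = [1] * 7 + [0] * 3
well_calibrated = expected_calibration_error([0.7] * 10, outcomes)
overconfident = expected_calibration_error([0.9] * 10, outcomes)
```

Because `temperature_scale` applies one monotone transform to every prediction, it can pull uniformly overconfident outputs back toward 0.5 but cannot change which questions the model is more or less sure about. Gains that survive it must come from somewhere else.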

These results suggest that mediating training through probabilistic programs is a viable path for post-training LLMs to perform approximate inductive inference. Next step: scaling the approach to richer scenario classes and seeing whether the calibration holds up under distribution shift outside the 10,000-program training set.


Source: Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models
Domain: arxiv.org
