
Adaptive Truncation Sampling Lifts Long-Form Reasoning by 5+ Points at 32K Tokens

A new sampling method called ANTS adapts truncation width based on entropy, yielding up to +10 points on IFBench and +7 on AIME 2025 at 32K generation budgets.

Tags: ANTS, adaptive nucleus truncation sampling, mixture of experts, large language models, reasoning, decoding strategies

Over thousands of decoding steps, tiny changes in the candidate token set compound into wildly different reasoning trajectories. Fixed truncation thresholds can't keep up. That's why Adaptive Nucleus Truncation Sampling (ANTS) matters: it turns a static decoding rule into an active rollout-control mechanism.

Why Fixed Thresholds Fail at Long Budgets

Existing methods such as top-p, min-p, and fixed top-n sigma sampling all rely on a constant cutoff. They work fine for short generations, but over 8K, 16K, or 32K tokens the model's entropy shifts and task difficulty changes, so the candidate pool needs to change with them. A threshold that works for a simple instruction breaks on a multi-step math proof. ANTS addresses this in two steps: it selects a standardized neighborhood around the maximum logit before temperature scaling, then adjusts the truncation width dynamically with an entropy-conditioned controller. It also keeps a no-truncation fallback arm to stabilize training when truncation becomes unsafe.
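To make the "standardized neighborhood around the maximum logit" idea concrete, here is a minimal sketch of plain fixed-width top-n sigma truncation, the static baseline that ANTS builds on. It is not the paper's implementation: the function names, the use of NumPy, and the example logits are all assumptions for illustration. The point it shows is that the keep/drop decision is made on raw logits, so temperature changes the probabilities inside the kept set but not which tokens are kept.

```python
import numpy as np

def top_n_sigma_mask(logits, n):
    """Keep tokens whose raw logit is within n standard deviations of the max.

    Hypothetical helper: a fixed-width cutoff, not the adaptive ANTS rule.
    """
    logits = np.asarray(logits, dtype=np.float64)
    threshold = logits.max() - n * logits.std()
    return logits >= threshold

def truncated_probs(logits, n, temperature=1.0):
    """Apply the mask on raw logits first, then temperature, then softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = top_n_sigma_mask(logits, n)            # decided before temperature scaling
    scaled = np.where(mask, logits / temperature, -np.inf)
    scaled = scaled - scaled.max()                # numerical stability
    probs = np.exp(scaled)                        # masked tokens become exactly 0
    return probs / probs.sum()

# Illustrative logits only; not taken from the paper.
example_logits = [5.0, 4.5, 1.0, 0.0, -1.0]
cold = truncated_probs(example_logits, n=1.0, temperature=0.5)
hot = truncated_probs(example_logits, n=1.0, temperature=2.0)
```

With these example logits and n = 1.0, only the first two tokens survive at either temperature. A fixed n like this is exactly the constant cutoff the article argues breaks down at long budgets.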

The Numbers That Matter: Code, Math, and Instruction Following

Applied to a 33B-total / 4B-active sparse Mixture-of-Experts reasoning model, ANTS delivers concrete gains: +1.9 points at an 8K budget, +3.8 at 16K, and +5.2 at 32K. Instruction following sees the biggest jump: IFBench improves by over 10 points at 32K. AIME 2025, a hard math benchmark, gains 7 points. Code generation reveals a budget interaction: on Codeforces, ANTS trails the baseline at 8K but reverses that gap and substantially improves Elo at 16K and 32K. That tells me the method isn't a free lunch at every scale, but it scales better where it matters most.

ANTS Makes Sampler Design a Scaling Strategy, Not a Hyperparameter

Here's what I find interesting: ANTS doesn't just apply a fixed top-n sigma cutoff. It uses an entropy-based controller that widens or narrows the truncation window based on the model's uncertainty. That's a fundamentally different approach: it treats sampler design as part of the reasoning pipeline rather than a static knob you tune once and forget. The paper argues that sampler design should be part of how we stabilize and scale long-budget reasoning. I agree. When a single decoding decision can cascade across 32K tokens, you need a controller that adapts, not a fixed heuristic.
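The article does not spell out the controller's exact form, so the following is only a toy sketch of what an entropy-conditioned width rule with a no-truncation fallback could look like. Everything here is an assumption: the linear mapping, the direction (higher entropy gives a wider window), the names `choose_width`, `n_min`, `n_max`, and `fallback_above`, and their default values. None of it comes from the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(vocab size)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def choose_width(probs, n_min=0.5, n_max=2.0, fallback_above=0.95):
    """Pick a truncation width n from the model's current uncertainty.

    Hypothetical rule: interpolate linearly between n_min and n_max,
    and return None (meaning: do not truncate) when entropy is so high
    that cutting the candidate set looks unsafe.
    """
    h = normalized_entropy(probs)
    if h > fallback_above:
        return None                      # no-truncation fallback arm
    return n_min + (n_max - n_min) * h   # wider window when more uncertain

# Illustrative distributions only.
confident = choose_width([0.9, 0.05, 0.03, 0.02])
uncertain = choose_width([0.5, 0.3, 0.1, 0.1])
flat = choose_width([0.25, 0.25, 0.25, 0.25])
```

Under these assumptions the confident distribution gets a narrower window than the uncertain one, and the flat distribution triggers the fallback. The design choice worth noticing is that the width is recomputed at every step from the current distribution, which is what separates a controller from a hyperparameter set once per run.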

The results show that this adaptive approach works, especially for instruction following and math where precision matters. Code generation takes longer to kick in, but once the budget is high enough, the gap flips. Treating sampler design as a first-class part of reasoning architecture, not just a decoding hyperparameter, is how we'll push past the next token limit.


Source: Adaptive Nucleus Truncation for Long-Form Reasoning
Domain: arxiv.org
