Over thousands of decoding steps, tiny changes in the candidate token set compound into wildly different reasoning trajectories. Fixed truncation thresholds can't keep up. That's why Adaptive Nucleus Truncation Sampling (ANTS) matters: it turns a static decoding rule into an active rollout-control mechanism.
Why Fixed Thresholds Fail at Long Budgets
Existing methods like top-p, min-p, and fixed top-n sigma sampling all rely on a constant cutoff. They work fine for short generations, but at 8K, 16K, or 32K tokens, the model's entropy shifts and task difficulty changes mid-generation, so the candidate pool has to adapt with them. A threshold that works for a simple instruction breaks on a multi-step math proof. ANTS addresses this by selecting standardized neighborhoods around the maximum logit before temperature scaling, then dynamically adjusting the truncation width using an entropy-conditioned controller. It also keeps a no-truncation fallback arm to stabilize training when truncation becomes unsafe.
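To make the "standardized neighborhood around the maximum logit" concrete, here is a minimal sketch of top-n sigma style truncation, not the paper's implementation. It assumes the neighborhood is "every token whose raw logit lies within n standard deviations of the maximum", computes the mask before temperature scaling, and treats `n=None` as a stand-in for the no-truncation fallback arm. The function name and parameters are hypothetical.

```python
import numpy as np

def top_n_sigma_probs(logits, n, temperature=1.0):
    """Return (probs, kept_mask) after top-n-sigma style truncation.

    Assumption: a token is kept when its raw logit is within n standard
    deviations of the maximum logit. n=None disables truncation
    (an illustrative stand-in for a no-truncation fallback arm).
    """
    logits = np.asarray(logits, dtype=np.float64)

    if n is None:
        kept = np.ones_like(logits, dtype=bool)
    else:
        # The mask is computed on raw logits, before temperature scaling,
        # so the candidate set does not depend on the temperature.
        threshold = logits.max() - n * logits.std()
        kept = logits >= threshold

    # Temperature applies only to the surviving tokens.
    scaled = np.where(kept, logits / temperature, -np.inf)
    scaled = scaled - scaled.max()  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum(), kept

# Usage: the mask stays the same even at a high temperature.
logits = [5.0, 4.5, 1.0, 0.0, -2.0]
probs, kept = top_n_sigma_probs(logits, n=1.0, temperature=1.5)
```

Because the mask is fixed before scaling, raising the temperature flattens the distribution over the kept tokens without letting low-logit tokens back into the pool.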
The Numbers That Matter: Code, Math, and Instruction Following
Applied to a 33B-total / 4B-active sparse Mixture-of-Experts reasoning model, ANTS delivers concrete gains: +1.9 points at an 8K budget, +3.8 at 16K, and +5.2 at 32K. Instruction following sees the biggest jump: IFBench improves by over 10 points at 32K. AIME 2025, a hard math benchmark, gains 7 points. Code generation reveals a budget interaction: on Codeforces, ANTS trails the baseline at 8K but reverses that gap and substantially improves Elo at 16K and 32K. That tells me the method isn't a free lunch at every scale, but its advantage grows with the budget, which is where it matters most.
ANTS Makes Sampler Design a Scaling Strategy, Not a Hyperparameter
Here's what I find interesting: ANTS doesn't just apply a fixed top-n sigma cutoff. It uses an entropy-based controller that widens or narrows the truncation window based on the model's uncertainty. That's a fundamentally different approach: it treats sampler design as part of the reasoning pipeline rather than a static knob you tune once and forget. The paper argues that sampler design should be part of how we stabilize and scale long-budget reasoning. I agree. When a single decoding decision can cascade across 32K tokens, you need a controller that adapts, not a fixed heuristic.
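One way to picture an entropy-conditioned controller is sketched below. The summary above does not spell out the paper's actual control rule, so everything here is an assumption: the linear mapping, the `n_min` and `n_max` bounds, and especially the direction (wider window when the model is more uncertain). Treat it as an illustration of the idea, not as ANTS itself.

```python
import numpy as np

def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), scaled to [0, 1] by log(vocab size)."""
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())  # log-softmax, finite everywhere
    p = np.exp(log_p)
    entropy = -(p * log_p).sum()
    return entropy / np.log(len(logits))

def adaptive_width(logits, n_min=0.5, n_max=2.0):
    """Hypothetical controller: map normalized entropy to a truncation width.

    Assumption: the width interpolates linearly from n_min (confident
    model) to n_max (maximally uncertain model). The real controller
    may use a different shape or even the opposite direction.
    """
    u = normalized_entropy(logits)
    return n_min + (n_max - n_min) * u

# Usage: a peaked distribution gets a narrow window, a flat one a wide window.
narrow = adaptive_width([100.0, 0.0, 0.0, 0.0])
wide = adaptive_width([1.0, 1.0, 1.0, 1.0])
```

The returned width would then feed into whatever truncation rule the sampler uses at that step, so the candidate pool is recomputed per token instead of being fixed for the whole rollout.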
The results show that this adaptive approach works, especially for instruction following and math where precision matters. Code generation takes longer to kick in, but once the budget is high enough, the gap flips. Treating sampler design as a first-class part of reasoning architecture, not just a decoding hyperparameter, is how we'll push past the next token limit.
Source: Adaptive Nucleus Truncation for Long-Form Reasoning
Domain: arxiv.org