Supervising only 1.9% of response tokens is enough to beat full-token SFT on code LLMs. The CODEBLOCK paper, posted on arXiv, shows that fine-tuning can drop the loss on roughly 98% of output tokens without harming generation quality, and that the resulting models actually outperform the standard approach.
The Problem with Uniform Cross-Entropy for Code
Standard supervised fine-tuning applies the same loss to every token in the response. For code, that's wasteful. Most tokens in a generated program (boilerplate, comments, trivial variable names) carry little learning signal. Token-level masking from natural-language SFT doesn't transfer cleanly either. Selecting individual tokens splits syntactically coherent units like function bodies or loop blocks, and code depends on structural completeness and definition-use chains.
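To make the contrast concrete, here is a minimal sketch of uniform versus masked token-level loss. The per-token probabilities and the mask are made-up illustration values, and mean_nll is a hypothetical helper, not anything taken from the paper.

```python
import math


def mean_nll(token_probs, mask=None):
    """Average negative log-likelihood over the supervised tokens.

    token_probs: probability the model assigned to each correct token.
    mask: optional list of 0/1 flags; None means every token is supervised.
    """
    if mask is None:
        mask = [1] * len(token_probs)
    if len(mask) != len(token_probs):
        raise ValueError("mask and token_probs must have the same length")
    supervised = [-math.log(p) for p, m in zip(token_probs, mask) if m]
    if not supervised:
        raise ValueError("mask selects no tokens")
    return sum(supervised) / len(supervised)


# Hypothetical values: four easy boilerplate tokens, two harder logic tokens.
probs = [0.99, 0.98, 0.30, 0.99, 0.25, 0.97]
mask = [0, 0, 1, 0, 1, 0]

uniform_loss = mean_nll(probs)      # every token counts equally
masked_loss = mean_nll(probs, mask)  # only the two flagged tokens count
```

In the uniform average, the easy tokens dilute the contribution of the two hard ones; the masked average reflects only the tokens chosen for supervision.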
How CODEBLOCK Picks the Right Bits
CODEBLOCK first filters for high-quality instruction-response pairs. Then it partitions code responses into syntactically coherent coding items: not tokens, not lines, but structure-complete blocks. Each block's utility is estimated by aggregating generalized cross-entropy over its core logic tokens. Blocks are then reranked using data-flow reach and bridge signals, which identify the blocks that propagate or connect important program dependencies. During training, the full response stays as context, but loss is applied only to the selected coding items and informative natural-language tokens.
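The partition-score-select idea can be sketched roughly as follows. This is an illustration under loud assumptions, not the paper's implementation: blocks are simply top-level statements found with Python's ast module, block utility is a plain mean of made-up per-line losses (standing in for generalized cross-entropy over core logic tokens), and the data-flow reranking step is omitted entirely. All function names here are hypothetical.

```python
import ast


def partition_blocks(source):
    """Split Python source into top-level statements.

    Returns (start_line, end_line) spans, 1-indexed and inclusive.
    """
    tree = ast.parse(source)
    return [(node.lineno, node.end_lineno) for node in tree.body]


def select_blocks(blocks, line_losses, keep):
    """Rank blocks by mean per-line loss and keep the `keep` highest."""
    def score(block):
        start, end = block
        span = line_losses[start - 1:end]
        return sum(span) / len(span)

    ranked = sorted(blocks, key=score, reverse=True)
    return sorted(ranked[:keep])  # restore source order


def line_mask(num_lines, selected):
    """Build a 0/1 mask with 1 on every line inside a selected block."""
    mask = [0] * num_lines
    for start, end in selected:
        for i in range(start - 1, end):
            mask[i] = 1
    return mask


SOURCE = '''import math

def area(r):
    return math.pi * r * r

print(area(2.0))
'''

# Hypothetical per-line losses, one per source line (blank lines get 0.0).
LINE_LOSSES = [0.1, 0.0, 0.9, 1.4, 0.0, 0.3]

blocks = partition_blocks(SOURCE)
selected = select_blocks(blocks, LINE_LOSSES, keep=1)
mask = line_mask(len(LINE_LOSSES), selected)
```

Here the function definition is kept whole because selection happens per block, so the mask never cuts through the middle of a statement. The whole response would still be fed to the model as context; the mask only decides where loss is computed.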
Results That Make You Rethink Loss Functions
Across six code-generation benchmarks, CODEBLOCK achieves a stronger average pass@1 than both full-token SFT and competing selection baselines. The kicker: it does this while supervising only 1.9% of response tokens. That means the model learns more useful patterns from less than 2 cents of each dollar of training signal. The gains come from focusing on the structural spine of each program rather than drowning in noise.
Expect this approach to push sparse supervision deeper into code-specific training pipelines. When you can throw out 98% of the loss signal and still come out ahead, the default assumption that every token deserves equal attention looks paper-thin.
Source: CODEBLOCK: Learning to Supervise Code at the Right Granularity
Domain: arxiv.org