
CODEBLOCK Supervises 1.9% of Tokens, Beats Full-Token SFT on Code

By selecting structure-complete code blocks rather than isolated tokens, CODEBLOCK supervises only 1.9% of response tokens while achieving a stronger average pass@1 across six code-generation benchmarks.

codeblock, large language models, code generation, sparse supervision, token selection, data flow

Supervising only 1.9% of response tokens is enough to beat full-token SFT on code generation. The CODEBLOCK paper, posted on arXiv, reports that fine-tuning a code model can drop the loss on roughly 98% of response tokens without harming generation quality, and that doing so actually outperforms the standard approach.

The Problem with Uniform Cross-Entropy for Code

Standard supervised fine-tuning applies the same cross-entropy loss to every token in the response. For code, that's wasteful. Most tokens in a generated program - boilerplate, comments, trivial variable names - carry little learning signal. Token-level selection borrowed from natural-language SFT doesn't transfer cleanly either. Code depends on structural completeness and definition-use chains, so picking individual tokens splits syntactically coherent units such as function bodies or loop blocks.
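To make the contrast concrete, here is a minimal sketch of uniform versus selective token supervision, written in plain Python rather than any training framework. The function name `mean_nll`, the per-token probabilities, and the mask are all illustrative assumptions and are not taken from the paper.

```python
import math


def mean_nll(token_probs, loss_mask=None):
    """Average negative log-likelihood over the supervised tokens.

    token_probs: probability the model assigned to each reference token.
    loss_mask:   1 to supervise a token, 0 to keep it as context only.
                 None means uniform supervision (standard SFT).
    """
    if loss_mask is None:
        loss_mask = [1] * len(token_probs)
    if len(loss_mask) != len(token_probs):
        raise ValueError("mask and probabilities must have the same length")
    supervised = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    if not supervised:
        return 0.0  # nothing selected, so there is no loss to average
    return sum(supervised) / len(supervised)


# Hypothetical probabilities for a six-token response (made-up numbers).
probs = [0.99, 0.98, 0.40, 0.35, 0.99, 0.97]
# Supervise only the two tokens the model is unsure about.
mask = [0, 0, 1, 1, 0, 0]

uniform_loss = mean_nll(probs)          # every token counts equally
selective_loss = mean_nll(probs, mask)  # only masked-in tokens count
print(f"uniform={uniform_loss:.3f} selective={selective_loss:.3f}")
```

In the uniform case the four easy tokens dilute the average, so the two hard tokens contribute only a third of the loss terms. With the mask, they are the whole objective, while the other tokens still appear in the input as context.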

How CODEBLOCK Picks the Right Bits

CODEBLOCK first filters for high-quality instruction-response pairs. It then partitions each code response into syntactically coherent code items - not tokens, not lines, but structure-complete blocks. Each block's utility is estimated by aggregating generalized cross-entropy over its core logic tokens. Blocks are then reranked using data-flow reach and bridge signals, which identify the blocks that propagate or connect important program dependencies. During training, the full response stays in place as context, but loss is applied only to the selected code items and to informative natural-language tokens.
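The block-then-mask idea can be illustrated with a rough sketch. It is not the paper's implementation: it treats each top-level Python statement as a "block", skips the cross-entropy utility step, and replaces the reach and bridge signals with a crude reach proxy (how many later blocks use a name the block defines). The helper names `top_level_blocks`, `reach_scores`, and `line_mask` are hypothetical.

```python
import ast


def top_level_blocks(source):
    """Split source into top-level statements, each syntactically complete."""
    lines = source.splitlines()
    blocks = []
    for node in ast.parse(source).body:
        defined, used = set(), set()
        for sub in ast.walk(node):
            if isinstance(sub, ast.Name):
                target = defined if isinstance(sub.ctx, ast.Store) else used
                target.add(sub.id)
            elif isinstance(sub, (ast.FunctionDef, ast.ClassDef)):
                defined.add(sub.name)
            elif isinstance(sub, ast.alias):  # names bound by import statements
                defined.add((sub.asname or sub.name).split(".")[0])
        blocks.append({
            "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            "span": (node.lineno, node.end_lineno),  # 1-indexed, inclusive
            "defined": defined,
            "used": used,
        })
    return blocks


def reach_scores(blocks):
    """Crude reach proxy: count later blocks using a name this block defines."""
    return [
        sum(1 for later in blocks[i + 1:] if block["defined"] & later["used"])
        for i, block in enumerate(blocks)
    ]


def line_mask(source, blocks, scores, keep=1):
    """Per-line loss mask: 1 for lines inside the `keep` highest-scoring blocks."""
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(source.splitlines())
    for i in ranked[:keep]:
        start, end = blocks[i]["span"]
        for line_no in range(start, end + 1):
            mask[line_no - 1] = 1
    return mask


example = """import math

def area(r):
    return math.pi * r * r

small = area(1.0)
large = area(2.0)
print(small, large)
"""

blocks = top_level_blocks(example)
scores = reach_scores(blocks)
mask = line_mask(example, blocks, scores, keep=1)
print(scores)  # [1, 2, 1, 1, 0]
print(mask)    # [0, 0, 1, 1, 0, 0, 0, 0]
```

Here the `area` definition wins because two later statements depend on it, so only its two lines would receive loss. The mask is per line for readability; a real pipeline would map the selected spans onto tokenizer offsets.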

Results That Make You Rethink Loss Functions

Across six code-generation benchmarks, CODEBLOCK achieves a stronger average pass@1 than both full-token SFT and competing selection baselines. The kicker: it does this while supervising only 1.9% of response tokens. In other words, the model learns more useful patterns from under 2% of the usual training signal. The gains come from focusing on the structural spine of each program rather than drowning in noise.
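For readers unfamiliar with the metric, pass@1 is the fraction of problems solved by a single sampled solution. The sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c pass. The article does not say which variant the paper uses, and the per-problem counts here are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (samples, passing samples) counts for three problems.
results = [(10, 3), (10, 10), (10, 0)]
benchmark_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {benchmark_pass_at_1:.3f}")
```

For k = 1 the estimator reduces to c / n per problem, so the benchmark score is simply the mean pass rate across problems.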

Expect this approach to push sparse supervision deeper into code-specific training pipelines. When you can throw out 98% of the loss signal and still come out ahead, the default assumption that every token deserves equal attention looks paper-thin.


Source: CODEBLOCK: Learning to Supervise Code at the Right Granularity
Domain: arxiv.org
