Segment-Level Rewards Beat Holistic Scores for Multimodal LLM Training

Why GRPO's coarse credit assignment fails on long-form vision-language tasks, and how per-segment reward decomposition consistently improves performance across three benchmarks.

GRPO hands every output a single scalar advantage, and that’s a problem when your output is a paragraph describing three panels of a comic strip. Coarse-grained holistic credit assignment underfits vision-language tasks where the answer is long, grounded, and composed of distinct semantic chunks. SD-GRPO fixes this by decomposing the response into verifiable segments and treating each one as its own reward signal.

The Problem: Holistic Rewards Miss the Details

Standard GRPO samples a group of rollouts for each prompt, scores them, and assigns each rollout one group-relative advantage that applies to its whole sequence. For a multi-panel dense caption or a multi-chart answer, that single scalar averages over good and bad parts, so the reward cannot tell the model which segment drove the outcome. The paper shows that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length: the longer the answer, the noisier the signal.
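A minimal sketch of the misattribution described above, not the paper's code. It assumes the holistic reward is the mean of hypothetical per-segment scores and that the group normalization uses the population standard deviation; `holistic_advantages` and the toy scores are illustrative names and values only.

```python
from statistics import mean, pstdev


def holistic_advantages(rewards, eps=1e-8):
    """One group-normalized scalar advantage per rollout (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-segment correctness for two rollouts of a 3-panel caption.
segment_scores = [
    [1.0, 1.0, 0.0],  # rollout A: panels 1 and 2 right, panel 3 wrong
    [0.0, 0.0, 1.0],  # rollout B: only panel 3 right
]

# Assumption: the holistic reward collapses each rollout to its mean score.
holistic_rewards = [mean(scores) for scores in segment_scores]
advantages = holistic_advantages(holistic_rewards)

for name, adv in zip("AB", advantages):
    print(f"rollout {name}: advantage {adv:+.3f} applied to every segment")
```

Rollout A receives a positive advantage on all three panels, including the panel it got wrong, while rollout B's correct third panel is pushed down along with its two wrong ones.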

How SD-GRPO Decomposes the Advantage Vector

SD-GRPO exploits the natural segmentation of long-form vision-language outputs. Instead of a single advantage per rollout, it computes per-segment rewards, then z-normalizes them across the rollout group to produce a vector of per-segment advantages. This turns the reward signal from a thumbs-up-or-down into a detailed scorecard: the model learns which segment earned the reward. The method is model-agnostic, since it only changes how the advantage is computed.
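The per-segment normalization can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: every rollout is assumed to have the same number of aligned segments, the normalization uses the population standard deviation, and `segment_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def segment_advantages(segment_rewards, eps=1e-8):
    """Z-normalize each segment's reward across the rollout group.

    segment_rewards[i][k] is the reward of segment k in rollout i.
    Returns a matrix of the same shape holding per-segment advantages.
    """
    num_segments = len(segment_rewards[0])
    columns = []
    for k in range(num_segments):
        col = [row[k] for row in segment_rewards]
        mu, sigma = mean(col), pstdev(col)
        columns.append([(r - mu) / (sigma + eps) for r in col])
    # Transpose back so that each row is one rollout again.
    return [list(row) for row in zip(*columns)]


# Hypothetical rewards: 4 rollouts, 3 segments each.
rewards = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
adv = segment_advantages(rewards)
for row in adv:
    print([round(a, 3) for a in row])
```

Each segment is compared only with the same segment in the other rollouts, so a wrong third panel gets a negative advantage even when the rest of the caption is correct.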

Benchmark Results: Controlled and Real-World Gains

On the controlled multi-panel dense-captioning task built from DOCCI, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. On multi-chart long-form VQA from MultiChartQA, the misattribution problem becomes measurable, and SD-GRPO pulls ahead. The most interesting result comes from the real-world scientific figure captioning task on MMSci, where subfigure captions share context. Here, blending holistic and per-segment rewards improves over both pure strategies: per-segment normalization alone isn't enough when segments are semantically entangled.
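One way to picture the blended variant is a convex combination of the rollout-level advantage and the per-segment advantages. The summary does not give the paper's blending rule, so the form below and the weight `alpha` are assumptions made purely for illustration.

```python
def blend_advantages(holistic_adv, segment_adv, alpha=0.5):
    """Blend one scalar advantage per rollout with its per-segment advantages.

    alpha weights the holistic term and (1 - alpha) the per-segment term.
    Both the convex form and alpha are hypothetical, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [
        [alpha * h + (1.0 - alpha) * s for s in seg_row]
        for h, seg_row in zip(holistic_adv, segment_adv)
    ]


# Toy values: 2 rollouts, 3 segments each.
holistic_adv = [1.0, -1.0]
segment_adv = [[1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]

blended = blend_advantages(holistic_adv, segment_adv, alpha=0.5)
print(blended)  # [[1.0, 1.0, 0.0], [-1.0, -1.0, 0.0]]
```

With `alpha=1.0` this reduces to the holistic signal and with `alpha=0.0` to the pure per-segment signal. Values in between let shared context contribute to every segment while still separating good segments from bad ones.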

Easy Integration with Existing Pipelines

SD-GRPO integrates into the Dr. GRPO framework with minimal implementation overhead. The authors confirm it can be applied to any GRPO variant. No architectural changes to the model, no new training loops—just a smarter reward aggregator. Expect this per-segment decomposition pattern to become standard for any multimodal RL training pipeline that needs to assign credit across long, grounded outputs.
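To show why the change can stay local to the advantage computation, here is a hedged sketch of expanding per-segment advantages into the per-token weights that a policy-gradient loss typically consumes. The segment spans, the `token_advantages` helper, and the zero default for tokens outside any segment are all assumptions; the summary does not describe how the paper handles these details.

```python
def token_advantages(num_tokens, segment_spans, segment_adv, default=0.0):
    """Expand per-segment advantages to one value per token.

    segment_spans[k] = (start, end) token indices of segment k, end exclusive.
    Tokens outside every span keep `default` (an assumption of this sketch).
    """
    per_token = [default] * num_tokens
    for (start, end), adv in zip(segment_spans, segment_adv):
        for t in range(start, end):
            per_token[t] = adv
    return per_token


# Hypothetical 8-token response with three segments and one separator token.
spans = [(0, 3), (3, 5), (6, 8)]
adv = [1.0, -0.5, 0.25]
tokens = token_advantages(8, spans, adv)
print(tokens)  # [1.0, 1.0, 1.0, -0.5, -0.5, 0.0, 0.25, 0.25]
```

The resulting list has the same shape as the per-token advantage a holistic pipeline would broadcast, which is why the rest of the training loop can remain unchanged in a sketch like this.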


Source: SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation
Domain: arxiv.org
