Helpfulness Training Damages Compassion More Than Coding Post-Training

Llama 3.1 8B's animal compassion score fell to 35.7% after helpfulness SFT versus 65.2% after coding SFT, a pattern that repeats across RL training and datasets.

Llama 3.1 8B, animal compassion values, post-training, Dolly-15k, Magicoder-110K, GRPO

Helpfulness post-training cuts animal compassion scores by nearly half compared to coding-domain post-training: 35.7% versus 65.2% for SFT, and 18.7% versus 32.0% for GRPO, on the Animal Harm Benchmark (AHB 2.2).
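The relative size of that gap can be checked with a few lines of arithmetic. The sketch below uses only the four AHB 2.2 scores quoted above; the dictionary layout and the `relative_drop` helper are illustrative names, not anything from the paper.

```python
# AHB 2.2 scores in percent, as quoted in the article.
ahb_scores = {
    "SFT": {"helpfulness": 35.7, "coding": 65.2},
    "GRPO": {"helpfulness": 18.7, "coding": 32.0},
}


def relative_drop(helpfulness: float, coding: float) -> float:
    """Percent by which the helpfulness score falls short of the coding score."""
    return (coding - helpfulness) / coding * 100


# Hypothetical helper output: one relative drop per training method.
drops = {
    method: round(relative_drop(**scores), 1)
    for method, scores in ahb_scores.items()
}

for method, drop in drops.items():
    print(f"{method}: helpfulness score is {drop}% lower than coding")
```

Under this reading, the helpfulness score sits roughly 45% below the coding score for SFT and roughly 42% below it for GRPO.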

Compassion vs Coding: Two Domains, Two Fates

Researchers at an unnamed lab (arXiv paper 2606.26102) took a Llama 3.1 8B model that had been mid-trained on compassion-oriented synthetic data. They then ran two post-training pipelines: one focused on helpfulness using Dolly-15k and RLHFlow, the other on coding using Magicoder-110K. Both SFT and GRPO variants were tested. The result was unambiguous: helpfulness training consistently eroded animal compassion values. Coding training, by contrast, mostly preserved them.

Moral Reasoning Shows a Different Pattern

The same model was evaluated on the MORU benchmark (Moral Reasoning Under Uncertainty). On English items, helpfulness post-training degraded general moral reasoning by 25.5 percentage points relative to coding (46.4% vs 71.9%). That gap rivals the compassion effect in size. But switch to multilingual MORU items, and the domain effect all but vanishes: SFT scores are 52.3% vs 51.2%. The compassion effect, however, transfers across languages. Magicoder's AHB gain over the base model was 4.5 times larger on non-English items than on English ones.
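To put the three comparisons side by side, the following sketch computes the coding-minus-helpfulness gap in percentage points for each reported pair. It assumes the multilingual pair is listed in the same order as the English one (helpfulness first, coding second), which the article does not state explicitly; the labels and variable names are illustrative.

```python
# (helpfulness, coding) scores in percent, as quoted in the article.
# Assumption: the multilingual pair follows the same helpfulness-first order.
reported = {
    "AHB 2.2 (SFT)": (35.7, 65.2),
    "MORU English": (46.4, 71.9),
    "MORU multilingual (SFT)": (52.3, 51.2),
}

# Positive gap: coding post-training scored higher than helpfulness.
gaps = {
    label: round(coding - helpfulness, 1)
    for label, (helpfulness, coding) in reported.items()
}

for label, gap in gaps.items():
    print(f"{label}: {gap:+.1f} percentage points")
```

On these numbers the English MORU gap is of the same order as the AHB gap, while the multilingual MORU gap is about one point and, under the ordering assumption, slightly reversed in sign.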

What This Means for Alignment Workflows

This divergence suggests that values baked in during mid-training are encoded more deeply and more cross-lingually than reasoning improvements from domain-specific post-training. For labs investing in value-laden mid-training, the practical takeaway is that coding-domain post-training may be a safer bet for preserving those values than helpfulness-oriented pipelines, without sacrificing general reasoning capability. The next step is to probe whether these findings hold for other value domains (fairness, honesty, harm avoidance) and for larger models.

Source: Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
Domain: arxiv.org
