Helpfulness Training Damages Compassion Values More Than Coding Post-Training

Llama 3.1 8B's animal compassion scores drop to 35.7% after helpfulness SFT vs 65.2% after coding SFT, a pattern that replicates across RL training and two datasets.

Helpfulness post-training leaves animal compassion scores roughly 40 to 45 percent lower than coding-domain post-training does: 35.7% versus 65.2% for SFT, and 18.7% versus 32.0% for GRPO, on the Animal Harm Benchmark (AHB 2.2).

Helpfulness vs Coding: Two Domains, Two Fates

The authors (paper 2606.26102; the lab is not named) started from a Llama 3.1 8B model that had been mid-trained on compassion-oriented synthetic data. They then ran two post-training pipelines: one focused on helpfulness, using Dolly-15k and RLHFlow, and one focused on coding, using Magicoder-110K. Each pipeline was tested in a supervised fine-tuning (SFT) variant and a GRPO (Group Relative Policy Optimization) variant. In both variants, helpfulness training left animal compassion scores well below those of coding training, which mostly preserved the mid-trained values.

Moral Reasoning Shows a Different Pattern

The same model was evaluated on the MORU benchmark (Moral Reasoning Under Uncertainty). On English items, helpfulness post-training scored 25.5 percentage points below coding on general moral reasoning (46.4% vs 71.9%), a gap close in size to the 29.5-point SFT gap on AHB. On multilingual MORU items, however, the domain effect all but vanishes: SFT scores are 52.3% vs 51.2%. The compassion effect, by contrast, does transfer across languages: Magicoder's AHB gain over the base model was 4.5 times larger on non-English items than on English ones.
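The gaps quoted so far can be recomputed from the reported scores. The snippet below is only an arithmetic sketch, not anything from the paper: the dictionary labels and the domain_gap helper are hypothetical names, and it assumes each pair is listed as helpfulness first, coding second, which the summary states for AHB and English MORU but only implies for the multilingual MORU pair.

```python
# Scores in percent, copied from the summary above.
# Assumption: each pair is ordered (helpfulness, coding).
REPORTED = {
    "AHB 2.2, SFT": (35.7, 65.2),
    "AHB 2.2, GRPO": (18.7, 32.0),
    "MORU, English": (46.4, 71.9),
    "MORU, multilingual, SFT": (52.3, 51.2),
}


def domain_gap(helpfulness: float, coding: float) -> tuple[float, float]:
    """Return (gap in percentage points, gap as a fraction of the coding score).

    Positive values mean the coding pipeline scored higher.
    """
    gap_pp = coding - helpfulness
    relative = gap_pp / coding
    return round(gap_pp, 1), round(relative, 3)


if __name__ == "__main__":
    for label, (helpfulness, coding) in REPORTED.items():
        gap_pp, relative = domain_gap(helpfulness, coding)
        print(f"{label}: {gap_pp:+.1f} points, {relative:+.1%} of the coding score")
```

Read this way, the SFT gap on AHB is 29.5 points (about 45% of the coding score), while the multilingual MORU gap is about one point in the opposite direction.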

What This Means for Alignment Workflows

This divergence suggests that values instilled during mid-training are encoded more deeply, and more consistently across languages, than the reasoning differences produced by domain-specific post-training. For labs investing in value-laden mid-training, the practical takeaway is that coding-domain post-training may be a safer bet for preserving those values than helpfulness-oriented pipelines, and on these benchmarks it did not come at the cost of general moral reasoning. The next step is to probe whether these findings hold for other value domains (fairness, honesty, harm avoidance) and for larger models.


Source: Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
Domain: arxiv.org
