LLM-Governed Robots Show Near-2x Moral Calibration Drop for Chinese and Japanese

A new audit framework tests 57,600 decisions across four LLMs and four country-language pairs; calibration for Western-language decisions is nearly twice as strong as for Chinese and Japanese, and prompting alone can't fix it.

llm, social robots, moral machine experiment, cultural bias, arxiv, ai auditing

Moral calibration quality for LLM-governed social robots is nearly twice as strong for Western-language decisions as for Chinese and Japanese, even when you prompt the model in the local language. That's not a bug report; that's the central finding from a 57,600-decision audit of four LLMs across four country-language pairs, just posted on arXiv (2606.28345).

57,600 Moral Trade-Offs in Three Domains

The team behind this work, drawing on more than 8,000 cross-domain social robotics reviews, derived symmetry-controlled scenarios from the Moral Machine Experiment, swapping the classic "whom to spare" for the more immediate "whom to assist first." Those scenarios pit many against few, young against old, and higher against lower status across care, education, and service contexts. The team then ran four LLMs through four prompting regimes (zero-shot, few-shot, contrastive exemplars, reasoning-only) and benchmarked every decision against country-specific Moral Machine Experiment preference gradients.

Ordinal concordance tests measured whether models could actually differentiate cultural contexts. The governance typology they built maps three failure modes: gradient differentiation, directional tendency, and deliberation.
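
To make the idea of an ordinal concordance check concrete, here is a minimal sketch. The source does not say which statistic the authors used, so Kendall's tau-a stands in as an assumption, and every number below is a made-up placeholder rather than a figure from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical strength of one preference (e.g. "assist the young first")
# in four country contexts, ordered the same way in every list.
human_reference = [0.70, 0.55, 0.40, 0.35]  # illustrative, not paper data
model_tracking = [0.90, 0.80, 0.60, 0.50]   # different scale, same ordering
model_reversed = [0.50, 0.60, 0.80, 0.90]   # ordering flipped

tau_tracking = kendall_tau_a(human_reference, model_tracking)
tau_reversed = kendall_tau_a(human_reference, model_reversed)
print(tau_tracking, tau_reversed)
```

A rank-based statistic like this ignores how strong the model's preference is in absolute terms and only asks whether it orders the cultural contexts the way the human reference does.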

Prompting Isn't the Fix You Think It Is

Here's the uncomfortable part. Calibration quality for Western-language decisions came out nearly twice as strong as for Chinese and Japanese. High determinism in majority-first trade-offs systematically flattens cross-cultural gradients: the model doesn't hesitate, it just picks the larger group. Partial sensitivity to age- and status-based norms risks sidelining minorities because the model catches some cues but not others.
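
A small worked example shows why near-deterministic choices flatten a gradient. The country labels and rates below are hypothetical illustrations, not results from the paper, and `gradient_spread` is a helper invented for this sketch.

```python
def gradient_spread(rates_by_country):
    """Range (max - min) of a choice rate across countries.

    A spread near zero means the rate carries almost no
    country-to-country signal.
    """
    values = list(rates_by_country.values())
    return max(values) - min(values)


# Hypothetical share of decisions that favor the larger group.
human_majority_first = {
    "country_a": 0.78,
    "country_b": 0.70,
    "country_c": 0.62,
    "country_d": 0.58,
}
# A model that almost always picks the larger group, whatever the context.
model_majority_first = {
    "country_a": 1.00,
    "country_b": 1.00,
    "country_c": 0.99,
    "country_d": 1.00,
}

human_spread = gradient_spread(human_majority_first)
model_spread = gradient_spread(model_majority_first)
print(f"human spread: {human_spread:.2f}, model spread: {model_spread:.2f}")
```

When the rate sits near its ceiling in every context, there is almost no room left for it to vary between cultures, which is the sense in which determinism flattens the gradient.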

Prompting effects are uneven. Only contrastive exemplars produced consistent gains; reasoning-only prompts sometimes made gradient tracking worse. That directly contradicts the common engineering reflex of "just add a system prompt."

What This Means for Shipping Robots

If you're deploying an LLM-governed robot in Tokyo or Shanghai that decides who gets elevator priority or assistance first, you can't rely on a few English-authored safety prompts. The authors argue that model-level factors—training data composition, alignment methodology—are a more robust lever than prompt engineering. Their audit framework is designed as a pre-deployment gate, not a post-hoc analysis tool.
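
As a rough picture of what a pre-deployment gate could look like in practice, consider the sketch below. It is not the authors' framework: `AuditResult`, `deployment_gate`, the locale labels, and the 0.5 threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    locale: str         # hypothetical label, e.g. "ja-JP"
    concordance: float  # hypothetical audit score for that locale


def deployment_gate(results, target_locales, min_concordance=0.5):
    """Return (passed, failing_locales) for the locales a robot will ship to.

    A locale fails if it was never audited or if its score falls below
    the threshold. The 0.5 default is an arbitrary placeholder.
    """
    scores = {r.locale: r.concordance for r in results}
    failing = [
        locale
        for locale in target_locales
        if scores.get(locale) is None or scores[locale] < min_concordance
    ]
    return (not failing, failing)


# Made-up audit scores, for illustration only.
audit = [AuditResult("en-US", 0.8), AuditResult("ja-JP", 0.4)]

passed, failing = deployment_gate(audit, ["ja-JP", "zh-CN"])
print(passed, failing)
```

The point of the design is that a strong score in one locale never substitutes for another: each target market must pass on its own audit, and an unaudited locale blocks the release.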

The next step is scaling these culture-specific moral gradients to more languages and real-world embodied setups, because the gap between a simulated trade-off and a robot actually turning away an elderly person is dangerously small.


Source: Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients
Domain: arxiv.org
