
False Beliefs: LLM Mentalizing Emerges Late, Crumbles Under 'Thinks'

Above-chance false belief task performance depends on model size and training volume, emerges late in pretraining, and is most improved by post-training (SFT, DPO) - but remains fragile to non-factive verbs.

olmo2, pythia, false belief task, mentalizing, situation modeling, large language models

Olmo2 13b's ability to correctly attribute false beliefs to agents emerges only after substantial pretraining, and even then a single non-factive verb like "thinks" can derail it. That fragility should worry anyone relying on LLMs for theory-of-mind tasks.

False Belief Performance: Late, Size-Dependent, and Fragile

The team behind this study, all from the University of Edinburgh and the Alan Turing Institute, tested multiple checkpoints of the Olmo2 and Pythia model suites on false belief tasks (FBT). Above-chance performance appears only once a model has both enough parameters and enough training data behind it. Even then, the behavior is fragile: consistent with prior work, swapping the verb from a factive (e.g., "sees") to a non-factive (e.g., "thinks") increases false belief attributions in the True Belief condition, where the agent's belief should match reality. That suggests the model is pattern-matching on verb semantics rather than building a coherent mental model of the agent's knowledge.
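To make "above chance" concrete: with two candidate locations, chance is 50%. Below is a minimal sketch of how one such item could be scored, assuming a hypothetical `logprob(prompt, completion)` scorer. The `Item` type, the two stories, and the toy `recency_scorer` are all illustrative assumptions, not the study's stimuli or code.

```python
from typing import Callable, List, NamedTuple


class Item(NamedTuple):
    prompt: str    # story, cut off just before the location word
    believed: str  # location the agent should believe the object is in
    other: str     # the competing location


# Hypothetical scorer type: log-probability of `completion` following `prompt`.
LogProb = Callable[[str, str], float]


def is_correct(item: Item, logprob: LogProb) -> bool:
    """An item counts as correct if the believed location outscores the competitor."""
    return logprob(item.prompt, item.believed) > logprob(item.prompt, item.other)


def accuracy(items: List[Item], logprob: LogProb) -> float:
    """Fraction of items answered correctly; 0.5 is chance with two options."""
    return sum(is_correct(it, logprob) for it in items) / len(items)


def recency_scorer(prompt: str, completion: str) -> float:
    """Toy stand-in for a model: prefers whichever location was mentioned last."""
    return float(prompt.rfind(completion))


items = [
    # False belief: Ann did not see the move, so she should look in the box.
    Item("Ann puts the ball in the box. Ann leaves. Bob moves the ball to the bag. "
         "Ann returns and looks for the ball in the", "box", "bag"),
    # True belief: Ann watched the move, so she should look in the bag.
    Item("Ann puts the ball in the box. Bob moves the ball to the bag while Ann "
         "watches. Ann looks for the ball in the", "bag", "box"),
]

print(accuracy(items, recency_scorer))
```

The recency heuristic gets the true belief item right and the false belief item wrong, so it lands exactly at chance on this pair. That is why accuracy has to be read per condition rather than pooled.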

Situation Modeling Precedes Mentalizing but Remains Incoherent

I find this part particularly telling: situation modeling accuracy, meaning the ability to report basic factual properties of a scene, generally exceeds and emerges before FBT accuracy. But when you dig deeper, the situation representations are surprisingly incoherent. The Antagonist agent always knows the item's true location, so the model's answer about the Antagonist's knowledge should not depend on anything else in the story. For Olmo2 13b it does: the answer is consistently influenced both by the Target agent's knowledge state and by the presence of non-factive verbs. That's not a partial mental model; it's a confused one.

Post-Training Helps, But the Core Fragility Persists

Supervised fine-tuning (SFT) and direct preference optimization (DPO) did boost FBT performance, especially on the condition most diagnostic of mentalizing: False Belief, Implicit. That suggests alignment techniques can mask or patch the underlying deficiencies, but the verb-fragility problem remains. The authors argue for developmental and stress-testing approaches when evaluating LLM capabilities, and I agree: standard benchmarks won't catch this kind of brittle reasoning.

What this means for practitioners: if you're building agents that need to model other agents' beliefs, don't trust the raw model. The developmental trajectory shows that even the largest, most-trained models build only partially coherent situation models, and that coherence can collapse under slight linguistic pressure.
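One way to probe for that kind of linguistic pressure is a verb-swap check on a true belief story, where the correct answer should not change with the verb. This is a minimal sketch, assuming a hypothetical `answer(prompt)` callable that wraps whatever model you use. The template, the verb list, and the `toy_model` stand-in are illustrative assumptions, not the paper's materials.

```python
from typing import Callable, Dict, List

# Hypothetical true belief template: Ann watches the move, so the correct
# completion is the final location ("bag") whichever verb is used.
TEMPLATE = (
    "Ann puts the ball in the box. Bob moves the ball to the bag "
    "while Ann watches. Ann {verb} the ball is in the"
)


def verb_sensitivity(
    answer: Callable[[str], str], verbs: List[str], expected: str
) -> Dict[str, bool]:
    """Map each verb to whether the model's answer matches the expected location."""
    return {
        verb: answer(TEMPLATE.format(verb=verb)).strip().lower() == expected
        for verb in verbs
    }


def toy_model(prompt: str) -> str:
    """Toy stand-in that mimics the reported fragility: 'thinks' triggers a false belief answer."""
    return "box" if " thinks " in prompt else "bag"


results = verb_sensitivity(toy_model, ["sees", "knows", "thinks"], expected="bag")
failures = [verb for verb, ok in results.items() if not ok]
print(results)
print("verbs that flipped the answer:", failures)
```

Replace `toy_model` with a call to your own model. Any verb that shows up in `failures` is a sign that the answer tracks wording rather than the agent's actual knowledge.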


Source: Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models
Domain: arxiv.org
