
LLM personalization fails real users: synthetic data hides the gap

Across 550 human conversations, models produced personalized responses that humans did not rate higher than generic ones, even though LLM judges gave them high scores.

large language models, personalization, synthetic data, human evaluation, arxiv

Personalized LLM responses that look great on synthetic benchmarks? Real humans judge them no better than generic output. The authors of a new paper (arXiv:2606.06614) collected 550 human conversations and 5,949 attribute-extraction judgments to find out just how wide the gap is.

Three Stages, Three Failures

The study breaks personalization into three concrete stages: extracting user attributes from conversation, pairing relevant attributes with new prompts, and incorporating those attributes into a response. Across all three, synthetic data hid the rot. At extraction, models consistently struggled to pull accurate attributes from real human dialogue. At relevance pairing, model judgments disagreed with human annotators on 11,919 paired examples. And at response generation (1,101 head-to-head comparisons), humans rated personalized outputs no better than generic ones, yet LLM-based judges confidently scored them higher.
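To make the stage-three gap concrete, here is a minimal sketch of how a head-to-head win rate could be computed separately for human raters and an LLM judge. The vote labels, the `win_rate` helper, and the toy votes are all assumptions for illustration; none of them come from the paper or its dataset.

```python
from collections import Counter


def win_rate(votes):
    """Share of decisive comparisons won by the personalized response.

    Each vote is "personalized", "generic", or "tie"; ties are dropped.
    (Hypothetical label scheme, not the paper's annotation format.)
    """
    counts = Counter(votes)
    decisive = counts["personalized"] + counts["generic"]
    if decisive == 0:
        raise ValueError("no decisive votes to score")
    return counts["personalized"] / decisive


# Made-up toy votes on the same six comparisons (NOT data from the paper).
human_votes = ["personalized", "generic", "tie", "generic", "personalized", "generic"]
judge_votes = ["personalized", "personalized", "personalized", "generic", "personalized", "personalized"]

human_rate = win_rate(human_votes)
judge_rate = win_rate(judge_votes)
gap = judge_rate - human_rate  # positive means the judge favors personalization more than humans do

print(f"human win rate: {human_rate:.2f}")
print(f"judge win rate: {judge_rate:.2f}")
print(f"judge minus human: {gap:.2f}")
```

A human win rate near 0.5 alongside a much higher judge win rate is the pattern the article describes: the judge sees an improvement that human raters do not.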

Lightweight Fixes Hit a Wall

The authors introduced two training-based interventions (lightweight fine-tuning on human judgments) that shifted automated evaluation closer to human data in the first two stages. Good news for extraction and relevance scoring. But stage three refused to budge: learned reward models achieved only modest correlation with human ratings. Modeling what a human actually considers a better personalized response remains fundamentally harder than the synthetic-data crowd assumes.
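One common way to quantify "correlation with human ratings" is Spearman's rank correlation. The sketch below is an assumption about how such a check might look, not the paper's stated metric; the function names and the toy scores are invented for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Made-up toy scores (NOT data from the paper).
human_ratings = [1, 2, 3, 4, 5]
reward_scores = [0.2, 0.9, 0.4, 0.5, 0.7]
rho = spearman(human_ratings, reward_scores)
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 would mean the reward model orders responses the way humans do; a value well below that means the two orderings diverge.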

What This Means for Your Pipeline

If you're building a personalization system on top of an LLM, the numbers from synthetic evals are lying to you. The 550-conversation dataset released with this paper gives you a way to ground your extraction, selection, and generation in what real users actually prefer. Until reward models can reliably capture human taste, trust a human rater over an LLM judge every time.
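Before leaning on an LLM judge in your own pipeline, you can audit it against a small set of human labels. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic; choosing kappa is an assumption on our part rather than something the paper prescribes, and the labels are toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Made-up toy labels for six comparisons (NOT data from the paper).
# "p" = personalized response preferred, "g" = generic response preferred.
human_labels = ["p", "g", "g", "p", "g", "g"]
judge_labels = ["p", "p", "g", "p", "p", "p"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"judge vs. human kappa: {kappa:.2f}")
```

Kappa near 0 means the judge agrees with humans about as often as chance would predict, which is a reason to keep human raters in the loop for that stage.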


Source: Re-Centering Humans in LLM Personalization
Domain: arxiv.org
