Mult-DPO Tames Set-Wise Preferences for LLM Recommenders

Current DPO methods assume pairwise comparisons; real user feedback yields multiple positive items.

mult dpo, direct preference optimization, recommender systems, large language models, plackett luce, llm alignment

Vanilla DPO is pairwise, and real-world recommender data is not: a user clicks five items, not just one against a negative. Extending DPO with a Plackett-Luce model handles multiple positives in principle, but the combinatorial explosion of marginalizing over all orderings of the positives makes that objective intractable for aligning LLMs.

Mult-DPO, introduced in a new arXiv preprint (2606.10078), replaces the Plackett-Luce ranking marginalization with a multinomial surrogate that has a closed-form DPO-style objective. No more enumerating a factorial number of permutations per training example.

Why Pairwise DPO Fails on Real User Feedback

Vanilla DPO uses the Bradley-Terry model, which assumes a single preferred item versus a dispreferred one. Give it a session where a user browsed three products and purchased two, and you'd need to flatten that into arbitrary pairs, losing the set structure. The natural alternative, Plackett-Luce, models a full ranking, but adapting it to set-wise preferences requires marginalizing over every possible ordering of the positives. The number of terms in that expression grows factorially with the number of positive items.
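To make the blow-up concrete, here is a minimal brute-force sketch of that marginalization. It assumes one common reading of "marginalizing over orderings": the probability, under Plackett-Luce, that the positives fill the top ranks in any order. The function name `pl_set_marginal` and the toy weights are hypothetical, not taken from the paper or its repository.

```python
from itertools import permutations
from math import factorial


def pl_set_marginal(pos_weights, neg_weights):
    """Plackett-Luce probability that the positives occupy the top
    len(pos_weights) ranks, summed over every ordering of the positives.

    Returns (probability, number_of_orderings_enumerated).
    """
    total = sum(pos_weights) + sum(neg_weights)
    prob = 0.0
    n_terms = 0
    for order in permutations(pos_weights):
        remaining = total  # weight still available to be picked
        term = 1.0
        for w in order:
            term *= w / remaining  # pick this positive from what is left
            remaining -= w         # then remove it from the pool
        prob += term
        n_terms += 1
    return prob, n_terms


# Toy session: three positives, two negatives (weights are made up).
prob, n_terms = pl_set_marginal([2.0, 1.0, 0.5], [1.0, 1.0])
print(n_terms, factorial(3))  # one term per ordering of the positives
```

Three positives already cost 3! = 6 terms per example; ten positives would cost 10! = 3,628,800, which is why the exact objective does not scale.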

Mult-DPO sidesteps that by building a multinomial distribution over the same reward-induced weight space. It doesn't try to model the ranking; it models which items among a candidate set are positive, treating each positive as a separate class in a multi-class classification objective. The authors prove this multinomial loss is a tractable upper bound on the marginalized Plackett-Luce DPO loss.
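A classification-style objective of this kind might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's exact loss: it assumes the reward for each candidate is the usual DPO log-ratio scaled by `beta`, and that the loss is a softmax cross-entropy over the candidate set summed across the positives. The name `mult_dpo_loss` and its parameters are hypothetical.

```python
import math


def mult_dpo_loss(policy_logps, ref_logps, positive_idx, beta=1.0):
    """Softmax cross-entropy over candidates, summed over positive items.

    policy_logps / ref_logps: per-candidate sequence log-probabilities
    under the trained policy and the frozen reference model.
    positive_idx: indices of the candidates the user interacted with.
    """
    # Assumed DPO-style implicit reward: beta * log(pi_theta / pi_ref).
    scores = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Numerically stable log-sum-exp over the whole candidate set.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    # One cross-entropy term per positive; cost is linear in set size.
    return -sum(scores[i] - log_z for i in positive_idx)


# Five candidates, the first two are positives (toy numbers).
ref = [-4.0, -5.0, -4.5, -6.0, -5.5]
same_as_ref = mult_dpo_loss(ref, ref, positive_idx=[0, 1])
boosted = mult_dpo_loss([-3.0, -4.0, -4.5, -6.0, -5.5], ref, positive_idx=[0, 1])
print(same_as_ref, boosted)
```

With the policy equal to the reference, every score is zero and the loss is 2 · log 5; raising the policy's log-probability on the positives lowers it. No permutation is enumerated at any point.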

Mult-DPO's Clever Trick: A Tractable Upper Bound

The key result is that the multinomial DPO loss upper-bounds the intractable PL DPO loss. Tightness depends on the relative total weight of positives versus negatives — richer or harder negatives tighten the bound. That gives practitioners a clear lever: curate negative samples to get closer to the true ranking objective.
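One way to see why a bound of this kind can hold, as a sketch under assumed notation rather than the paper's own proof: let $w_i = \exp(s_i)$ be the reward-induced weight of candidate $i$, $P$ the set of $K$ positives, $\mathfrak{S}(P)$ the orderings of $P$, and $W$ the total weight of all candidates, positives and negatives together. Every Plackett-Luce denominator is at most $W$, so each ordering's probability is at least the matching product of multinomial probabilities:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{mult}}
  &= -\sum_{p \in P} \log \frac{w_p}{W}, \\
\mathcal{L}_{\mathrm{PL}}
  &= -\log \sum_{\sigma \in \mathfrak{S}(P)} \prod_{k=1}^{K}
     \frac{w_{\sigma_k}}{W - \sum_{j<k} w_{\sigma_j}} \\
  &\le -\log \Bigl( K! \prod_{p \in P} \frac{w_p}{W} \Bigr)
   = \mathcal{L}_{\mathrm{mult}} - \log K!
   \le \mathcal{L}_{\mathrm{mult}}.
\end{aligned}
```

The slack in the first inequality comes only from the mass $\sum_{j<k} w_{\sigma_j}$ removed from the denominators, which never exceeds the total positive weight. When the negatives carry most of $W$, that mass is negligible and the inequality is nearly tight, which is consistent with the point about negative curation. The remaining $\log K!$ offset does not depend on the model parameters. Whether the paper's objective carries that constant is not stated here.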

Because the multinomial objective is classification-style, it integrates naturally with existing LLM training pipelines. The code is already up at github.com/yaochenzhu/Mult_DPO, so teams can start experimenting without waiting for a library release.

What This Means for LLM-Based Recommenders

Recommender systems that use LLMs for candidate generation or ranking now have a principled alignment method that matches their actual feedback signals instead of a forced pairwise approximation. Production teams may find it worth swapping pairwise DPO for Mult-DPO as they move from chat-style preference pairs to multi-item user sessions. The bound analysis alone is worth the read if you're tuning negative sampling strategies.


Source: Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems
Domain: arxiv.org
