FACE, the automatic evaluation method introduced in this thesis, is reported to align closely with human judgments while requiring no reference answers. That's the metric we've been missing for conversational systems whose good responses don't parrot a single ground truth.
Users want direct answers, not link farms. But building a system that actually understands personal context (what you've said before, your preferences, your history) requires solving three specific problems. This thesis tackles all three head-on.
From Hyperlinks to Entity Links in Conversations
First problem: figuring out which entities a user is talking about when pronouns and half-sentences carry the weight. The author introduces ConEL, a dataset for conversational entity linking (EL), and CREL, a method designed specifically for dialogue rather than documents. CREL is built to handle the messy coreference and ellipsis that break traditional EL systems.
LAPS: Personalized Data at Scale Without the Usual Cost
Second problem: generating responses that actually reflect a user's preferences. Most personalization research relies on synthetic data or small-scale human studies. LAPS (LLM-Augmented Personalized Self-Dialogue) addresses this by efficiently constructing large, human-written conversational datasets built around real user preferences. The key insight is that you don't need to record thousands of real conversations: you can build preference elicitation into the data collection pipeline itself.
FACE: Evaluating Conversations Without a Gold Standard
Third problem: how do you judge a system when there's no single correct answer? Traditional metrics like BLEU or ROUGE score n-gram overlap with a reference answer, so they penalize valid responses that are simply worded differently, which is the norm in open-ended dialogue. FACE is reference-free: it scores entire conversations holistically, not turn by turn. The thesis reports that FACE's scores align closely with human judgments, which is the standard we should hold any automatic metric to.
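To make the two ideas above concrete, here is a minimal Python sketch. It is not FACE and not the thesis's evaluation protocol. `unigram_f1` is a simplified, ROUGE-1-style overlap score, and `spearman` is one common way to check how well a metric's scores agree with human ratings. All sentences, scores, and ratings below are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 against a single reference (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation; this simple formula assumes no tied values."""
    def ranks(values: list[float]) -> list[int]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up replies: both are reasonable, but only one echoes the reference wording.
reference = "try the thai place on main street"
echo = "try the thai place on main street tonight"
paraphrase = "since you love spicy food, how about pad kee mao at bangkok garden"

echo_score = unigram_f1(echo, reference)              # high: nearly identical wording
paraphrase_score = unigram_f1(paraphrase, reference)  # zero: no shared tokens

# Made-up metric scores and human ratings for five conversations.
metric_scores = [0.2, 0.9, 0.5, 0.7, 0.4]
human_ratings = [1.0, 5.0, 2.0, 4.0, 3.0]
rho = spearman(metric_scores, human_ratings)

print(f"echo={echo_score:.2f} paraphrase={paraphrase_score:.2f} rho={rho:.2f}")
```

The paraphrased reply scores zero despite being a plausible, personalized answer, which is the failure mode that motivates a reference-free metric. The rank correlation is the kind of check behind a claim like "aligns closely with human judgments": the closer it is to 1, the more the metric orders conversations the way human raters do.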
What This Enables
The real engineering takeaway: these three pieces (ConEL/CREL for entity resolution, LAPS for personalized data, and FACE for evaluation) form a coherent pipeline. Anyone building a conversational system that needs to remember who you are and what you like now has a stack to benchmark against. The next step is seeing how these methods hold up under production latency budgets.
Source: Personalization and Evaluation of Conversational Information Access
Domain: arxiv.org