VCG Solves Extreme Cold-Start Retrieval for E-Commerce Video Feeds

A multimodal retrieval engine built on domain-adapted CLIP achieves a 50% uplift in deep video completion by replacing collaborative filtering with zero-shot visual search.

Tags: video candidate generation, CLIP, e-commerce, multimodal retrieval, cold start, zero-shot learning

The Video Candidate Generation (VCG) system boosts deep video completion by 50% in online A/B tests, solving the extreme cold-start problem that plagues shopping video feeds.

Why Collaborative Filtering Fails on Video Feeds

E-commerce platforms are shifting from static search-driven catalogs to dynamic, immersive video feeds. This transition breaks traditional recommender systems. New short-form videos have zero interaction history, so collaborative filtering can't touch them. Even worse, strong position and duration biases in video feeds distort standard engagement signals, making it hard to tell if a video is actually good or just happened to be at the top of the feed.
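To make the cold-start failure concrete, here is a minimal sketch of item-based collaborative filtering on a toy interaction matrix. The matrix, the `cf_scores` helper, and the scoring rule are illustrative assumptions, not anything taken from the VCG paper. The point is only that a video with no interactions has an all-zero column, so every similarity involving it is zero and it can never be ranked.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical interaction matrix: rows are users, columns are videos.
# 1 means the user watched the video. The last column is a brand-new upload.
interactions = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]

def item_column(matrix, j):
    """Return the interaction vector (one entry per user) for video j."""
    return [row[j] for row in matrix]

def cf_scores(matrix, user_index):
    """Score each video by its summed similarity to the videos the user watched."""
    n_items = len(matrix[0])
    watched = [j for j in range(n_items) if matrix[user_index][j] == 1]
    scores = []
    for j in range(n_items):
        col_j = item_column(matrix, j)
        scores.append(
            sum(cosine(col_j, item_column(matrix, w)) for w in watched if w != j)
        )
    return scores

scores = cf_scores(interactions, user_index=0)
print(scores)  # the new video (index 3) scores exactly 0.0
```

However relevant the new video is, this kind of scorer has no signal to work with until someone watches it.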

VCG Architecture: CLIP-Based Zero-Shot Retrieval

VCG attacks both problems by ditching behavioral history entirely. The team behind VCG built a domain-adapted vision-language model starting from CLIP, mapping users and videos into a shared semantic space. This enables zero-shot retrieval based on visual content alone. The system supports three bi-directional retrieval modes: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.
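As a rough illustration of retrieval in a shared embedding space, the sketch below ranks videos against a product query by cosine similarity. The three-dimensional vectors are hand-written stand-ins for encoder outputs, and the `retrieve` function and item names are hypothetical. The article does not describe VCG's actual encoder, index, or scoring, so treat this as the general technique only.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query_vec, candidates, k=2):
    """Return the names of the k candidates most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for name, vec in candidates.items():
        c = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, c)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy embeddings standing in for the output of a vision-language encoder.
video_embeddings = {
    "video_running_shoes": [0.9, 0.1, 0.0],
    "video_trail_sneakers": [0.8, 0.3, 0.1],
    "video_espresso_machine": [0.0, 0.2, 0.9],
}

# Product-to-Video direction: a product embedding is the query.
product_query = [1.0, 0.2, 0.0]
top_videos = retrieve(product_query, video_embeddings, k=2)
print(top_videos)
```

Because the score depends only on content embeddings, a video uploaded seconds ago is retrievable on the same footing as an old one. Swapping the roles of query and candidates gives the Video-to-Product direction.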

A rigorous evaluation compared generative (LLM) embeddings against discriminative (CLIP) embeddings. Generative models excelled at attribute prediction but suffered from embedding space collapse during retrieval: their vectors clump together and lose discriminative power. Discriminative CLIP embeddings held up much better for the actual retrieval task.
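One common way to spot this kind of collapse is to look at the average pairwise cosine similarity of a set of embeddings: when it sits near 1, almost every item looks like every other item and ranking becomes unreliable. The sketch below uses made-up vectors and a hypothetical `mean_pairwise_cosine` helper. It is not the evaluation protocol from the paper, which the article does not detail.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Made-up embeddings: one well-spread set and one collapsed set.
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.02, 0.01], [1.0, 0.01, 0.02], [1.0, 0.03, 0.0]]

print(mean_pairwise_cosine(spread))     # orthogonal vectors: no shared direction
print(mean_pairwise_cosine(collapsed))  # nearly identical vectors: close to 1
```

In the collapsed set the gaps between similarity scores are tiny, so small amounts of noise can reorder the results, which is what losing discriminative power looks like in practice.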

Results: 50% Uplift and Beyond

Online A/B testing confirmed VCG effectively mitigates engagement biases. The headline result: a 50% uplift in deep video completion, meaning users watched more of the video before scrolling away. That's a strong signal that VCG is surfacing content people actually want to see, not just content that happened to be placed well.

The results suggest that for video feeds, zero-shot retrieval with a well-tuned multimodal model beats waiting for interaction data to accumulate. The team also shows that discriminative CLIP embeddings outperform generative LLM embeddings for retrieval, pointing toward a future where video feeds can recommend products without relying on user history at all.


Source: VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions
Domain: arxiv.org
