When the Models Disagree, Route to a Different Model: Video QA Gains 1.81 Points

Single-model self-consistency still fails on hard video questions; routing the roughly 20% of questions where samples disagree to a second model raises accuracy by 1.43-1.81 points, with Motion & Trajectory questions gaining more than 5 points.

Tags: Gemini 3.1 Pro Preview, Claude Opus 4.8, ImplicitQA, CVPR 2026, video question answering, inference-time routing

Routing questions that stump Gemini 3.1 Pro Preview over to Claude Opus 4.8 lifts video QA accuracy by 1.81 points on the ImplicitQA challenge test set, hitting 82.03 AvgAcc. That's a pure inference-time trick: no labels, no training, just disagreement between three zero-temperature samples.

Why Self-Consistency Fails on Hard Video Questions

Majority voting across repeated samples of the same model is the go-to boost for language models. On ImplicitQA, where the correct answer must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout, it backfires. The authors observed that a single frontier video LLM already operates near its accuracy ceiling, and its errors on hard questions are correlated. Vote three times, get the same wrong answer three times.

The fix: don't fish in the same barrel. Triple-sample Gemini 3.1 Pro Preview at temperature zero - exploiting the genuine sample-to-sample variance in its video processing pipeline - and identify the roughly 20% of questions where the three outputs diverge. Those are the ones that need a different pair of eyes.
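The disagreement check can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each sample has already been reduced to a multiple-choice answer string, and the function names `needs_routing` and `disagreement_rate` are hypothetical.

```python
from typing import Sequence


def needs_routing(samples: Sequence[str]) -> bool:
    """Return True when the repeated samples do not all agree.

    Assumption: each sample is a normalized answer label such as "A" or "B".
    """
    return len(set(samples)) > 1


def disagreement_rate(all_samples: Sequence[Sequence[str]]) -> float:
    """Fraction of questions whose samples diverge (about 0.20 in the article)."""
    flagged = sum(needs_routing(s) for s in all_samples)
    return flagged / len(all_samples)


# Toy data: three samples per question, one question with divergent answers.
toy = [["A", "A", "A"], ["B", "C", "B"], ["D", "D", "D"], ["A", "A", "A"]]
print(disagreement_rate(toy))
```

Only exact agreement across all three samples counts as consensus here; a looser rule, such as accepting a 2-of-3 majority, would route fewer questions and is a different design choice from the one described above.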

The Disagreement-Based Routing Recipe

Route the disagreed-upon subset to Claude Opus 4.8, which consumes uniformly sampled frames with adaptive thinking. No retraining, no fine-tuning, no labels. On the 1001-question validation set with public ground truth, the method improves AvgAcc by +1.43 over the best single sample of the primary model. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model). The gain is consistent across the two independent splits.
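Put together, the recipe amounts to a small router. The sketch below is an assumption-laden illustration rather than the paper's implementation: `primary` and `secondary` are hypothetical callables standing in for the two models, and no real API client is shown.

```python
from typing import Callable, Tuple

# Hypothetical signature: takes a question, returns a normalized answer label.
AnswerFn = Callable[[str], str]


def route_question(
    question: str,
    primary: AnswerFn,
    secondary: AnswerFn,
    n_samples: int = 3,
) -> Tuple[str, str]:
    """Sample the primary model n_samples times; defer to the secondary on disagreement.

    Returns (answer, source), where source is "primary" or "secondary".
    """
    samples = [primary(question) for _ in range(n_samples)]
    if len(set(samples)) == 1:
        # All samples agree, so keep the primary model's answer.
        return samples[0], "primary"
    # Samples diverge, so hand the question to the second model.
    return secondary(question), "secondary"


# Stub models for illustration only.
flaky = iter(["A", "B", "A"])
print(route_question("q1", lambda q: next(flaky), lambda q: "C"))
print(route_question("q2", lambda q: "A", lambda q: "C"))
```

The secondary model is only called for flagged questions, so its cost scales with the disagreement rate rather than with the full question set.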

Where the Gains Come From: Motion, Counting, Spatial Reasoning

Not all categories benefit equally. Motion & Trajectory jumps +5.49, Inferred Counting gains +3.45, Vertical Spatial Reasoning +1.82. These are precisely the categories that depend on resolving references across shots - timing, occlusion, relative positions. The routing strategy doesn't help on simple look-up questions; it kicks in exactly where correlated single-model errors are worst.
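For teams that want to reproduce a per-category breakdown like this on their own data, here is a rough sketch. The article does not define AvgAcc or MacroAvgAcc; the code assumes AvgAcc is accuracy over all questions and MacroAvgAcc is the unweighted mean of per-category accuracies, and the function name `avg_and_macro_acc` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def avg_and_macro_acc(records: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Compute (AvgAcc, MacroAvgAcc) in percent from (category, is_correct) pairs.

    Assumed definitions: AvgAcc pools all questions; MacroAvgAcc averages
    the per-category accuracies with equal weight per category.
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in records:
        per_cat[category][0] += int(is_correct)
        per_cat[category][1] += 1
    total_correct = sum(c for c, _ in per_cat.values())
    total = sum(n for _, n in per_cat.values())
    avg = 100.0 * total_correct / total
    macro = 100.0 * sum(c / n for c, n in per_cat.values()) / len(per_cat)
    return avg, macro


# Toy records, not results from the paper.
toy_records = [("motion", True), ("motion", False), ("counting", True)]
print(avg_and_macro_acc(toy_records))
```

Running this once on the single-model answers and once on the routed answers, then subtracting per category, gives deltas comparable in form to the ones quoted above.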

This is a practical lesson for any team deploying video QA: blind self-consistency can be worse than nothing. Instead, spend your compute budget on a second model for the tough 20%. No training required, just a disagreement check and a smart router.


Source: Disagreement-Based Cross-Model Routing for Implicit Video Question Answering
Domain: arxiv.org
