Routing questions that stump Gemini 3.1 Pro Preview over to Claude Opus 4.8 lifts video QA accuracy by 1.81 points on the ImplicitQA challenge test set, hitting 82.03 AvgAcc. That's a pure inference-time trick: no labels, no training, just disagreement between three zero-temperature samples.
Why Self-Consistency Fails on Hard Video Questions
Majority voting across repeated samples of the same model is the go-to boost for language models. On ImplicitQA, where the correct answer must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout, it backfires. The authors observed that a single frontier video LLM already operates near its accuracy ceiling, and its errors on hard questions are correlated. Vote three times, get the same wrong answer three times.
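The correlation argument can be made concrete with a toy calculation. The sketch below is an illustration rather than anything from the paper: it assumes each sample is simply right or wrong with probability `p`, and compares a three-way majority vote when samples are independent against one where they always share the same outcome. Both function names are hypothetical.

```python
def majority_acc_independent(p: float) -> float:
    """P(at least 2 of 3 independent samples are correct), each correct with prob p."""
    return p**3 + 3 * p**2 * (1 - p)


def majority_acc_fully_correlated(p: float) -> float:
    """If all three samples always share one outcome, voting cannot change the answer."""
    return p


p = 0.8  # toy single-sample accuracy, not a figure from the paper
independent = majority_acc_independent(p)
correlated = majority_acc_fully_correlated(p)
print(f"independent: {independent:.3f}, fully correlated: {correlated:.3f}")
```

Under independence the vote lifts a 0.8 sample to roughly 0.896; under full correlation it stays at 0.8. Real models sit somewhere between these extremes, and the paper's observation is that on hard video questions they sit close to the correlated end.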
The fix: don't fish in the same barrel. Sample Gemini 3.1 Pro Preview three times at temperature zero and flag the roughly 20% of questions where the three outputs diverge. This works because the model's video processing pipeline still shows genuine sample-to-sample variance even at temperature zero, so disagreement is a usable signal of uncertainty. The flagged questions are the ones that need a different pair of eyes.
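The disagreement check itself is simple to sketch. The snippet below is a minimal illustration, assuming each sample is a short answer string such as an option letter; the helper names and the normalization step are assumptions, not details from the paper.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string, e.g. ' b ' -> 'B' (assumed format)."""
    return answer.strip().upper()


def is_disagreement(samples: list[str]) -> bool:
    """True when the repeated samples do not all give the same answer."""
    return len({normalize(s) for s in samples}) > 1


def split_by_agreement(samples_by_question: dict[str, list[str]]):
    """Partition question ids into (agreed, disputed) lists, preserving order."""
    agreed, disputed = [], []
    for qid, samples in samples_by_question.items():
        (disputed if is_disagreement(samples) else agreed).append(qid)
    return agreed, disputed


# Toy data: three samples per question from the primary model.
demo = {
    "q1": ["A", "A", "A"],
    "q2": ["B", "b", "B "],  # same answer after normalization
    "q3": ["C", "A", "C"],   # samples diverge, so this one is disputed
}
agreed, disputed = split_by_agreement(demo)
print(agreed, disputed)
```

Note that any divergence counts, even a 2-to-1 split: the check asks whether the samples are unanimous, not which answer wins a vote.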
The Disagreement-Based Routing Recipe
Send the disputed subset to Claude Opus 4.8, which consumes uniformly sampled frames with adaptive thinking. No retraining, no fine-tuning, no labels. On the 1001-question validation set with public ground truth, the method improves AvgAcc by 1.43 points over the best single sample of the primary model. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc, a gain of 1.81 points over the best single sample of the primary model. The improvement holds on both independent splits.
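A minimal sketch of the routing step follows. It assumes both models are wrapped as plain callables that take a question id and return an answer string; `route`, `make_stub`, and the stub models are hypothetical stand-ins, and no real model API is called here.

```python
import itertools
from typing import Callable

Answerer = Callable[[str], str]  # question id -> answer string (assumed interface)


def route(qid: str, primary: Answerer, secondary: Answerer,
          n_samples: int = 3) -> tuple[str, str]:
    """Return (answer, source). Keep the primary answer only if all samples agree."""
    samples = [primary(qid).strip().upper() for _ in range(n_samples)]
    if len(set(samples)) == 1:
        return samples[0], "primary"
    # Samples diverge: hand the question to the second model instead of voting.
    return secondary(qid).strip().upper(), "secondary"


def make_stub(answers: list[str]) -> Answerer:
    """Stub model that cycles through canned answers, ignoring the question."""
    cycle = itertools.cycle(answers)
    return lambda qid: next(cycle)


stable = route("q1", make_stub(["A"]), make_stub(["D"]))
flaky = route("q2", make_stub(["A", "B", "A"]), make_stub(["C"]))
print(stable, flaky)
```

The design choice worth noting is that a 2-to-1 split is not resolved by majority vote. The secondary model's answer replaces the primary's outright, which is what makes this cross-model routing rather than self-consistency.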
Where the Gains Come From: Motion, Counting, Spatial Reasoning
Not all categories benefit equally. Motion & Trajectory jumps by 5.49 points, Inferred Counting gains 3.45, and Vertical Spatial Reasoning gains 1.82. These are precisely the categories that depend on resolving references across shots: timing, occlusion, relative positions. The routing strategy doesn't help on simple look-up questions; it kicks in exactly where correlated single-model errors are worst.
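Per-category gains like these can be computed from per-question correctness records. The sketch below assumes AvgAcc is overall accuracy across all questions and MacroAvgAcc is the unweighted mean of per-category accuracies; the source does not define either metric, so treat both definitions as assumptions. The records are toy data, not results from the paper.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """records: iterable of (category, is_correct) -> {category: accuracy in %}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


def avg_and_macro(records):
    """Return (overall accuracy, mean of per-category accuracies), both in %."""
    records = list(records)
    avg = 100.0 * sum(int(ok) for _, ok in records) / len(records)
    per_cat = per_category_accuracy(records)
    macro = sum(per_cat.values()) / len(per_cat)
    return avg, macro


def category_gains(baseline, routed):
    """Per-category accuracy difference (routed minus baseline), in points."""
    base, new = per_category_accuracy(baseline), per_category_accuracy(routed)
    return {c: new[c] - base[c] for c in base}


# Toy records: routing fixes one Motion question and leaves Counting unchanged.
baseline = [("motion", True), ("motion", False),
            ("counting", True), ("counting", True),
            ("counting", False), ("counting", True)]
routed = [("motion", True), ("motion", True),
          ("counting", True), ("counting", True),
          ("counting", False), ("counting", True)]

avg, macro = avg_and_macro(baseline)
gains = category_gains(baseline, routed)
print(avg, macro, gains)
```

Because the macro average weights every category equally, a gain in a small category moves it more than it moves the overall average, which is one reason the two headline numbers can differ.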
This is a practical lesson for any team deploying video QA: blind self-consistency can be worse than nothing. Instead, spend your compute budget on a second model for the tough 20%. No training required, just a disagreement check and a smart router.
Source: Disagreement-Based Cross-Model Routing for Implicit Video Question Answering
Domain: arxiv.org