CaVe-VLM-CoT achieves 87.1% accuracy on ScienceQA by enforcing evidence-grounded reasoning through a five-stage closed-loop pipeline that routes verification failures back to retrieval.
Why VLMs Still Hallucinate and Why Existing Fixes Fall Short
Standard vision-language models (VLMs) produce fluent outputs that look right but fabricate details. Chain-of-thought prompting and retrieval-augmented generation (RAG) help, but neither enforces step-level citation grounding. If the first retrieval returns irrelevant evidence, the model reasons over it anyway and has no way to recover. No existing system jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding in one pass.
CaVe-VLM-CoT addresses this by treating the reasoning process as a closed loop with explicit feedback. The pipeline consists of five modules: Extractor, Retriever, Solver, Citation Injector, and Verifier. When the Verifier detects an ungrounded claim, it sends structured feedback to the Extractor, which then triggers targeted re-retrieval instead of repeating the failed retrieval.
Five Stages, One Closed Loop: Verifier Failures Trigger Re-Retrieval
The Extractor parses the visual input and question into discrete atomic claims. The Retriever fetches relevant evidence for those claims from the image or an external knowledge base. The Solver performs chain-of-thought reasoning over the retrieved evidence. The Citation Injector attaches source references to each reasoning step. The Verifier then checks each step for citation faithfulness and cross-modal alignment. Any step that fails verification sends a correction signal back to the Extractor, which adjusts its extraction criteria and triggers a new retrieval cycle.
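The control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the Step type, the max_rounds cap, and the toy stand-in modules are all assumptions made here so the loop can run end to end. In a real system each callable would wrap a VLM or a retriever.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    text: str
    topic: str                      # hypothetical key linking a step to evidence
    citation: Optional[str] = None  # filled in by the citation injector


def run_closed_loop(
    question: str,
    extract: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[List[str]], Dict[str, str]],
    solve: Callable[[str, Dict[str, str]], List[Step]],
    inject_citations: Callable[[List[Step], Dict[str, str]], List[Step]],
    verify: Callable[[List[Step]], List[str]],
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> Tuple[List[Step], int, bool]:
    """Run extract -> retrieve -> solve -> cite -> verify, looping on failure."""
    feedback: List[str] = []
    steps: List[Step] = []
    for round_idx in range(1, max_rounds + 1):
        claims = extract(question, feedback)   # feedback adjusts extraction
        evidence = retrieve(claims)
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps)               # empty list means all steps pass
        if not feedback:
            return steps, round_idx, True
    return steps, max_rounds, False


# Toy stand-ins (hypothetical) so the loop can be exercised.
KNOWLEDGE = {
    "plant": "e1: plants absorb light",
    "light": "e2: light drives photosynthesis",
}


def toy_extract(question: str, feedback: List[str]) -> List[str]:
    # Start with one claim; add whatever the verifier flagged as ungrounded.
    return ["plant"] + feedback


def toy_retrieve(claims: List[str]) -> Dict[str, str]:
    return {c: KNOWLEDGE[c] for c in claims if c in KNOWLEDGE}


def toy_solve(question: str, evidence: Dict[str, str]) -> List[Step]:
    return [
        Step("Plants absorb light.", topic="plant"),
        Step("Light drives photosynthesis.", topic="light"),
    ]


def toy_inject(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    return [Step(s.text, s.topic, evidence.get(s.topic)) for s in steps]


def toy_verify(steps: List[Step]) -> List[str]:
    # Report the topic of every step that has no supporting citation.
    return [s.topic for s in steps if s.citation is None]
```

In this toy setup the first round leaves the second step uncited, the verifier reports its topic, and the extractor adds that topic to the next retrieval, so the loop converges on the second round. Passing the modules in as callables mirrors the article's point that they act as wrappers around an unchanged base model.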
This agentic-RAG design lets the system self-correct mid-inference without architectural changes or prompt tweaks. The authors report that the pipeline works with an unmodified base VLM, with no fine-tuning, because the modules operate as wrappers around the model.
23 Component-Wise Metrics and a Composite CaVeScore
No existing benchmark measures all the dimensions that matter for interpretable VLM reasoning. CaVe-VLM-CoT introduces a suite of 23 component-wise metrics spanning retrieval precision, step-level citation accuracy, and overall evidence grounding. The headline composite is CaVeScore, which combines accuracy, citation precision and recall, attribution, and evidence grounding into a single weighted number.
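One plausible shape for such a composite is a normalized weighted average of the component scores. The sketch below is an assumption for illustration only: the article does not give the actual CaVeScore formula or weights, so the function name, the component keys, and the equal default weights are all hypothetical.

```python
from typing import Dict, Optional

# Component names taken from the article's description; the keys are hypothetical.
COMPONENTS = (
    "accuracy",
    "citation_precision",
    "citation_recall",
    "attribution",
    "evidence_grounding",
)


def composite_score(
    metrics: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of component metrics, each expected in [0, 1].

    Equal weights are an assumption; the real CaVeScore weighting is not
    stated in the source.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    missing = [name for name in COMPONENTS if name not in metrics]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    total_weight = sum(weights[name] for name in COMPONENTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * metrics[name] for name in COMPONENTS)
    return weighted_sum / total_weight
```

A composite of this kind can sit well below raw accuracy whenever the citation and grounding components are weaker than answer correctness, which is one way to read a gap between an accuracy figure and a composite figure.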
On ScienceQA, the framework reaches 87.1% accuracy and a CaVeScore of 56.6%. On the much harder MMMU benchmark, which covers 30 subjects, it achieves 55.2% accuracy and a CaVeScore of 35.7%. Those numbers give a concrete baseline for any team trying to build citation-grounded VLMs.
By making the verification-revision loop explicit and measurable, CaVe-VLM-CoT gives the field a concrete recipe for building VLMs that cite their sources. Expect future work to extend the metric suite to more modalities and larger-scale evaluations.
Source: CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
Domain: arxiv.org