CaVe-VLM-CoT: Agentic RAG Pipeline Hits 87% on ScienceQA by Routing Verification Failures Back to Retrieval

CaVe-VLM-CoT detects ungrounded claims and triggers re-retrieval, reaching 87.1% accuracy on ScienceQA while introducing CaVeScore to measure citation faithfulness.

cave vlm cot, scienceqa, mmmu, vision language models, chain of thought, retrieval augmented generation

CaVe-VLM-CoT achieves 87.1% accuracy on ScienceQA by enforcing evidence-grounded reasoning through a five-stage closed-loop pipeline that routes verification failures back to retrieval.

Why VLMs Still Hallucinate and Why Existing Fixes Fall Short

Standard vision-language models produce fluent outputs that look right but fabricate details. Chain-of-thought prompting and retrieval-augmented generation help, but neither enforces step-level citation grounding. If the first retrieval returns irrelevant evidence, the model reasons over it anyway. According to the authors, no existing system jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding in one pass.

CaVe-VLM-CoT addresses this gap by treating the reasoning process as a closed loop with explicit feedback. The pipeline consists of five modules: Extractor, Retriever, Solver, Citation Injector, and Verifier. When the Verifier detects an ungrounded claim, it sends structured feedback to the Extractor, which then performs targeted re-retrieval instead of repeating the same failed query.

Five Stages, One Closed Loop: Verifier Failures Trigger Re-Retrieval

The Extractor parses the visual input and question into discrete atomic claims. The Retriever fetches relevant visual evidence from the image or an external knowledge base. The Solver performs chain-of-thought reasoning over the retrieved evidence. The Citation Injector attaches source references to each reasoning step. The Verifier then checks each step for citation faithfulness and cross-modal alignment. Any step that fails verification sends a correction signal back to the Extractor, which adjusts its extraction criteria and triggers a new retrieval cycle.
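The control flow of the five stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `run_closed_loop`, `Step`, and `Feedback`, the `max_rounds` cap, and the toy string-matching modules are all assumptions made for the example. In the real system each stage would wrap a VLM call.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    text: str
    citation: Optional[str] = None  # id of the evidence item backing this step


@dataclass
class Feedback:
    failed_steps: List[int] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.failed_steps


def run_closed_loop(question, extract, retrieve, solve, inject, verify, max_rounds=3):
    """Run the five stages; route Verifier failures back to the Extractor.

    max_rounds is a hypothetical safety cap, not a value from the source.
    """
    feedback = None
    steps = []
    for round_no in range(1, max_rounds + 1):
        claims = extract(question, feedback)   # Extractor (sees prior feedback)
        evidence = retrieve(claims)            # Retriever
        steps = solve(question, evidence)      # Solver
        steps = inject(steps, evidence)        # Citation Injector
        feedback = verify(steps, evidence)     # Verifier
        if feedback.ok:
            return steps, round_no
    return steps, max_rounds


# --- Toy stand-ins for the five modules (hypothetical, for illustration) ---
KB = {"e1": "plants absorb light", "e2": "chlorophyll is green"}


def extract(question, feedback):
    # First pass extracts one narrow claim; after a failure it broadens.
    return ["light"] if feedback is None else ["light", "chlorophyll"]


def retrieve(claims):
    return {k: v for k, v in KB.items() if any(c in v for c in claims)}


def solve(question, evidence):
    return [Step("plants absorb light"), Step("chlorophyll is green")]


def inject(steps, evidence):
    for s in steps:
        s.citation = next((k for k, v in evidence.items() if v == s.text), None)
    return steps


def verify(steps, evidence):
    # A step with no citation counts as ungrounded.
    return Feedback([i for i, s in enumerate(steps) if s.citation is None])


steps, rounds = run_closed_loop(
    "Why are leaves green?", extract, retrieve, solve, inject, verify
)
print(rounds, [s.citation for s in steps])
```

In this toy run the first retrieval misses the evidence for the second step, the Verifier flags it, and the Extractor's broadened claims let the second round ground every step.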

This agentic-RAG design means the system can self-correct mid-inference without architectural changes or prompt tweaks. The authors report that the same base VLM, without any fine-tuning, works with the pipeline because the modules operate as wrappers around the model.

23 Component-Wise Metrics and a Composite CaVeScore

No existing benchmark measured all the dimensions that matter for interpretable VLM reasoning. CaVe-VLM-CoT introduces a suite of 23 component-wise metrics spanning retrieval precision, step-level citation accuracy, and overall evidence grounding. The headline composite is CaVeScore, which weights accuracy, citation precision and recall, attribution, and evidence grounding into a single number.
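To make the idea of a weighted composite concrete, here is a rough sketch. The article does not give the CaVeScore weights or the exact component definitions, so everything below is an assumption: the equal default weights, the set-based definition of citation precision and recall, and the function names `citation_precision_recall` and `composite_score` are hypothetical and should not be read as the paper's formula.

```python
def citation_precision_recall(cited, relevant):
    """Set-based precision/recall over evidence ids (assumed definition)."""
    cited, relevant = set(cited), set(relevant)
    hits = len(cited & relevant)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def composite_score(components, weights=None):
    """Weighted average of component scores in [0, 1].

    Equal weights are a placeholder; the actual CaVeScore weighting
    is not stated in the article.
    """
    if weights is None:
        weights = {name: 1.0 for name in components}
    total = sum(weights[name] for name in components)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * components[name] for name in components) / total


precision, recall = citation_precision_recall(
    cited=["e1", "e2", "e3"], relevant=["e1", "e2", "e4", "e5"]
)
score = composite_score(
    {
        "accuracy": 1.0,
        "citation_precision": precision,
        "citation_recall": recall,
        "attribution": 0.5,          # made-up illustrative value
        "evidence_grounding": 0.5,   # made-up illustrative value
    }
)
print(round(precision, 3), round(recall, 3), round(score, 3))
```

A composite of this shape explains why a system can post high accuracy and a much lower composite: correct answers with weakly cited reasoning steps pull the average down.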

On ScienceQA, the framework scores 87.1% accuracy and 56.6% CaVeScore. On the much harder MMMU benchmark covering 30 subjects, it achieves 55.2% accuracy and 35.7% CaVeScore. Those numbers give a concrete baseline for any team trying to build citation-grounded VLMs.

By making the verification-revision loop explicit and measurable, CaVe-VLM-CoT gives the field a concrete recipe for building VLMs that cite their sources. Expect future work to extend the metric suite to more modalities and larger-scale evaluations.


Source: CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
Domain: arxiv.org
