
RegimeVGGT Cuts Cross-Frame Attention Cost 6.7x With Regime-Aware Compression

arxiv.org · @frontier_wire · 2 hours ago · Artificial Intelligence

A training-free acceleration method for VGGT identifies three distinct attention regimes across the network's depth and applies a U-shaped compression profile, reaching a 6.7x speedup at matched reconstruction quality.

regimevggt, vggt, visual geometry grounded transformer, 3d scene reconstruction, attention mechanism, training free acceleration

VGGT's cross-frame attention scales quadratically with the total number of tokens across input views, which makes it the scalability bottleneck for dense 3D scene reconstruction from multi-view images. RegimeVGGT reports a 6.7x speedup with no retraining and at matched reconstruction quality.
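To make the quadratic scaling concrete, here is a minimal back-of-the-envelope sketch. It counts the query-key pairs a single global attention layer has to score. The function name, the 1,000 tokens per frame, and the keep ratios are illustrative assumptions, not figures from the paper.

```python
def attention_pairs(frames, tokens_per_frame, q_keep=1.0, kv_keep=1.0):
    """Count query-key pairs scored by one global (cross-frame) attention layer.

    q_keep and kv_keep are hypothetical fractions of tokens that survive
    compression on the query side and the key/value side respectively.
    """
    total = frames * tokens_per_frame
    queries = round(total * q_keep)
    keys = round(total * kv_keep)
    return queries * keys


if __name__ == "__main__":
    # Doubling the number of frames quadruples the uncompressed cost.
    for frames in (10, 20, 40):
        print(frames, attention_pairs(frames, 1000))

    # Keeping a quarter of the keys/values cuts the pair count by 4x.
    full = attention_pairs(20, 1000)
    compressed = attention_pairs(20, 1000, kv_keep=0.25)
    print(full / compressed)
```

The pair count is only a proxy for attention FLOPs. End-to-end speedup also depends on the layers that are left uncompressed and on everything outside attention.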

Three Distinct Attention Regimes in VGGT

Not all cross-frame attention layers are equal, and treating them uniformly is wasteful. The RegimeVGGT team ran spectral, probing, and causal analyses across VGGT's depth and found three clear regimes: shallow layers carry almost no cross-view structure, middle layers drive the actual cross-view alignment, and deep layers are redundant for dense geometry but remain essential for pose estimation. That heterogeneity is the key to aggressive targeted compression.
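One way to picture how the three regimes could translate into per-layer budgets is the sketch below. The regime boundaries (equal thirds of the depth) and the keep ratios are hypothetical placeholders; the article does not give the actual layer ranges or ratios.

```python
# Hypothetical policy: keep ratios are placeholders, not values from the paper.
POLICY = {
    "shallow": {"kv_keep": 0.25, "why": "almost no cross-view structure"},
    "middle": {"kv_keep": 1.00, "why": "drives cross-view alignment"},
    "deep": {"kv_keep": 0.25, "why": "redundant for geometry, pose path protected"},
}


def regime(layer, num_layers):
    """Assign a layer to a regime, assuming equal thirds (an illustration only)."""
    third = num_layers // 3
    if layer < third:
        return "shallow"
    if layer < num_layers - third:
        return "middle"
    return "deep"


def schedule(num_layers):
    """Per-layer key/value keep ratio implied by the policy above."""
    return [POLICY[regime(i, num_layers)]["kv_keep"] for i in range(num_layers)]


if __name__ == "__main__":
    for i, keep in enumerate(schedule(12)):
        print(i, regime(i, 12), keep)
```

Under these assumptions compression is strong at both ends of the network and absent in the middle, so compression strength plotted against depth traces a U.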

U-Shaped Compression: Saliency-Guided Merging and Protected Downsampling

RegimeVGGT applies a layer-wise U-shaped compression profile along two axes. Saliency-Guided Banded Merging protects geometry- and edge-salient tokens from being discarded. Meanwhile, Selectively Protected K/V Downsampling preserves cross-frame spatial coverage by using a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera and register tokens. The pose-critical path stays untouched, while geometry-heavy layers get compressed aggressively.
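The K/V downsampling idea can be sketched as index selection. This is one interpretation, not the authors' implementation: each non-reference frame keeps a strided grid of patch tokens, the grid offset (phase) changes from frame to frame so that the frames together still cover every spatial position, the reference frame keeps everything, and special tokens are never dropped. The token layout, `num_special=5`, and the choice of frame 0 as the reference are assumptions.

```python
def kept_kv_indices(frame, height, width, stride, num_special=5, reference_frame=0):
    """Indices of one frame's tokens to keep as keys/values.

    Assumed layout per frame: [special tokens..., patch tokens in row-major order].
    Special tokens stand in for the camera and register tokens.
    """
    special = list(range(num_special))  # always kept, never downsampled

    if frame == reference_frame:
        # Reference-frame anchor: keep every patch token.
        return special + [num_special + p for p in range(height * width)]

    # Phase-shifted grid: cycle through the stride*stride possible offsets.
    row_phase, col_phase = divmod(frame % (stride * stride), stride)
    patches = [
        num_special + r * width + c
        for r in range(height)
        for c in range(width)
        if r % stride == row_phase and c % stride == col_phase
    ]
    return special + patches


if __name__ == "__main__":
    # 4x4 patch grid, stride 2: each non-reference frame keeps 4 of 16 patches.
    for f in range(5):
        print(f, kept_kv_indices(f, height=4, width=4, stride=2))
```

With stride 2, any four consecutive non-reference frames use all four phases, so their kept patches jointly cover the full grid even though each frame contributes only a quarter of its keys and values.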

Training-Free and Matching Quality

No retraining, no fine-tuning. RegimeVGGT reports a 6.7x speedup over VGGT* at matched reconstruction quality, so pipelines that already use VGGT to recover dense 3D structure from multi-view images can adopt it without changing model weights.

RegimeVGGT shows that layer-specific compression strategies can unlock large speedups for transformer-based 3D vision models. Similar approaches may carry over to other cross-attention-heavy architectures whose layer-wise redundancy has not yet been analyzed.

Source: RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Domain: arxiv.org
