VGGT's cross-frame attention scales quadratically with the number of tokens across all input views, which makes it the main scalability bottleneck for dense 3D scene reconstruction from multi-view images. RegimeVGGT reports a 6.7x speedup over VGGT* with no retraining and matched reconstruction quality.
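To make the quadratic growth concrete, here is a minimal back-of-the-envelope sketch. It only counts query-key pairs in a single cross-frame attention layer; the function name, the 1,000 tokens per frame, and the `kv_keep_ratio` parameter are illustrative assumptions, not values from the paper, and the count ignores heads, feature dimension, and every other layer type.

```python
def cross_frame_attention_pairs(num_frames: int,
                                tokens_per_frame: int,
                                kv_keep_ratio: float = 1.0) -> int:
    """Count query-key pairs for one cross-frame attention layer.

    Hypothetical cost model: every token in every frame is a query, and
    a fraction `kv_keep_ratio` of all tokens is kept as keys/values.
    """
    num_queries = num_frames * tokens_per_frame
    num_keys = int(round(num_queries * kv_keep_ratio))
    return num_queries * num_keys


if __name__ == "__main__":
    # Doubling the number of frames quadruples the pair count.
    for frames in (10, 20, 40):
        full = cross_frame_attention_pairs(frames, 1000)
        reduced = cross_frame_attention_pairs(frames, 1000, kv_keep_ratio=0.25)
        print(f"{frames} frames: full={full}, quarter K/V={reduced}")
```

Under this toy model, shrinking the key/value set reduces the pair count linearly while every query still gets an output. The reported 6.7x figure is an end-to-end measurement from the paper and is not derived from this count.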
Three Distinct Attention Regimes in VGGT
Not all cross-frame attention layers are equal, and treating them uniformly is wasteful. The RegimeVGGT team ran spectral, probing, and causal analyses across VGGT's depth and found three clear regimes: shallow layers carry almost no cross-view structure, middle layers drive the actual cross-view alignment, and deep layers are redundant for dense geometry but remain essential for pose estimation. That heterogeneity is the key to aggressive targeted compression.
U-Shaped Compression: Saliency-Guided Merging and Protected Downsampling
RegimeVGGT applies a layer-wise U-shaped compression profile along two axes: token merging and key/value (K/V) downsampling. Saliency-Guided Banded Merging protects geometry- and edge-salient tokens from being discarded. Selectively Protected K/V Downsampling preserves cross-frame spatial coverage by using a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera and register tokens. The pose-critical path stays untouched, while geometry-heavy layers are compressed aggressively.
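The K/V downsampling idea can be illustrated with a small index-selection sketch. This is not the authors' implementation: the token layout (special tokens first, then row-major patch tokens), the function name `kv_keep_indices`, the `stride` parameter, and the rule that derives each frame's grid offset from its frame index are all assumptions made for illustration. The sketch keeps every camera and register token, keeps the reference frame in full, and samples a strided patch grid in the other frames with a per-frame phase shift so that different frames cover different spatial positions.

```python
import numpy as np


def kv_keep_indices(num_frames: int, grid_h: int, grid_w: int,
                    num_special: int, stride: int,
                    ref_frame: int = 0) -> np.ndarray:
    """Return flat indices of the key/value tokens to keep.

    Assumed layout per frame (hypothetical):
    [num_special camera/register tokens, grid_h * grid_w patch tokens].
    """
    tokens_per_frame = num_special + grid_h * grid_w
    keep = []
    for f in range(num_frames):
        base = f * tokens_per_frame
        # Camera and register tokens are never compressed.
        keep.append(base + np.arange(num_special))
        patch_base = base + num_special
        if f == ref_frame:
            # Reference-frame anchor: keep every patch token.
            keep.append(patch_base + np.arange(grid_h * grid_w))
            continue
        # Phase-shifted grid: the offset depends on the frame index, so
        # neighbouring frames sample complementary spatial positions.
        phase_y = f % stride
        phase_x = (f // stride) % stride
        ys = np.arange(phase_y, grid_h, stride)
        xs = np.arange(phase_x, grid_w, stride)
        grid = ys[:, None] * grid_w + xs[None, :]
        keep.append(patch_base + grid.ravel())
    return np.concatenate(keep)


if __name__ == "__main__":
    idx = kv_keep_indices(num_frames=3, grid_h=4, grid_w=4,
                          num_special=2, stride=2)
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((3 * 18, 8))  # (total tokens, feature dim)
    keys_small = keys[idx]                   # queries stay at full length
    print(keys.shape, "->", keys_small.shape)
```

Only keys and values are subsampled here, so every query token still produces an output. In a layer-wise scheme, `stride` would vary with depth; how the paper actually sets that schedule is not specified in this summary.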
Training-Free at Matched Quality
No retraining, no fine-tuning. RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality. For pipelines that use VGGT to recover dense 3D structure from multi-view images, that translates directly into shorter inference time.
RegimeVGGT shows that layer-specific compression strategies can unlock large speedups for transformer-based 3D vision models. Similar approaches may carry over to other cross-attention-heavy architectures whose layer-wise redundancy has not yet been analyzed.
Source: RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Domain: arxiv.org