
RegimeVGGT cuts cross-frame attention cost by 6.7x with regime-aware compression.

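A training-free acceleration method for VGGT that identifies three distinct attention regimes and applies U-shaped compression to reach a 6.7x speedup with no loss in quality.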

regimevggt, vggt, visual geometry grounded transformer, 3d scene reconstruction, attention mechanism, training free acceleration

VGGT's quadratic cross-frame attention is the bottleneck that kills scalability for dense 3D scene reconstruction from multi-view images. RegimeVGGT cuts that cost by 6.7x without retraining or quality loss.
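To make the quadratic scaling concrete, here is a minimal counting sketch. It is not taken from the paper: the function name `attention_pair_count`, the `kv_keep` parameter, and the frame and token counts are all assumptions chosen for illustration. It only counts query-key pairs in one global attention layer, a rough proxy for cost.

```python
def attention_pair_count(num_frames: int, tokens_per_frame: int,
                         kv_keep: float = 1.0) -> int:
    """Query-key pairs scored by one global (cross-frame) attention layer.

    kv_keep is the fraction of key/value tokens retained (1.0 = no compression).
    """
    n_query = num_frames * tokens_per_frame
    n_kv = int(n_query * kv_keep)
    return n_query * n_kv


# Hypothetical sizes, chosen only for illustration.
base = attention_pair_count(num_frames=8, tokens_per_frame=1000)
doubled = attention_pair_count(num_frames=16, tokens_per_frame=1000)
quarter_kv = attention_pair_count(num_frames=8, tokens_per_frame=1000, kv_keep=0.25)

print(doubled / base)      # doubling the frames quadruples the pairs
print(base / quarter_kv)   # keeping a quarter of K/V cuts the pairs by 4x
```

The second ratio is why trimming keys and values pays off: the pair count falls linearly with the fraction of K/V tokens kept, even when every query is retained.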

Three Distinct Attention Regimes in VGGT

Not all cross-frame attention layers are equal, and treating them uniformly is wasteful. The RegimeVGGT team ran spectral, probing, and causal analyses across VGGT's depth and found three clear regimes: shallow layers carry almost no cross-view structure, middle layers drive the actual cross-view alignment, and deep layers are redundant for dense geometry but remain essential for pose estimation. That heterogeneity is the key to aggressive targeted compression.
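One way to picture the three regimes is as a per-layer schedule. The sketch below is an illustration under stated assumptions, not the paper's configuration: the cut points (`shallow_end`, `deep_start`), the keep ratios, and the layer count are hypothetical. It also ignores the pose-critical tokens that the deep layers still need.

```python
from enum import Enum


class Regime(Enum):
    SHALLOW = "shallow"   # little cross-view structure
    MIDDLE = "middle"     # drives cross-view alignment
    DEEP = "deep"         # redundant for dense geometry, needed for pose


def regime_for_layer(layer: int, num_layers: int,
                     shallow_end: float = 0.25,
                     deep_start: float = 0.75) -> Regime:
    """Assign a regime from relative depth. Both cut points are assumptions."""
    if not 0 <= layer < num_layers:
        raise ValueError("layer index out of range")
    depth = layer / num_layers
    if depth < shallow_end:
        return Regime.SHALLOW
    if depth < deep_start:
        return Regime.MIDDLE
    return Regime.DEEP


# Hypothetical keep ratios: compress hard at both ends, lightly in the middle.
KEEP_RATIO = {Regime.SHALLOW: 0.25, Regime.MIDDLE: 0.75, Regime.DEEP: 0.25}


def keep_schedule(num_layers: int) -> list[float]:
    """Fraction of tokens kept at each layer, indexed by depth."""
    return [KEEP_RATIO[regime_for_layer(i, num_layers)] for i in range(num_layers)]


schedule = keep_schedule(8)
print(schedule)
```

In practice the regime boundaries would come from the kind of spectral, probing, and causal analysis the authors describe, not from fixed fractions of depth.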

U-Shaped Compression: Saliency-Guided Merging and Protected Downsampling

RegimeVGGT applies a layer-wise U-shaped compression profile along two axes. Saliency-Guided Banded Merging protects geometry- and edge-salient tokens from being merged away. Meanwhile, Selectively Protected K/V Downsampling preserves cross-frame spatial coverage by using a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera and register tokens. The pose-critical path stays untouched, while layers that are redundant for dense geometry get compressed aggressively.
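The K/V downsampling idea can be sketched as an index-selection step. This article does not give the exact phase-shift scheme, so the construction below is one plausible reading, not the authors' implementation. The function `kept_kv_indices`, the token layout, and the per-frame offset rule are assumptions.

```python
def kept_kv_indices(num_frames, grid_h, grid_w, stride,
                    num_special, reference_frame=0):
    """Return, per frame, the token indices kept as keys/values.

    Layout assumption: each frame holds num_special camera/register tokens
    followed by grid_h * grid_w patch tokens in row-major order.
    """
    kept = []
    for f in range(num_frames):
        special = list(range(num_special))  # camera/register tokens: never dropped
        if f == reference_frame:
            patches = range(grid_h * grid_w)  # anchor frame keeps every patch
        else:
            # Shift the sampling grid per frame so that different frames
            # cover different cells (the offset rule is an assumption).
            dy = (f // stride) % stride
            dx = f % stride
            patches = [r * grid_w + c
                       for r in range(dy, grid_h, stride)
                       for c in range(dx, grid_w, stride)]
        kept.append(special + [num_special + p for p in patches])
    return kept


# Hypothetical sizes: 5 frames, a 4x4 patch grid, stride 2, 5 special tokens.
kept = kept_kv_indices(num_frames=5, grid_h=4, grid_w=4, stride=2, num_special=5)
covered = set().union(*(set(k) for k in kept[1:]))
print(len(kept[0]), [len(k) for k in kept[1:]], len(covered))
```

With a stride of 2, each non-reference frame keeps a quarter of its patches, yet the four shifted grids together still touch every cell. That is the sense in which spatial coverage across frames is preserved.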

Training-Free and Matching Quality

No retraining, no fine-tuning. RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality. Any pipeline that uses VGGT to recover dense 3D structure from multi-view images stands to benefit directly.

RegimeVGGT shows that layer-specific compression strategies can unlock huge speedups for transformer-based 3D vision models. Expect similar approaches to generalize to other cross-attention-heavy architectures where layer-wise redundancy analysis has been ignored.


Source: RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Domain: arxiv.org
