
RegimeVGGT Cuts Cross-Frame Attention Cost 6.7x with Regime-Aware Compression

A training-free acceleration method for VGGT identifies three distinct attention regimes and applies U-shaped compression to achieve a 6.7x speedup with no loss of quality.

Tags: regimevggt, vggt, visual geometry grounded transformer, 3d scene reconstruction, attention mechanism, training free acceleration

VGGT's quadratic cross-frame attention is the bottleneck that kills scalability for dense 3D scene reconstruction from multi-view images. RegimeVGGT cuts that cost by 6.7x without retraining or quality loss.
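To make the quadratic scaling concrete, here is a minimal sketch that counts the query-key pairs a single global attention layer scores when every token of every frame attends to every other. The function name, the frame and token counts, and the `q_keep` / `kv_keep` ratios are illustrative assumptions, not figures from the RegimeVGGT paper.

```python
def cross_frame_attention_pairs(num_frames: int, tokens_per_frame: int,
                                q_keep: float = 1.0,
                                kv_keep: float = 1.0) -> int:
    """Count query-key pairs scored by one global (cross-frame) attention layer.

    q_keep and kv_keep are hypothetical fractions of tokens that survive
    compression on the query side and the key/value side respectively.
    """
    total_tokens = num_frames * tokens_per_frame
    num_queries = int(total_tokens * q_keep)
    num_keys = int(total_tokens * kv_keep)
    return num_queries * num_keys


# Doubling the number of frames quadruples the work.
base = cross_frame_attention_pairs(10, 1000)
doubled = cross_frame_attention_pairs(20, 1000)
print(doubled / base)

# Keeping only a quarter of the keys/values cuts the work by the same factor.
compressed = cross_frame_attention_pairs(10, 1000, kv_keep=0.25)
print(base / compressed)
```

The count is a rough proxy for the cost of the score and value matrix products. It ignores projections, MLPs, and memory traffic, so it should not be read as a prediction of end-to-end speedup.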

Three Distinct Attention Regimes in VGGT

Not all cross-frame attention layers are equal, and treating them uniformly is wasteful. The RegimeVGGT team ran spectral, probing, and causal analyses across VGGT's depth and found three clear regimes: shallow layers carry almost no cross-view structure, middle layers drive the actual cross-view alignment, and deep layers are redundant for dense geometry but remain essential for pose estimation. That heterogeneity is the key to aggressive targeted compression.
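As a rough illustration of the three regimes, the sketch below assigns a layer to a regime by its relative depth. The boundary fractions (`shallow_frac`, `deep_frac`) and the 24-layer depth in the usage lines are hypothetical placeholders. The article does not state where the regime boundaries fall.

```python
from enum import Enum


class Regime(Enum):
    SHALLOW = "shallow"  # almost no cross-view structure
    MIDDLE = "middle"    # drives cross-view alignment
    DEEP = "deep"        # redundant for dense geometry, essential for pose


def regime_for_layer(layer: int, num_layers: int,
                     shallow_frac: float = 0.25,
                     deep_frac: float = 0.25) -> Regime:
    """Assign a cross-frame attention layer to a regime by relative depth.

    shallow_frac and deep_frac are assumed boundaries for illustration,
    not values reported for RegimeVGGT.
    """
    if not 0 <= layer < num_layers:
        raise ValueError("layer index out of range")
    depth = layer / num_layers  # relative depth in [0, 1)
    if depth < shallow_frac:
        return Regime.SHALLOW
    if depth >= 1.0 - deep_frac:
        return Regime.DEEP
    return Regime.MIDDLE


# Illustrative depth of 24 layers.
for layer in (0, 6, 17, 18, 23):
    print(layer, regime_for_layer(layer, 24).value)
```

In practice the boundaries would come from the kind of spectral, probing, and causal measurements described above rather than from fixed fractions.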

U-Shaped Compression: Saliency-Guided Merging and Protected Downsampling

RegimeVGGT applies a layer-wise U-shaped compression profile along two axes. Saliency-Guided Banded Merging protects geometry- and edge-salient tokens from being discarded. Meanwhile, Selectively Protected K/V Downsampling preserves cross-frame spatial coverage by using a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera and register tokens. The pose-critical path stays untouched, while geometry-heavy layers get compressed aggressively.
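The two ideas can be sketched in a few lines of Python. This is one plausible reading, not the authors' implementation. It assumes "U-shaped" means stronger compression at the shallow and deep ends of the stack and lighter compression in the middle, that the phase shift gives each frame a different offset on a strided patch grid, and that each frame's tokens are laid out as special tokens (camera and register) followed by row-major patch tokens. All function names, strides, and token counts are hypothetical.

```python
from typing import List


def u_shaped_stride(layer: int, num_layers: int,
                    max_stride: int = 4, min_stride: int = 1) -> int:
    """Hypothetical U-shaped schedule: large stride (strong compression)
    at both ends of the stack, small stride in the middle."""
    if num_layers < 2:
        return min_stride
    mid = (num_layers - 1) / 2
    dist = abs(layer - mid) / mid  # 0 at the middle, 1 at either end
    return min_stride + round((max_stride - min_stride) * dist ** 2)


def kv_keep_indices(frame: int, grid_h: int, grid_w: int, stride: int,
                    num_special: int, reference_frame: int = 0) -> List[int]:
    """Return the per-frame token indices kept as keys/values."""
    # Camera and register tokens are never compressed.
    keep = list(range(num_special))
    if frame == reference_frame or stride == 1:
        # Reference-frame anchor: keep every patch token.
        keep.extend(range(num_special, num_special + grid_h * grid_w))
        return keep
    # Phase shift: each frame samples a different offset of the strided
    # grid, so the kept positions jointly cover more of the image.
    phase = frame % (stride * stride)
    off_y, off_x = divmod(phase, stride)
    for y in range(off_y, grid_h, stride):
        for x in range(off_x, grid_w, stride):
            keep.append(num_special + y * grid_w + x)
    return keep


# A 4x4 patch grid, stride 2, five special tokens (all illustrative).
for frame in range(5):
    print(frame, kv_keep_indices(frame, 4, 4, stride=2, num_special=5))
```

With stride 2, four consecutive non-reference frames each keep a different quarter of the patch grid, and together they cover every patch position. That is the coverage property the phase shift is meant to preserve in this reading. Saliency-guided merging is not shown here, since the article does not describe how saliency is scored.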

Training-Free and Matching Quality

No retraining, no fine-tuning. RegimeVGGT achieves 6.7x speedup over VGGT* at matched reconstruction quality. That's a straight multiplier for any pipeline using VGGT to recover dense 3D structure from multi-view images.

RegimeVGGT shows that layer-specific compression strategies can unlock huge speedups for transformer-based 3D vision models. Expect similar approaches to generalize to other cross-attention-heavy architectures where layer-wise redundancy analysis has been ignored.


Source: RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Domain: arxiv.org
