Source linked

RegimeVGGT réduit le coût de l'attention transversale de 6,7 fois avec la compression consciente du régime

Une méthode d'accélération sans entraînement pour VGGT identifie trois régimes d'attention distincts et applique la compression en forme de U pour atteindre une accélération de 6,7 fois sans perte de qualité.

regimevggtvggtvisual geometry grounded transformer3d scene reconstructionattention mechanismtraining free acceleration

VGGT's quadratic cross-frame attention is the bottleneck that kills scalability for dense 3D scene reconstruction from multi-view images. RegimeVGGT cuts that cost by 6.7x without retraining or quality loss.

Three Distinct Attention Regimes in VGGT

Not all cross-frame attention layers are equal, and treating them uniformly is wasteful. The RegimeVGGT team ran spectral, probing, and causal analyses across VGGT's depth and found three clear regimes: shallow layers carry almost no cross-view structure, middle layers drive the actual cross-view alignment, and deep layers are redundant for dense geometry but remain essential for pose estimation. That heterogeneity is the key to aggressive targeted compression.

U-Shaped Compression: Saliency-Guided Merging and Protected Downsampling

RegimeVGGT applies a layer-wise U-shaped compression profile along two axes. Saliency-Guided Banded Merging protects geometry- and edge-salient tokens from being discarded. Meanwhile, Selectively Protected K/V Downsampling preserves cross-frame spatial coverage by using a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera and register tokens. The pose-critical path stays untouched, while geometry-heavy layers get compressed aggressively.

Training-Free and Matching Quality

No retraining, no fine-tuning. RegimeVGGT achieves 6.7x speedup over VGGT* at matched reconstruction quality. That's a straight multiplier for any pipeline using VGGT to recover dense 3D structure from multi-view images.

RegimeVGGT shows that layer-specific compression strategies can unlock huge speedups for transformer-based 3D vision models. Expect similar approaches to generalize to other cross-attention-heavy architectures where layer-wise redundancy analysis has been ignored.


Source: RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.