LEAP curriculum cuts ViT training FLOPs by 25% while boosting ImageNet-100 accuracy 12 points

A new training curriculum for feature-based knowledge distillation lets a ViT-S student hit 90.1% on ImageNet-100, a +12.24% improvement over baseline, while saving a quarter of training compute.

A ViT-S student distilled from a DINOv2 teacher using the LEAP curriculum hits 90.1% top-1 accuracy on ImageNet-100. That is a +12.24% jump over the baseline student trained without the curriculum, and it comes while saving 25.1% of training FLOPs and 21% of wall-clock training time.

The problem: teacher-student gap in feature-based distillation

Feature-based knowledge distillation for Vision Transformers has a fundamental bottleneck: a small student cannot faithfully mimic a large teacher's complex intermediate feature maps. The student's limited capacity means it wastes epochs trying to reproduce patterns it is not yet ready to understand. LEAP sidesteps this by turning the teacher's feature hierarchy into a curriculum.

How LEAP works: adaptive difficulty progression

LEAP treats the 12 intermediate feature maps of a ViT teacher (at each block output) as a sequence of progressively harder targets. Early in training, the student only distills from the first few layers. As training proceeds and the student builds a solid low-level representation, LEAP automatically selects deeper features to match. This adaptive progression accelerates convergence and prevents the student from getting stuck on abstractions it cannot handle.
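To make the progression concrete, here is a minimal sketch of a depth curriculum in Python. It is not the authors' implementation: the class name `DepthCurriculum`, the starting depth of 3, the loss threshold, and the rule "unlock one deeper layer when the loss on the current deepest target drops below the threshold" are all assumptions chosen for illustration. The article only says that LEAP selects deeper features adaptively as the student improves.

```python
from dataclasses import dataclass


@dataclass
class DepthCurriculum:
    """Hypothetical adaptive curriculum over teacher block outputs."""

    num_layers: int = 12    # teacher blocks, as described in the article
    start_depth: int = 3    # assumption: begin with the first few layers
    threshold: float = 0.1  # assumption: loss level that counts as "matched"
    depth: int = 0          # set in __post_init__

    def __post_init__(self) -> None:
        self.depth = min(self.start_depth, self.num_layers)

    def active_layers(self) -> list[int]:
        """0-based indices of the teacher blocks currently used as targets."""
        return list(range(self.depth))

    def update(self, deepest_layer_loss: float) -> int:
        """Unlock one deeper layer once the current deepest target is matched."""
        if deepest_layer_loss < self.threshold and self.depth < self.num_layers:
            self.depth += 1
        return self.depth


# Illustrative loss values, not measurements from the paper.
curriculum = DepthCurriculum()
for loss in [0.5, 0.09, 0.3, 0.05]:
    curriculum.update(loss)
print(curriculum.active_layers())  # [0, 1, 2, 3, 4]
```

In a real training loop the loss passed to `update` would be the student's feature-matching loss against the deepest active teacher layer, averaged over some window. The released repository is the place to check which signal LEAP actually uses.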

The authors release code at https://github.com/KevinZ0217/LEAP. Their experiments cover multiple student sizes and training datasets. On ImageNet-100, the curriculum also allows teacher forward passes to stop early during the initial training stages, which is where the 25.1% FLOPs saving and 21% time saving come from.
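The source of the saving can be illustrated with some back-of-the-envelope arithmetic. If the student only needs targets from the first d teacher blocks, the teacher can stop after block d and skip the rest. The sketch below counts skipped teacher block evaluations for a given depth schedule. The function name `teacher_block_savings` and the example schedule are hypothetical, and the result is not meant to reproduce the 25.1% figure: it ignores the student's forward and backward passes, the patch embedding, and everything else that contributes to total training FLOPs.

```python
def teacher_block_savings(depth_per_epoch: list[int], num_layers: int = 12) -> float:
    """Fraction of teacher block evaluations skipped when the teacher's
    forward pass stops at the deepest block used as a target in each epoch."""
    if not depth_per_epoch:
        raise ValueError("depth_per_epoch must not be empty")
    if any(d < 1 or d > num_layers for d in depth_per_epoch):
        raise ValueError("each depth must be between 1 and num_layers")
    executed = sum(depth_per_epoch)             # blocks actually run
    full = num_layers * len(depth_per_epoch)    # blocks run without early stopping
    return 1.0 - executed / full


# Hypothetical schedule: four equal training phases using 3, 6, 9, then 12 blocks.
print(teacher_block_savings([3, 6, 9, 12]))  # 0.375
```

Because the teacher is only one part of the training cost, the end-to-end saving is smaller than the teacher-only fraction computed here.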

Results that matter for edge deployment

At ImageNet-1K scale, LEAP delivers +3.84% and +7.75% improvements on the Oxford and Paris instance retrieval benchmarks respectively. Those are non-trivial gains for a task that demands fine-grained feature discrimination. For anyone shipping a ViT-based model to a phone, drone, or embedded camera, shaving a quarter off training cost while gaining a double-digit accuracy improvement on a 100-class benchmark means the tradeoff between model size and performance just shifted.

The obvious next question: can this curriculum transfer to other backbone families or to multimodal models? The authors frame LEAP specifically for ViT feature distillation, but the core idea of ordering the teacher's intermediate representations by difficulty is general enough to try on ConvNeXt or hybrid architectures.


Source: LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation
Domain: arxiv.org
