Moebius Packs 10B-Level Image Inpainting Into 0.2B Parameters - 15x Faster

0.22 billion parameters. That's less than 2% of the 11.9B-parameter FLUX.1-Fill-Dev, yet Moebius matches or beats it across six benchmarks covering natural scenes and portraits. Kangsheng Duan, Ziyang Xu, and colleagues from Huazhong University of Science and Technology and VIVO AI Lab just published the details, and the reported numbers are striking: 26.01 ms per step, a 15x total inference speedup, and quality that holds its own against 10B-level industrial models.

Architecture Hack: Local-λ Mix Interaction Blocks

Moebius doesn't just shrink an existing U-Net. The team rebuilt the diffusion backbone from the ground up around their LλMI block. Each block contains two modules: Local-λ handles spatial context, and Interactive-λ condenses global semantic priors. Both compress their context into fixed-size linear matrices, sidestepping the quadratic scaling of standard self- and cross-attention. This is where the parameter count plummets without collapsing representational capacity. The block design is the engine that makes extreme compression viable.
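
To make the "fixed-size linear matrix" idea concrete, here is a minimal pure-Python sketch of a generic lambda-style summary: the context is folded into a k x v matrix whose size does not depend on how many positions it covers, and each query then reads from that matrix. This is not the authors' LλMI block. The function names, the softmax normalization, and the toy dimensions are all assumptions chosen for illustration.

```python
import math

def softmax_over_positions(keys):
    """Normalize each key channel across the m context positions."""
    m, k = len(keys), len(keys[0])
    out = [[0.0] * k for _ in range(m)]
    for c in range(k):
        col = [keys[i][c] for i in range(m)]
        mx = max(col)  # subtract the max for numerical stability
        exps = [math.exp(x - mx) for x in col]
        total = sum(exps)
        for i in range(m):
            out[i][c] = exps[i] / total
    return out

def content_lambda(keys, values):
    """Fold m context positions into a k x v matrix: softmax(K)^T V."""
    norm_keys = softmax_over_positions(keys)
    k, v = len(keys[0]), len(values[0])
    lam = [[0.0] * v for _ in range(k)]
    for nk, val in zip(norm_keys, values):
        for a in range(k):
            for b in range(v):
                lam[a][b] += nk[a] * val[b]
    return lam

def apply_lambda(lam, queries):
    """Each query reads the summary: y_n = lambda^T q_n."""
    v = len(lam[0])
    return [[sum(q[a] * lam[a][b] for a in range(len(q))) for b in range(v)]
            for q in queries]

def make_context(m, k, v):
    """Hypothetical toy context: m positions of k-dim keys and v-dim values."""
    keys = [[math.sin(i * k + c) for c in range(k)] for i in range(m)]
    values = [[math.cos(i * v + c) for c in range(v)] for i in range(m)]
    return keys, values

# The summary stays 4 x 8 whether the context has 16 or 256 positions.
for m in (16, 256):
    keys, values = make_context(m, k=4, v=8)
    lam = content_lambda(keys, values)
    print(m, len(lam), len(lam[0]))
```

Building the summary costs work proportional to the number of context positions, and applying it costs work proportional to the number of queries, so no query-by-context attention map is ever materialized. That is the general reason designs of this kind avoid quadratic scaling.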

Distillation Without the Pixel-Space Tax

Shrinking the architecture alone would leave a gaping capacity hole. The remedy is an adaptive multi-granularity distillation strategy that operates entirely in latent space. No expensive pixel-space decoding during training. The student (Moebius) learns from the teacher (PixelHacker) at multiple granularities, including intermediate features and diffusion trajectory alignment. A gradient-norm-based weighting mechanism dynamically balances the losses, preventing any single objective from dominating. This lets the 0.22B model absorb the teacher's semantic reasoning without saturating its limited capacity.
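
The article does not give the weighting formula, so the following is only a sketch of one common way to balance losses by gradient norm: weight each loss inversely to the norm of its gradient, so that every objective contributes an equally sized update. The function names, the inverse-norm rule, and the example gradients are assumptions, not the Moebius training code.

```python
import math

def grad_norm(grad):
    """L2 norm of one loss's gradient with respect to the shared parameters."""
    return math.sqrt(sum(g * g for g in grad))

def balance_weights(grads, eps=1e-8):
    """Inverse-norm weights, rescaled so they sum to the number of losses."""
    inv = [1.0 / (grad_norm(g) + eps) for g in grads]
    total = sum(inv)
    return [len(grads) * x / total for x in inv]

def combined_gradient(grads, weights):
    """Weighted sum of the per-loss gradients."""
    dim = len(grads[0])
    return [sum(w * g[d] for w, g in zip(weights, grads)) for d in range(dim)]

# Hypothetical gradients: the feature loss is 10x larger than the trajectory loss.
feature_grad = [3.0, 4.0]      # norm 5.0
trajectory_grad = [0.3, 0.4]   # norm 0.5

weights = balance_weights([feature_grad, trajectory_grad])
update = combined_gradient([feature_grad, trajectory_grad], weights)

# After weighting, both losses push with (almost exactly) the same strength.
print([round(w * n, 4) for w, n in zip(weights, (5.0, 0.5))])
```

In a real training loop the norms would be recomputed periodically from the live gradients, which is what makes the weighting dynamic rather than a fixed hyperparameter.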

The payoff: Moebius outperforms FLUX.1-Fill-Dev on complex textures and facial plausibility in certain benchmarks, and matches it everywhere else. Not bad for a model that fits on a single consumer GPU and cranks out an inpainting step in 26 milliseconds. At that speed, real-time high-fidelity inpainting on edge devices starts to look practical.

Source: Moebius: 0.2B image inpainting model with 10B-level performance
Domain: hustvl.github.io
