
YOLO26 Drops NMS and DFL, Spans 40.9 mAP at 1.7 ms to 57.5 mAP at 11.8 ms on T4

Ultralytics YOLO26 eliminates non-maximum suppression and Distribution Focal Loss, using a dual-head design and a hybrid Muon-SGD optimizer to reach a state-of-the-art accuracy-latency Pareto front across five scales.

ultralytics, yolo26, computer vision, real time object detection, nms free inference

YOLO26 drops non-maximum suppression and Distribution Focal Loss entirely, and still reaches 57.5 mAP on COCO with its largest model, while the nano model runs in 1.7 milliseconds on a T4 with TensorRT. That's what happens when you redesign a detection head from scratch instead of bolting on hacks.

What YOLO26 Killed (and Why)

Most YOLO detectors still lean on non-maximum suppression to clean up duplicate boxes at inference time. YOLO26 uses a dual-head design that makes NMS unnecessary: one head handles one-to-many assignments during training, the other produces one-to-one predictions at inference. No post-processing, no tuning thresholds.
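To make the difference concrete, here is a minimal NumPy sketch contrasting classic greedy NMS with the kind of decoding a one-to-one head allows. It is an illustration, not the Ultralytics implementation: the function names, the 0.5 IoU threshold, and the 0.25 confidence threshold are all assumptions chosen for the example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic post-processing: keep the top box, drop its overlaps, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

def one_to_one_decode(scores, conf_thresh=0.25):
    """NMS-free decoding (assumed form): the head is trained to emit one
    confident box per object, so a plain confidence filter is enough."""
    return [int(i) for i in np.flatnonzero(scores >= conf_thresh)]

# Two overlapping boxes on one object plus one box on a second object.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
many_scores = np.array([0.9, 0.8, 0.7])   # one-to-many style: duplicate scores high
one_scores = np.array([0.9, 0.05, 0.7])   # one-to-one style: duplicate suppressed by training

print(greedy_nms(boxes, many_scores))   # indices kept after NMS
print(one_to_one_decode(one_scores))    # indices kept with no NMS step
```

In this sketch the NMS path needs pairwise IoU and a tuned overlap threshold, while the one-to-one path is a single vectorized comparison.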

Distribution Focal Loss also gets the axe. Earlier YOLOs used DFL in a heavy detection head to handle bounding box regression, but it constrained the regression range and bloated the model. YOLO26 removes DFL entirely, giving the head unconstrained regression and a lighter footprint. The architecture finally matches what practitioners actually want: a single forward pass with no extra compute.
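The range constraint is easy to see in a small sketch. A DFL-style head predicts a distribution over a fixed number of bins per box side and decodes the distance as the expected bin index, so the output can never exceed the last bin. The code below assumes 16 bins (the default in earlier Ultralytics heads); `direct_decode` is a hypothetical stand-in for an unconstrained regression output, not YOLO26's actual head.

```python
import numpy as np

def dfl_decode(logits, reg_max=16):
    """Decode a box-side distance from DFL-style bin logits.

    logits has shape (..., reg_max). The result is the expected bin index,
    which is bounded to the interval [0, reg_max - 1] (in stride units).
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    bins = np.arange(reg_max, dtype=probs.dtype)
    return (probs * bins).sum(axis=-1)

def direct_decode(raw):
    """Hypothetical direct regression: one scalar per side, no upper bound."""
    return np.maximum(raw, 0.0)

# Even with all probability mass pushed onto the last bin, DFL saturates.
saturated = np.zeros(16)
saturated[-1] = 100.0
print(dfl_decode(saturated))            # approaches reg_max - 1
print(direct_decode(np.array([40.0])))  # direct output is not capped
```

The DFL path also carries 16 logits per side where direct regression needs one scalar, which is where the lighter head comes from.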

Inside the New Training Pipeline

Three coordinated changes make the numbers possible. First, MuSGD: a hybrid Muon-SGD optimizer borrowed from large language model training and adapted for vision. Second, Progressive Loss, which shifts supervision from the one-to-many training head toward the one-to-one inference-time head over the training schedule. Third, STAL (Small-Target-Aware Label Assignment), which guarantees positive label assignments for the smallest objects, addressing a long-standing blind spot in YOLO.
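The supervision shift can be pictured as a weight schedule over two loss terms. The sketch below is a guess at the general shape only: the linear ramp, the 0.8 and 0.1 endpoints, and the function names are assumptions made for illustration, not the schedule Ultralytics uses.

```python
def progressive_weights(epoch, total_epochs, start=0.8, end=0.1):
    """Assumed linear ramp: the one-to-many head's weight decays from
    `start` to `end`, and the one-to-one head receives the remainder."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    w_o2m = start + (end - start) * t
    return w_o2m, 1.0 - w_o2m

def combined_loss(loss_o2m, loss_o2o, epoch, total_epochs):
    """Blend the two heads' losses with the epoch-dependent weights."""
    w_o2m, w_o2o = progressive_weights(epoch, total_epochs)
    return w_o2m * loss_o2m + w_o2o * loss_o2o

for epoch in (0, 50, 99):
    w_o2m, w_o2o = progressive_weights(epoch, total_epochs=100)
    print(f"epoch {epoch:3d}: one-to-many {w_o2m:.2f}, one-to-one {w_o2o:.2f}")
```

Early on, the dense one-to-many signal dominates and speeds up learning. By the end, most of the gradient comes from the head that is actually used at inference.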

These changes compound. Training schedules shorten, small object recall improves, and the model learns to predict directly without NMS at test time.

Scaling Across Tasks and Latency Budgets

YOLO26 comes in five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, oriented detection, and classification in a single codebase. Performance scales predictably: 40.9 mAP on the nano model at 1.7 ms, up to 57.5 mAP on the x-large at 11.8 ms on a T4. That's a clean Pareto front for real-time deployment.
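In practice, a Pareto front like this is used to pick a scale for a latency budget. The sketch below encodes only the two data points quoted above (nano and x-large); the helper names are hypothetical, and the s/m/l scales are left out because this article does not report their numbers.

```python
# (scale, T4 latency in ms, COCO mAP): only the two points quoted in the article.
REPORTED = [("n", 1.7, 40.9), ("x", 11.8, 57.5)]

def pick_scale(budget_ms, models=REPORTED):
    """Return the most accurate model whose latency fits the budget, or None."""
    fits = [m for m in models if m[1] <= budget_ms]
    return max(fits, key=lambda m: m[2]) if fits else None

def pareto_front(models):
    """Keep models that no other model beats on both latency and accuracy."""
    front = []
    for m in models:
        dominated = any(
            o[1] <= m[1] and o[2] >= m[2] and (o[1] < m[1] or o[2] > m[2])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m[1])

print(pick_scale(5.0))    # a 5 ms budget only admits the nano model
print(pick_scale(20.0))   # a 20 ms budget admits the x-large model
print(pareto_front(REPORTED))
```

Once the remaining scales' measurements are added to the list, the same two functions show whether each one earns its place on the front.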

The open-vocabulary sibling, YOLOE-26x, hits 40.6 AP on LVIS minival under text prompting. No visual examples or bounding box prompts are required, just a class name. For applications that need to detect arbitrary objects without retraining, this matters.

Code and models ship immediately. YOLO26 is not a paper-only promise; you can download the weights and run them today. If you're building a real-time vision pipeline that must run on edge hardware without sacrificing accuracy, this is the new baseline to beat.




Source: Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
Domain: arxiv.org
