
Xiaomi's MiMo 1T Model Breaks 1000 Tokens/s on Standard GPUs

Xiaomi MiMo-V2.5-Pro-UltraSpeed reaches 1000 tokens per second on a 1-trillion-parameter model using only a commodity 8-GPU node, through FP4 quantization and DFlash speculative decoding.

xiaomi, mimo, tilert, fp4 quantization, dflash speculative decoding, large language models

Xiaomi just pushed a 1-trillion-parameter model to 1000 tokens per second decode speed, using nothing more exotic than an 8-GPU commodity server.

FP4 Quantization and DFlash: The Secret Sauce

Conventional wisdom says that hitting this speed at 1T scale demands specialized silicon such as Cerebras wafers or Groq's SRAM farms. Xiaomi's MiMo team, working with TileRT, went the opposite direction: model-system codesign on off-the-shelf hardware. On the model side, FP4 quantization (MXFP4 format) slashes memory footprint and bandwidth pressure, and it is paired with DFlash, a block-level masked parallel prediction method for speculative decoding. On the system side, TileRT's compilation engine and custom kernels are tuned for exactly that pipeline. The result: one 8-GPU node, 1000+ tokens per second (tps).
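To make the two model-side ideas concrete, here is a minimal Python sketch. It is not Xiaomi's or TileRT's code. The first half follows the public MXFP4 layout (32 four-bit E2M1 elements sharing one 8-bit power-of-two scale) with simplified rounding. The second half shows generic greedy verification of a drafted token block, the acceptance step common to speculative decoding; how DFlash actually produces its draft block is not modeled. The names `quantize_block_mxfp4`, `mxfp4_weight_bytes`, and `verify_draft_block` are hypothetical.

```python
import math

# Magnitudes representable by one E2M1 (FP4) element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements that share one scale in MXFP4
ELEM_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block_mxfp4(values):
    """Quantize one block to (shared power-of-two exponent, FP4 values).

    Simplified: ties go to the smaller magnitude rather than round-to-even.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    shared_exp = math.floor(math.log2(max_abs)) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        # Nearest grid point; scaled magnitudes above 6.0 saturate at 6.0.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(math.copysign(mag, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate floats from a quantized block."""
    return [q * 2.0 ** shared_exp for q in quantized]


def mxfp4_weight_bytes(num_params):
    """Weight storage: 4 bits per element plus one 8-bit scale per block."""
    bits_per_param = 4 + 8 / BLOCK_SIZE  # 4.25 bits
    return num_params * bits_per_param / 8


def verify_draft_block(prefix, draft_block, target_next_token):
    """Greedy verification of a drafted block.

    Keeps the longest run of draft tokens the target model agrees with, then
    appends the target's own token. A real system scores every position in
    one forward pass; this loop calls the target step by step for clarity.
    """
    accepted = []
    for proposed in draft_block:
        expected = target_next_token(prefix + accepted)
        if proposed != expected:
            accepted.append(expected)  # correction token ends the step
            return accepted
        accepted.append(proposed)
    accepted.append(target_next_token(prefix + accepted))  # bonus token
    return accepted


if __name__ == "__main__":
    gb = mxfp4_weight_bytes(10**12) / 1e9
    print(f"1T params in MXFP4: {gb:.2f} GB total, {gb / 8:.2f} GB per GPU")

    # Toy target model (an assumption for the demo): counts upward modulo 10.
    toy_target = lambda tokens: (tokens[-1] + 1) % 10
    print(verify_draft_block([1, 2, 3], [4, 5, 9, 7], toy_target))
```

Under these assumptions, the weights alone take about 531 GB, or roughly 66 GB per GPU on an 8-GPU node, before activations and KV cache. Each verification step emits between one token and the draft block length plus one, which is where the decode speedup comes from when the draft is usually right.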

What 1000 tps Actually Unlocks

Speed at this scale shifts from convenience to capability. Real-time Best-of-N or tree search becomes viable within the wall-clock time a single response used to take: the model can run dozens of reasoning paths, self-verify, and correct without the user waiting. Coding agents stop being a bottleneck; developers no longer have to stare at a streaming cursor. More critically, trillion-parameter models can now plug into millisecond decision loops: high-frequency trading signals, anti-fraud interception, surgical assistance with real-time imaging analysis. Xiaomi explicitly frames this as "a chip in the race against death."
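The wall-clock argument can be checked with simple arithmetic. The sketch below assumes a 2000-token reasoning path (an illustrative figure, not from Xiaomi), a 100 tps baseline inferred from the article's "10× the speed" claim, and paths generated one after another on a single stream. The function name is hypothetical.

```python
def sequential_samples_in_budget(budget_seconds, tokens_per_sample, tokens_per_second):
    """How many full samples fit in a time budget when decoded back to back."""
    seconds_per_sample = tokens_per_sample / tokens_per_second
    return int(budget_seconds // seconds_per_sample)


if __name__ == "__main__":
    budget = 20.0       # seconds: one 2000-token answer at the assumed baseline
    path_tokens = 2000  # assumed length of one reasoning path
    for tps in (100, 1000):
        n = sequential_samples_in_budget(budget, path_tokens, tps)
        print(f"{tps} tps: {n} path(s) in {budget:.0f} s")
```

With these numbers, the time that yields one answer at the baseline yields ten candidate paths at 1000 tps. Batching paths in parallel would change the count, so treat this as an estimate for a single decode stream only.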

Limited Trial, Real Constraints

The UltraSpeed API trial runs June 9–23, 2026, priced at 3× the cost of MiMo-V2.5-Pro for 10× the speed. Access is application-based and prioritized for enterprises and professional developers with business needs. Chat users get 10 queue entries per day, with a 30-minute session cap and a 5-minute idle timeout. This isn't a wide rollout; it's a stress test with real customers.

If Xiaomi's codesign approach generalizes to larger models or longer contexts without hardware upgrades, real-time trillion-parameter inference on standard infrastructure becomes tangible.


Source: MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second
Domain: mimo.xiaomi.com
