
KV-Block-Major Kernel Boosts MiniMax M3 Throughput by up to 125%

MiniMax M3's sparse attention makes decoding 15x faster, and Together AI's engineering team squeezed out another 81–125% of throughput by moving KV block iteration into the outer loop.

minimaxm3, together ai, sparse attention, long context, inference optimization

Together AI's inference team reorganized the attention loop to iterate over KV blocks first, eliminating redundant HBM-to-SRAM movement and yielding 81–125% higher throughput for MiniMax M3 at 1M context.

Why MSA Demands a Kernel Rewrite

MiniMax M3's novel architecture, MiniMax Sparse Attention (MSA), replaces standard quadratic (N²) attention with a block-sparse mechanism. Each query attends only to a selected set of key-value blocks, capped by a maximum token count. That alone gives 9x prefilling and 15x decoding speedups over the previous generation. The remaining engineering challenge is data movement: when multiple queries from the same KV group select the same key-value blocks, naively iterating over queries reloads those blocks from HBM to SRAM once per query. At 1M context, those redundant loads add up fast.
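To make the redundancy concrete, here is a minimal Python sketch that counts KV block fetches for a toy selection pattern. The function name `count_kv_block_loads` and the selection data are hypothetical, chosen only for illustration; real MSA block selection and group sizes are not described here.

```python
def count_kv_block_loads(selected_blocks):
    """selected_blocks[q] is the set of KV block ids that query q attends to.

    Returns (loads under query-major iteration, number of distinct blocks).
    """
    # Query-major: every query fetches each of its selected blocks itself.
    query_major = sum(len(blocks) for blocks in selected_blocks)
    # Lower bound: each distinct block is fetched only once.
    unique = len(set().union(*selected_blocks)) if selected_blocks else 0
    return query_major, unique


# Hypothetical toy case: 4 queries in one KV group, each selecting 3 blocks.
selection = [{0, 5, 9}, {0, 5, 7}, {0, 5, 9}, {5, 7, 9}]
query_major_loads, unique_loads = count_kv_block_loads(selection)
print(f"query-major loads: {query_major_loads}, distinct blocks: {unique_loads}")
```

In this toy case, query-major iteration fetches 12 blocks while only 4 distinct blocks exist, so two thirds of the traffic is repeated. The more the queries' selections overlap, the larger that gap becomes.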

KV-Block-Major: Flipping the Loop to Save Bandwidth

Together AI's solution swaps the loop order. Instead of iterating over queries and fetching their relevant KV blocks, the new kernel iterates over KV blocks in the outer loop and calculates attention for all matching query tokens in the inner loop. With this ordering, each KV block is loaded from HBM once and then reused across every query that selected it. The catch is that a query's output is now built up across several KV blocks, so the partial output vectors must be combined using log-sum-exp (LSE) rescaling, but the gain in arithmetic intensity dwarfs that overhead. The kernel is also compatible with paged attention, so it fits into production inference engines.
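The loop swap and the LSE merge can be sketched in plain Python. This is an illustrative toy under stated assumptions, not Together AI's kernel: the names `kv_block_major_attention`, `reference_attention`, and `selected` are hypothetical, score scaling and masking are omitted, every query is assumed to select at least one block, and partials are merged with a running (online-softmax style) rescale rather than whatever reduction scheme the real GPU kernel uses.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, keys, values):
    """Plain softmax attention of one query over a flat list of keys/values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def kv_block_major_attention(queries, k_blocks, v_blocks, selected):
    """Outer loop over KV blocks, inner loop over the queries that selected them.

    selected[q] is the set of block ids query q attends to (assumed non-empty).
    Returns (outputs, number of block loads).
    """
    dim = len(v_blocks[0][0])
    run_max = [-math.inf] * len(queries)   # running max score per query
    run_sum = [0.0] * len(queries)         # running softmax denominator
    acc = [[0.0] * dim for _ in queries]   # running unnormalized output
    loads = 0
    for b, (keys, values) in enumerate(zip(k_blocks, v_blocks)):
        users = [qi for qi, blocks in enumerate(selected) if b in blocks]
        if not users:
            continue
        loads += 1  # block b is "fetched" once for all queries that use it
        for qi in users:
            scores = [dot(queries[qi], k) for k in keys]
            new_max = max(run_max[qi], max(scores))
            scale = math.exp(run_max[qi] - new_max)  # rescale earlier partials
            weights = [math.exp(s - new_max) for s in scores]
            run_sum[qi] = run_sum[qi] * scale + sum(weights)
            acc[qi] = [a * scale + sum(w * v[d] for w, v in zip(weights, values))
                       for d, a in enumerate(acc[qi])]
            run_max[qi] = new_max
    outputs = [[a / s for a in row] for row, s in zip(acc, run_sum)]
    return outputs, loads


def rand_vec(dim):
    return [random.gauss(0, 1) for _ in range(dim)]


random.seed(0)
dim, block_size, num_blocks = 4, 3, 5
k_blocks = [[rand_vec(dim) for _ in range(block_size)] for _ in range(num_blocks)]
v_blocks = [[rand_vec(dim) for _ in range(block_size)] for _ in range(num_blocks)]
queries = [rand_vec(dim) for _ in range(4)]
selected = [{0, 2}, {0, 2, 4}, {2, 4}, {0, 4}]  # hypothetical block selection

outputs, loads = kv_block_major_attention(queries, k_blocks, v_blocks, selected)

# Compare against ordinary attention over each query's selected tokens.
max_err = 0.0
for qi, blocks in enumerate(selected):
    keys = [k for b in sorted(blocks) for k in k_blocks[b]]
    values = [v for b in sorted(blocks) for v in v_blocks[b]]
    ref = reference_attention(queries[qi], keys, values)
    max_err = max(max_err, max(abs(x - y) for x, y in zip(outputs[qi], ref)))

print(f"block loads: {loads}, max abs error vs reference: {max_err:.2e}")
```

The rescale by `exp(old_max - new_max)` is what lets partial results from different blocks be combined without changing the final softmax. In this toy setup the block-major loop touches 3 blocks, whereas query-major iteration over the same selection would fetch 9.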

Production Numbers: 125% Throughput Gain on B200

Measured under agentic-style traffic with a 60K prefix cache and concurrency 8 on NVIDIA B200, the KV-Block-Major kernel significantly cuts the share of per-iteration wall time spent on attention computation. Together AI reports throughput improvements between 81% and 125% across concurrency levels. These gains come not from model compression or quantization but from a kernel-level optimization of the memory access pattern.
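Percentage gains are easy to misread, so the short sketch below converts the reported 81% and 125% figures into throughput multipliers and the implied drop in time per unit of work. The helper names are hypothetical, and the time figure assumes the same hardware and workload before and after.

```python
def throughput_multiplier(percent_gain):
    """A gain of X% means new throughput = (1 + X/100) * old throughput."""
    return 1.0 + percent_gain / 100.0


def time_per_unit_reduction(percent_gain):
    """Fraction of time saved per unit of work at the higher throughput."""
    return 1.0 - 1.0 / throughput_multiplier(percent_gain)


for gain in (81, 125):
    mult = throughput_multiplier(gain)
    saved = time_per_unit_reduction(gain)
    print(f"+{gain}% throughput = {mult:.2f}x, {saved:.1%} less time per unit of work")
```

So a 125% gain means 2.25x the original throughput, not 1.25x.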

Multimodal Preprocessing Gets a Rust Gateway

Handling video and images alongside 1M-token text streams requires more than just a fast attention kernel. Together AI built a Rust-based multimodal preprocessing gateway to handle the complexity of image/video tokenization, decoding, and resizing before they hit the model server. That frontend work, combined with the sparse attention kernel, makes M3's native multimodality production-ready without bottlenecking the pipeline.

Together AI will host MiniMax M3 as a developer endpoint once the open-weights release goes live. The optimizations here, particularly the KV-Block-Major kernel, are directly applicable to any sparse attention model, so future long-context systems can reuse this loop-reordering technique rather than designing their own.


Source: Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets
Domain: together.ai
