Mellum2 activates only 2.5B of its 12B parameters per token (about 21%), which cuts per-token compute for high-throughput, low-latency inference workloads.
JetBrains has released Mellum2, a 12B-parameter Mixture-of-Experts (MoE) model trained from scratch on natural language and code. While the total parameter count sits at 12B, the MoE architecture ensures that only a small subset of parameters is used for each token, making it an ideal candidate for production environments where serving costs and latency are critical constraints.
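To make the "small subset of parameters per token" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind sparse MoE layers. It is a toy illustration and not Mellum2's actual architecture: the expert count, the value of k, the hidden size, and every function name below are assumptions chosen for readability. Only the 2.5B and 12B figures come from the article.

```python
import math
import random

NUM_EXPERTS = 8   # hypothetical; not Mellum2's actual expert count
TOP_K = 2         # hypothetical; number of experts activated per token
DIM = 4           # toy hidden size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def moe_forward(token, router, experts, top_k=TOP_K):
    """Route one token to its top_k experts and mix their outputs.

    Returns the mixed output and the indices of the experts that ran.
    """
    gate_scores = matvec(router, token)  # one score per expert
    chosen = sorted(
        range(len(experts)), key=lambda i: gate_scores[i], reverse=True
    )[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    weights = softmax([gate_scores[i] for i in chosen])
    output = [0.0] * len(token)
    for weight, idx in zip(weights, chosen):
        expert_out = matvec(experts[idx], token)  # unchosen experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


rng = random.Random(0)
router = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [
    [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]
token = [0.5, -0.2, 0.1, 0.9]

output, chosen = moe_forward(token, router, experts)
active_fraction = 2.5e9 / 12e9  # figures quoted in the article
print(f"experts used: {len(chosen)} of {NUM_EXPERTS}")
print(f"active parameter share: {active_fraction:.1%}")
```

In this toy, each token pays for two expert evaluations instead of eight. The same principle is why a 12B-parameter model can have per-token compute closer to that of a much smaller dense model.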
Efficient Inference for Agentic Workflows
Modern AI architectures are moving away from monolithic frontier models toward multi-model systems. Production stacks increasingly require several specialized components—routers, retrievers, validators, and tool callers—to work in concert. Mellum2 targets these high-frequency, latency-sensitive operations.
Mellum2 delivers more than 2x faster inference than similar-sized models, which suits it to the role of a "focal" model within larger systems. It is specifically optimized for tasks like prompt classification, tool selection, and intermediate control-flow steps in agentic workflows. Instead of invoking a massive reasoning model for every subtask, engineers can use Mellum2 for planning, validation, and context preparation, and reserve the larger model for the steps that need it. This reduces the total compute footprint of the system.
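The division of labor described above can be sketched as a simple dispatcher that sends lightweight subtasks to a small model and escalates everything else. This is an illustrative pattern only: the `Dispatcher` class, the task labels, and the stub backends are hypothetical, and in a real system the two callables would wrap whatever inference clients the team already uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task labels, based on the task types the article lists.
LIGHTWEIGHT_TASKS = {
    "prompt_classification",
    "tool_selection",
    "planning",
    "validation",
    "context_preparation",
}


@dataclass
class Dispatcher:
    small_model: Callable[[str], str]  # stand-in for a Mellum2-backed client
    large_model: Callable[[str], str]  # stand-in for a large reasoning model

    def run(self, task_type: str, prompt: str) -> str:
        """Send lightweight subtasks to the small model; escalate the rest."""
        if task_type in LIGHTWEIGHT_TASKS:
            return self.small_model(prompt)
        return self.large_model(prompt)


# Stub backends so the sketch runs without any model server.
def fake_small(prompt: str) -> str:
    return f"[small] {prompt}"


def fake_large(prompt: str) -> str:
    return f"[large] {prompt}"


dispatcher = Dispatcher(small_model=fake_small, large_model=fake_large)
print(dispatcher.run("tool_selection", "Which tool fetches the build log?"))
print(dispatcher.run("deep_reasoning", "Design a migration plan."))
```

Keeping the routing rule as a plain set lookup makes the cost boundary explicit: anything not listed as lightweight falls through to the expensive model by default.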
Specialized Text and Code Capabilities
Unlike multimodal models that sacrifice density for breadth, Mellum2 focuses strictly on text and code. This specialization allows it to remain compact and highly effective for software engineering tasks, including RAG pipelines, context compression, and retrieval post-processing.
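As a rough illustration of retrieval post-processing and context compression, the sketch below keeps the highest-scoring retrieved chunks that fit within a word budget and returns them in their original order. The `compress_context` helper and the overlap-based scorer are assumptions made for this example. In practice the scoring step is where a compact text-and-code model would be called, and the budget would be counted in tokens rather than words.

```python
from typing import Callable, List


def compress_context(
    chunks: List[str],
    score: Callable[[str], float],
    budget_words: int,
) -> List[str]:
    """Keep the highest-scoring chunks that fit in budget_words.

    Chunks are selected greedily by score, then returned in original order.
    """
    ranked = sorted(
        enumerate(chunks), key=lambda pair: score(pair[1]), reverse=True
    )
    kept_indices = []
    used = 0
    for index, chunk in ranked:
        cost = len(chunk.split())
        if used + cost <= budget_words:
            kept_indices.append(index)
            used += cost
    return [chunks[i] for i in sorted(kept_indices)]


def make_overlap_scorer(query: str) -> Callable[[str], float]:
    """Hypothetical relevance scorer: counts words shared with the query."""
    terms = set(query.lower().split())
    return lambda chunk: float(len(terms & set(chunk.lower().split())))


chunks = [
    "def parse_config(path): loads the YAML config file",
    "release notes for version 2.1",
    "parse_config raises ValueError when the config file is missing",
]
scorer = make_overlap_scorer("config file parse_config error")
selected = compress_context(chunks, scorer, budget_words=18)
print(selected)
```

With an 18-word budget, the two chunks that mention the config file are kept and the unrelated release note is dropped.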
For teams handling proprietary codebases, the Apache 2.0 license and the model's efficient footprint make it a prime candidate for private, self-hosted deployments. This enables high-speed coding features, such as those found in IDEs, to run on internal infrastructure without the latency or privacy risks associated with external APIs.
JetBrains' release of Mellum2 signals a shift toward well-scoped, high-speed models that make complex AI stacks faster, cheaper, and easier to control.
Source: Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
Domain: huggingface.co