Cohere's North Mini Code Posts 80% pass@10 on SWE-Bench Verified With 3B Active Parameters

Cohere's 30B-parameter MoE model, with only 3B active parameters, scores 33.4 on the Artificial Analysis Coding Index, outperforming models with several times its active parameters, and achieves 80.2% pass@10 on SWE-Bench Verified.


Cohere's North Mini Code, a 30B-parameter Mixture-of-Experts (MoE) model with just 3B active parameters, achieves 80.2% pass@10 on SWE-Bench Verified, ahead of many models several times its size. On the Artificial Analysis Coding Index it scores 33.4, beating Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B dense), and even much larger models such as Nemotron 3 Super (120B-A12B) and Mistral Small 4 (119B-A6B). This is not a fluke: the model was trained for agentic coding from the ground up, and it shows.

Architecture: 128 Experts, Interleaved Attention, and Careful Routing

North Mini Code is a decoder-only sparse MoE Transformer with 128 experts, 8 of which are activated per token. The feed-forward block uses SwiGLU activation, and the router applies a sigmoid before top-k selection. Because each expert is scored independently rather than competing through a softmax, training is more stable. Attention layers are interleaved in a 3:1 ratio: three sliding-window layers with RoPE for every global-attention layer with no positional embeddings. A single dense layer precedes the sparse layers. The interleaved attention design balances local context efficiency with long-range reasoning, which is critical for agentic tool calls spanning many turns.
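The routing and layer interleaving described above can be illustrated with a minimal sketch. This is not Cohere's implementation: the expert count (128) and top-k (8) come from the article, but the renormalization of the selected gates, the position of the global layer within each group of four, the layer count used in the demo, and every function name here are assumptions made purely for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: gate_weight} for the top_k experts of one token."""
    # Each expert is scored independently: unlike softmax, raising one
    # expert's score does not push the other scores down.
    scores = [sigmoid(z) for z in router_logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected gates so they sum to 1.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def layer_layout(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Label attention layers in a 3:1 pattern (hypothetical ordering)."""
    period = local_per_global + 1
    return [
        "global_nope" if (i + 1) % period == 0 else "sliding_rope"
        for i in range(num_layers)
    ]


# Demo with random router logits for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(logits)
print(len(gates), round(sum(gates.values()), 6))
print(layer_layout(8))  # 8 layers is an arbitrary demo value
```

Only the 8 selected experts run for a given token, which is how a 30B-parameter model ends up with roughly 3B active parameters per forward pass.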

Two-Stage SFT and RLVR: 70K Verified Tasks Across 5K Repos

Cohere post-trained North Mini Code using a cascade. First-stage SFT uses a mix in which code is 70% of trainable tokens (43% agentic tool-use, 27% competitive programming). Second-stage SFT uses a 4.5B-token mixture of only agentic and reasoning samples (61% code). Both stages are followed by reinforcement learning with verifiable rewards (RLVR). The data pipeline relies on containerized agentic coding environments: over 70K verifiable tasks from roughly 5K real-world repositories, deduplicated against SWE-Bench and SWE-Bench-Pro to avoid evaluation leakage. Context length increases from 64K to 128K across stages. Cohere reports that training on a near-complete length distribution produced shorter final trajectories, so the team deliberately truncated to 64K first, then extended. The final SFT model hits 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2.
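For readers unfamiliar with the pass@10 metric quoted above, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts sampled per task and c is the number that passed. The article does not say how Cohere computed its figures, so treat this as the standard definition rather than their exact procedure. The per-task results in the demo are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn from n passes,
    given that c of the n attempts passed."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_attempts, n_passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task results, for illustration only.
demo = [(20, 1), (20, 0), (20, 20), (20, 5)]
print(round(benchmark_pass_at_k(demo, k=10), 4))
```

Because pass@10 counts a task as solved if any of ten attempts succeeds, it is always at least as high as the single-attempt pass@1 rate on the same runs.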

Robustness Across Harnesses, Not Just One Scaffold

Real-world code agents don't live inside a single scaffold. North Mini Code was trained using multiple agent harnesses, including SWE-Agent and OpenCode, so it generalizes across different tool-use modalities (bash, str_replace_editor, etc.). Cohere explicitly avoided over-optimizing for one benchmark or harness. The model is released under Apache 2.0 on Hugging Face, making it immediately usable for developers building their own coding agents. North Mini Code is the first in Cohere's new model family; expect this architecture and training methodology to scale to larger models and perhaps even multi-modal agents later this year.


Source: Introducing North Mini Code: Cohere's First Model For Developers
Domain: huggingface.co
