VibeThinker-3B: Compact Model Matches Giants on Math and Code Benchmarks

A 3-billion-parameter model is reported to keep pace with far larger systems on some of the hardest math and coding benchmarks. VibeThinker-3B scores 94.3 on AIME26, rising to 97.1 with claim-level test-time scaling. On LiveCodeBench v6 it reaches 80.2 Pass@1. Those numbers place it in the same performance band as DeepSeek V3.2, GLM-5, and Gemini 3 Pro - models that are orders of magnitude larger.
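For readers unfamiliar with the metric: Pass@1 is the chance that a single sampled solution passes every test for a problem, averaged over the benchmark. Here is a minimal sketch of the standard unbiased pass@k estimator, assuming n samples are drawn per problem and c of them pass. The source does not say how many samples were drawn or how the VibeThinker-3B evaluation was run, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem tallies, not data from the paper.
    tallies = [(10, 10), (10, 7), (10, 0)]
    print(f"pass@1 = {benchmark_pass_at_k(tallies, k=1):.3f}")
```

With k = 1 the estimator reduces to c / n per problem, so a Pass@1 of 80.2 means that on average about four in five single attempts succeed.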

How They Squeezed Frontier Reasoning Into 3B Parameters

The team behind VibeThinker-3B didn't just throw more data at a small model. They built on their earlier 1.5B work and applied a structured post-training pipeline they call Spectrum-to-Signal. The recipe: curriculum-based supervised fine-tuning followed by multi-domain reinforcement learning, then a final offline self-distillation pass. The result is a dense 3B model that doesn't compromise on instruction following - it scores 93.4 on IFEval, which suggests the reasoning-focused training doesn't come at the cost of controllability.

The Parametric Compression-Coverage Hypothesis

Here's the interesting part. The authors propose that verifiable reasoning - math, code, logic - is compressible into compact reasoning cores, whereas open-domain knowledge, facts, and long-tail concepts require broad parameter coverage. If the hypothesis holds, small models aren't just efficient deployment targets; they are a complementary path to frontier capability on tasks where reasoning can be packed densely into few parameters. As supporting evidence, the model reaches a 96.1% acceptance rate on recent unseen LeetCode contests, which suggests it generalizes to new problems rather than just memorizing benchmarks.

What Makes This Different From Distillation

This isn't another distillation paper where a tiny model approximates a giant teacher. VibeThinker-3B is trained from the base model with supervised fine-tuning and reinforcement learning with verifiable rewards (GRPO), a combination meant to teach grounded reasoning. The only distillation step in the pipeline is self-distillation, where the model learns from its own outputs rather than from a larger teacher. The claim that it matches or exceeds DeepSeek V3.2 and Gemini 3 Pro is a direct challenge to the assumption that you need hundreds of billions of parameters for hard math and coding.
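GRPO scores each sampled answer relative to the other answers drawn for the same prompt, so no separate value model is needed. The sketch below shows that group-relative advantage step in its commonly described form, assuming binary verifiable rewards (1 if the answer checks out, 0 otherwise). The function name and the epsilon value are illustrative, and the source does not specify which GRPO variant or reward design VibeThinker-3B uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Samples that
    beat the group average get a positive advantage, the rest a negative one.
    eps (an assumed small constant) avoids division by zero when every sample
    in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group: 4 sampled solutions, 2 pass the verifier.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every sample fails carries no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

One consequence of this normalization is that prompts the model always solves or always fails contribute zero advantage, so the training signal concentrates on problems at the edge of the model's ability.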

The Bottom Line for Practitioners

If you need a model that can do serious reasoning but can't afford a cluster to run a 671B MoE, VibeThinker-3B is the kind of result that makes you rethink your architecture choices. The code and weights aren't released yet - watch the arXiv page. In the meantime, the Parametric Compression-Coverage Hypothesis gives a framework for deciding when compact models make sense: for verifiable tasks, not for encyclopedic knowledge.

Source: VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO
Domain: arxiv.org