Google DeepMind stripped the dedicated multimodal encoders out of Gemma 4 12B, and the model still benchmarks within striking distance of its 26B MoE sibling. That's not hype: the blog post describes it as "nearing our 26B model" on standard benchmarks, at less than half the total memory footprint. There is no separate vision encoder and no audio encoder, just a lightweight embedding module for images and a direct projection of raw audio into token space.
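The "less than half" figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is only an estimate under stated assumptions: parameter counts are read off the model names as round numbers, only weights are counted (no KV cache, activations, or runtime overhead), and the precisions listed are illustrative, since the announcement does not say which one its memory figures assume.

```python
# Back-of-envelope weight memory. Parameter counts are read off the model
# names as round numbers (an assumption), and only weights are counted.

def weight_gb(params_billion, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return params_billion * bits_per_param / 8

MODELS = {"Gemma 4 12B": 12, "Gemma 4 26B MoE (all experts resident)": 26}
PRECISIONS = {"16-bit": 16, "8-bit": 8, "4-bit": 4}  # illustrative choices

for name, params in MODELS.items():
    row = ", ".join(f"{label}: {weight_gb(params, bits):.0f} GB"
                    for label, bits in PRECISIONS.items())
    print(f"{name:40s} {row}")

# At equal precision the ratio depends only on the parameter counts.
ratio = weight_gb(12, 16) / weight_gb(26, 16)
print(f"12B / 26B weight ratio: {ratio:.2f}")
```

At equal precision the ratio comes out to roughly 0.46, which is consistent with the "less than half" claim. The same arithmetic puts 16-bit weights for a 12B model at about 24 GB, which suggests that laptop-class inference would rely on a quantized build; that is an inference from the arithmetic, not something the announcement states.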
Encoder-Free Architecture Cuts Latency and Memory
Traditional multimodal models bolt on separate encoders for vision and audio, each adding latency and consuming VRAM. Gemma 4 12B replaces the vision encoder with a single matrix multiplication plus positional embeddings and normalizations. Audio processing is even simpler: the raw waveform is projected straight into the same dimensional space as text tokens. The end result: 16GB of unified memory or VRAM is enough for local inference on a laptop, a concrete figure you can test today.
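To make the idea concrete, here is a minimal pure-Python sketch of what "a matrix multiplication plus positional embeddings and normalizations" could look like. It is not Gemma's actual code: the patch size, frame length, model width, random weights, and the use of an RMS-style normalization are all assumptions chosen for illustration.

```python
import math
import random

def project(vec, weight):
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(x * w for x, w in zip(vec, row)) for row in weight]

def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch vectors."""
    h, w = len(image), len(image[0])
    return [[image[top + r][left + c] for r in range(patch) for c in range(patch)]
            for top in range(0, h - patch + 1, patch)
            for left in range(0, w - patch + 1, patch)]

def embed_image(image, patch, weight, pos_emb):
    """Project each patch, add its positional embedding, then normalize."""
    tokens = []
    for i, flat in enumerate(patchify(image, patch)):
        projected = project(flat, weight)
        tokens.append(rms_norm([a + b for a, b in zip(projected, pos_emb[i])]))
    return tokens

def embed_audio(waveform, frame, weight):
    """Cut the raw waveform into fixed frames and project each one directly."""
    frames = [waveform[i:i + frame]
              for i in range(0, len(waveform) - frame + 1, frame)]
    return [project(f, weight) for f in frames]

# Toy sizes and random weights, for illustration only.
rng = random.Random(0)
PATCH, FRAME, D_MODEL = 4, 16, 8
image = [[rng.random() for _ in range(8)] for _ in range(8)]
img_weight = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
pos_emb = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(4)]
waveform = [math.sin(0.1 * n) for n in range(64)]
aud_weight = [[rng.gauss(0, 0.1) for _ in range(FRAME)] for _ in range(D_MODEL)]

image_tokens = embed_image(image, PATCH, img_weight, pos_emb)
audio_tokens = embed_audio(waveform, FRAME, aud_weight)
print(len(image_tokens), len(image_tokens[0]))  # 4 patches, each D_MODEL wide
print(len(audio_tokens), len(audio_tokens[0]))  # 4 frames, each D_MODEL wide
```

Both modalities end up as vectors of the same width as text token embeddings, so they can sit in one sequence without a separate encoder stack in front of the model.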
Benchmarks Near 26B on Half the Memory
Performance numbers are what make this interesting. Gemma 4 12B claims benchmark results near the 26B Mixture-of-Experts model at a fraction of the GPU budget. On the speed side, it ships with built-in Multi-Token Prediction (MTP) drafters that reduce latency during generation. For anyone running autonomous agents, lower latency means faster tool-calling loops and a better user experience. The model is drafter-ready out of the box.
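The announcement does not spell out how the MTP drafters work, so the following is a generic greedy draft-and-verify loop, offered as a sketch of why drafting can cut latency rather than as Gemma's implementation. `target_next` and `draft_next` are hypothetical toy stand-ins for the full model and a cheap drafter, and the usual "bonus token" after a fully accepted draft is omitted for brevity.

```python
def target_next(context):
    """Toy stand-in for the full model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Toy stand-in for a cheap drafter that is right most of the time."""
    guess = target_next(context)  # toy shortcut; a real drafter is a separate head
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_baseline(prompt, n_new):
    """Plain decoding: one full-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def generate_with_drafter(prompt, n_new, k=4):
    """Draft k tokens cheaply, then keep the prefix the full model agrees with."""
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < n_new:
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # In a real model, one batched forward pass scores all k positions.
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # take the full model's token and stop
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], verify_passes

prompt = [1, 2, 3]
drafted, passes = generate_with_drafter(prompt, n_new=12)
print(drafted == greedy_baseline(prompt, 12), passes)
```

The output matches plain greedy decoding token for token, but the full model is consulted once per accepted batch rather than once per token. How much that saves in practice depends on how often the drafter's guesses are accepted.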
Open Ecosystem and Tools for Local Deployment
The Apache 2.0 license permits commercial use and redistribution, subject to its standard attribution and notice terms. Weights are up on Hugging Face and Kaggle. Developer integrations cover LM Studio, Ollama, llama.cpp, MLX, SGLang, vLLM, and Unsloth for fine-tuning. Google also released a Skills Repository for agentic development: think pre-built primitives for building on top of Gemma. And if you want production endpoints, Google Cloud's Model Garden and Cloud Run are listed as deployment options.
Gemma 4 models have crossed 150 million downloads as a family. With 12B landing on laptops today, that growth looks likely to accelerate. The real shift is this: encoder-free multimodal reasoning is no longer a server-room privilege.
Source: Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Domain: deepmind.google