Google DeepMind’s Language Model Interpretability team built a model diffing agent that systematically finds behavioral differences between two LLMs. It works well enough to surface nearly 50 differences between Gemini 2.5 Pro and Gemini 3 Pro across 50 runs, including which Fibonacci algorithm each model prefers.
How the diffing agent works
The agent is a simple scaffolded LLM that can send prompts to two models (A and B), request up to 5 parallel samples per prompt, and iterate over 10 turns. It’s instructed to adopt a skeptical mindset—null hypothesis: models are identical—and to only report differences that are systematic, general, interesting, appropriately abstract, and conditional. Once it finds a candidate, it actively red-teams its own hypothesis before reporting. The agent never sees model internals, just outputs.
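The loop described above can be sketched roughly as follows. This is a minimal illustration, not the team's scaffold: the `Model` type, `Observation`, `Transcript`, `sample_pair`, `run_diffing`, and the `propose_prompt` callback (which stands in for the investigating LLM) are all hypothetical names. Only the 10-turn budget and the 5-sample cap come from the post, and samples are drawn sequentially here for simplicity rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical type: a model is any callable that maps a prompt to one sampled reply.
Model = Callable[[str], str]

MAX_TURNS = 10    # turn budget stated in the post
MAX_SAMPLES = 5   # per-prompt sample cap stated in the post


@dataclass
class Observation:
    """Outputs from both models for a single prompt."""
    prompt: str
    samples_a: List[str]
    samples_b: List[str]


@dataclass
class Transcript:
    """Everything the investigator has seen so far (outputs only, no internals)."""
    observations: List[Observation] = field(default_factory=list)


def sample_pair(model_a: Model, model_b: Model, prompt: str, n: int) -> Observation:
    """Draw n samples from each model, clamping n to the range 1..MAX_SAMPLES."""
    n = max(1, min(n, MAX_SAMPLES))
    return Observation(
        prompt=prompt,
        samples_a=[model_a(prompt) for _ in range(n)],
        samples_b=[model_b(prompt) for _ in range(n)],
    )


def run_diffing(
    model_a: Model,
    model_b: Model,
    propose_prompt: Callable[[Transcript], Optional[str]],
    n_samples: int = MAX_SAMPLES,
) -> Transcript:
    """Run up to MAX_TURNS rounds of propose-then-sample.

    propose_prompt is an assumed stand-in for the investigator LLM: it reads
    the transcript and returns the next prompt, or None to stop early.
    """
    transcript = Transcript()
    for _ in range(MAX_TURNS):
        prompt = propose_prompt(transcript)
        if prompt is None:
            break
        transcript.observations.append(
            sample_pair(model_a, model_b, prompt, n_samples)
        )
    return transcript
```

In a real scaffold the investigator would also form a hypothesis from the transcript and red-team it before reporting; that step is left out of this sketch because the post does not specify how it is implemented.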
Real differences that hold up to spot checks
Across 50 seeds, the agent produced nearly 50 findings. The team asked Gemini 3.1 Pro to pick the top three for each model pair. For Gemini 2.5 Pro vs 3 Pro: Model A consistently implements matrix exponentiation for O(log n) Fibonacci, while Model B picks Fast Doubling. Model B inserts emojis to convey enthusiasm; Model A sticks to ALL CAPS and exclamation marks. And Model A appends crisis helpline resources when refusing violent content even when the prompt never mentions self-harm, whereas Model B does so only when self-harm is explicit.
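For readers unfamiliar with the two Fibonacci approaches named above, here is a short sketch of each. These are standard textbook formulations written for illustration, not code produced by either Gemini model; the function names `fib_matrix` and `fib_fast_doubling` are my own. Both take O(log n) arithmetic steps.

```python
def fib_matrix(n: int) -> int:
    """F(n) via exponentiation by squaring of the matrix [[1, 1], [1, 0]]."""
    if n < 0:
        raise ValueError("n must be non-negative")

    def mul(x, y):
        # 2x2 matrices stored row-major as (a, b, c, d).
        return (
            x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
            x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3],
        )

    result = (1, 0, 0, 1)  # identity matrix
    base = (1, 1, 1, 0)
    while n > 0:
        if n & 1:
            result = mul(result, base)
        base = mul(base, base)
        n >>= 1
    # [[1, 1], [1, 0]]^n == [[F(n+1), F(n)], [F(n), F(n-1)]]
    return result[1]


def fib_fast_doubling(n: int) -> int:
    """F(n) via the identities
    F(2k)   = F(k) * (2*F(k+1) - F(k))
    F(2k+1) = F(k)^2 + F(k+1)^2
    """
    if n < 0:
        raise ValueError("n must be non-negative")

    def pair(k: int):
        # Returns (F(k), F(k+1)).
        if k == 0:
            return (0, 1)
        a, b = pair(k >> 1)
        c = a * (2 * b - a)  # F(2m), where m = k // 2
        d = a * a + b * b    # F(2m + 1)
        return (d, c + d) if k & 1 else (c, d)

    return pair(n)[0]
```

Fast doubling follows from the same matrix identity but tracks only two numbers per step instead of four, which is why it is often treated as the leaner variant of the same idea.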
For Gemini 2.0 Flash Lite vs 2.5 Flash Lite, the agent found a systematic trailing newline on Model A and a hallucinated “Key improvements and explanations:” heading after code blocks. For Gemini 3 Flash vs 2.5 Flash, Model A perfectly satisfies exact length constraints while Model B fails and resorts to telegraphic fragments. One of my favorites: Model A suffers creative writing mode collapse, exclusively generating male protagonists and repeatedly using the name “Elias Thorne” for brave-profession stories.
Ground-truth evaluations and the single-model blindspot
The team validated the agent on cases with known ground truth. On identical models, the false positive rate was low. On “system prompted model organisms” (hidden conditional behaviors like “never use the letter e when the prompt contains quotation marks”), the agent often fully or partially identified the trigger and behavior. Crucially, they compared it to a single-model auditing agent that only sees one model’s outputs. For subtle behaviors, such as inverting a LaTeX convention or Python indentation style, the diffing agent consistently outperformed the single-model baseline. A heatmap (Figure 3 in the paper) shows positive scores across nearly every trigger-behavior combination.
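To make the trigger-and-behavior structure of such a model organism concrete, here is a small sketch that checks the example rule quoted above against a set of (prompt, response) pairs. It is an illustration only and not the team's scoring method: `has_trigger`, `shows_behavior`, and `behavior_rates` are hypothetical helpers, and treating only double quotation marks (straight or curly) as the trigger is my assumption.

```python
from typing import Iterable, Optional, Tuple

# Assumption: only double quotes (straight or curly) count as quotation marks.
QUOTE_CHARS = ('"', "\u201c", "\u201d")


def has_trigger(prompt: str) -> bool:
    """True if the prompt contains a quotation mark (the hidden trigger)."""
    return any(q in prompt for q in QUOTE_CHARS)


def shows_behavior(response: str) -> bool:
    """True if the response avoids the letter 'e' entirely (the hidden behavior)."""
    return "e" not in response.lower()


def behavior_rates(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (rate on triggered prompts, rate on untriggered prompts).

    A rate is None when no prompts fall into that group.
    """
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for prompt, response in pairs:
        triggered = has_trigger(prompt)
        totals[triggered] += 1
        hits[triggered] += int(shows_behavior(response))

    def rate(group: bool) -> Optional[float]:
        return hits[group] / totals[group] if totals[group] else None

    return rate(True), rate(False)
```

A conditional behavior shows up as a large gap between the two rates. Running the same check on a reference model that lacks the hidden system prompt gives the contrast a diffing setup can exploit, which a single-model audit does not have.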
The agent failed on a fine-tuned model organism (Gemma-2-9b trained to assume the user is female). It found many real differences—verbosity, safety disclaimers, markdown formatting—but never the intended user-gender behavior. The team argues this is a failure of the model organism, which was slightly “lobotomized” in non-target ways, not a flaw in the diffing approach.
Model diffing won’t catch everything, but for tracking what actually changes between two checkpoints—release candidates, post-training datasets, or RL environments—it’s already finding concrete, verifiable behavioral shifts that standard evaluations miss.
Source: Building and evaluating model diffing agents
Domain: alignmentforum.org