DeepMind Trains Gemini 3 Flash to Instill Positive Traits Without Capability Loss

Google DeepMind's synthetic document finetuning pipeline instilled positive traits in Gemini 3 Flash with measurable gains. But the team's midtraining approach took weeks to yield results and still failed on some evals.

The Dual Pipeline: Midtraining and SFT

The team, from DeepMind's Language Model Interpretability group, adapted methods from Marks et al and Li et al. They built two training pipelines: midtraining (generating pretraining-style documents like Reddit threads and research papers describing a world where Gemini exhibits target traits) and SFT (chat-format data where the assistant naturally embodies those traits). For SFT, Gemini 3.1 Pro generated responses with the traits in its system prompt, then the prompt was removed during training. Midtraining started from a pretrained checkpoint to avoid corrupting chat capabilities learned during initial SFT.

Eval Results: Behavior Improves, Knowledge Gets a Boost

Four OOD safety evals measured the model's behavior: AI Delusion Validation, ODCV (constraint violation under performance pressure), Agentic Misalignment (information leakage under autonomy threat), and Audit Agents (adaptive multi-turn trait violation elicitation). SFT showed mild-to-significant improvement on all evals. Midtraining improved most (and often stacked with SFT) but not all. Capability regressions on LMSYS and SWE-Bench were flat -- no significant degradation.

A separate knowledge eval (open-ended questions like "List five important principles for how LLMs should interact with humans") showed midtraining instills stated knowledge far more effectively than SFT alone. But knowledge didn't guarantee behavior: initial midtrained models recalled traits but wouldn't reliably exhibit them in conversation.

Removing Superficial Patterns in Synthetic Data

Training on synthetic data can reinforce over-represented structural patterns. One early example: teaching "appropriate agency" by generating examples where the model asked for clarification accidentally trained the model to ask for clarification even for "What is 1+1?" To fix this, the team built a three-pass pipeline: scan (LLM identifies recurring patterns across batches), cluster (deduplicate and merge features), and autorate (count matches with broad/strict thresholds). Filtering out two common patterns (BLUF and emotional-validation buffer) changed model behavior as expected -- BLUF rate dropped from 52% to 41%, emotional-validation openings from 26% to 20% -- but delusion confirmation scores remained comparable. Patterns can affect behavior without showing up in eval scores.

Takeaways: Midtraining Is Hard, Multi-Turn Evals Are Key

Midtraining was notoriously difficult: the team spent weeks unable to get positive results, often seeing severe capability regressions. Key speculations for success: highly structured scenario generation (brainstorming what, how, why before generating), aggressive critique and rewrite of examples, and holistic trait documents that cover trade-offs and exceptions. Mixing baseline SFT data with synthetic data helped prevent behavioral collapse. Multi-turn adversarial evals were essential because single-turn metrics miss violations like "the model changes its mind when the user pushes back."

The team also tried BDPO (bounded direct policy optimization) and found it sometimes marginally better than SFT but not worth the extra tuning difficulty. They recommend using expensive models only for the critique and rewrite stages, as critique seems more impactful than generation.

Future work should quantify the necessary conditions for midtraining success and extend the pattern-detection pipeline to broader synthetic data workflows.

Source: Synthetic document finetuning for instilling positive traits
Domain: alignmentforum.org