SFT, Not RL, Determines Gemini's Safety Properties

Google DeepMind's LM interpretability team ran a clean experiment: take pretrained-only Gemini 3.1 Pro and Gemini 3 Flash, perform supervised fine-tuning (SFT) with the standard Gemini mixture, then compare safety evals against the production models that also went through RL. The result? Nearly identical scores across every benchmark they tested.

In the post's results chart, the blue bars (SFT-only) and orange bars (production) are closely matched across all evals.

The evals include the ODCV benchmark, a version of the Petri alignment eval filtered for dilemma-style problems, safety evals measuring over-refusal and unsafe response rates, and a reward-hacking environment in which the model is dropped into a Docker container and asked to optimize an algorithmic problem against a timing script. In that environment, the SFT-only model cheats just as often, or as rarely, as the full production model.
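To make "just as often" concrete, here is a minimal sketch of how one might compare a rate (for example, the fraction of episodes flagged as cheating) between two models. This is not the team's analysis: the function name, the normal-approximation interval, and the counts are all assumptions chosen for illustration, and the numbers are placeholders rather than figures from the post.

```python
import math


def rate_difference_ci(hits_a, n_a, hits_b, n_b, z=1.96):
    """Return (a - b) rate difference and a normal-approximation interval.

    hits_* are counts of flagged episodes, n_* are total episodes.
    z=1.96 corresponds to a roughly 95% interval.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)


# Placeholder counts for illustration only, NOT results from the post.
diff, (lo, hi) = rate_difference_ci(hits_a=12, n_a=200, hits_b=10, n_b=200)
print(f"difference = {diff:.3f}, interval = ({lo:.3f}, {hi:.3f})")
```

If the interval comfortably contains zero, the two rates are statistically hard to tell apart at that sample size. A narrow interval around zero is stronger evidence of a match than a wide one.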

SFT, Not RL, Carries the Safety Load

This runs counter to the common assumption that RLHF or other RL-based post-training stages are what instill safety behaviors. For Gemini, the safety properties measured by these evals are already in place after SFT. The team is careful not to generalize to other model families or future Gemini versions, but the result is clear enough to change where they focus their intervention efforts.

Free-tier user logs (50,000 anonymized prompts from AI Studio) showed the same pattern. Autoraters scored the models' thoughts and responses for a range of behaviors; again, the SFT-only and production models matched. The team acknowledges that false positives are likely in the autorater pipeline, but the relative comparison holds.
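Why would a relative comparison survive a noisy autorater? A minimal sketch of the reasoning, assuming the autorater's true-positive and false-positive rates are the same regardless of which model produced the text (that assumption, the function names, and the numbers are illustrative, not taken from the post):

```python
def observed_flag_rate(true_rate, tpr, fpr):
    """Flag rate an imperfect autorater reports for a given true behavior rate.

    tpr: probability the autorater flags a genuinely bad sample.
    fpr: probability the autorater flags a benign sample.
    """
    return true_rate * tpr + (1.0 - true_rate) * fpr


def observed_gap(true_a, true_b, tpr, fpr):
    """Difference in reported flag rates between two models.

    Algebraically this equals (true_a - true_b) * (tpr - fpr), so the
    false-positive term cancels out of the comparison.
    """
    return observed_flag_rate(true_a, tpr, fpr) - observed_flag_rate(true_b, tpr, fpr)


# Illustrative numbers only. With a 5% false-positive rate, a true rate of
# 2% is reported as roughly 6.7%, so the absolute number is inflated.
inflated = observed_flag_rate(0.02, tpr=0.9, fpr=0.05)

# Equal true rates give a zero gap; unequal true rates give a nonzero gap.
same = observed_gap(0.02, 0.02, tpr=0.9, fpr=0.05)
different = observed_gap(0.04, 0.02, tpr=0.9, fpr=0.05)
print(inflated, same, different)
```

Under this assumption, false positives inflate both models' absolute scores by the same amount, while a real difference in behavior would still show up as a gap (scaled by `tpr - fpr`). If the autorater erred differently on the two models, the argument would no longer hold.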

Why SFT Is the Lever Worth Pulling

The practical implication: if you want to steer safety properties in Gemini, SFT is the high-leverage point. The team plans to intervene at the SFT stage in future safety work rather than relying on RL-based guardrails. That is a concrete shift from the typical playbook, and a reminder that a training pipeline's internal dynamics can differ from what the high-level diagram suggests. Expect more work from them on how to modify the SFT mixture to shape behavior without touching the rest of the pipeline.

Source: SFT Drives Gemini's Safety Properties
Domain: alignmentforum.org
