
SFT, Not RL, Determines Gemini's Safety Properties

alignmentforum.org · @frontier_wire · 4 hours ago · Artificial Intelligence · 4 comments

Google DeepMind found that supervised fine-tuning alone reproduces almost all of the safety behavior of the production Gemini models, leaving RL as a secondary contributor to alignment.

google deepmind · gemini · supervised fine tuning · rlhf · ai alignment · ai safety

Google DeepMind's LM interpretability team ran a clean experiment: take pretrained-only Gemini 3.1 Pro and Gemini 3 Flash, perform supervised fine-tuning (SFT) with the standard Gemini mixture, then compare safety evals against the production models that also went through RL. The result? Nearly identical scores across every benchmark they tested.

The blue bars (SFT-only) and orange bars (production) are remarkably similar across all evals.

That includes the ODCV benchmark; a version of the Petri alignment eval filtered for dilemma-style problems; safety evals measuring over-refusal and unsafe response rates; and even a reward-hacking environment where the model is dropped into a Docker container and asked to optimize an algorithmic problem against a timing script. The SFT-only model cheats just as often, or as rarely, as the full production model.
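A minimal sketch of that side-by-side comparison, assuming each eval reduces to a single score per model. The `EvalResult` type, the eval labels, the tolerance, and every number below are hypothetical placeholders for illustration, not figures from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str          # hypothetical eval label
    sft_only: float    # score of the SFT-only checkpoint
    production: float  # score of the production (SFT + RL) model


def gaps(results):
    """Absolute score gap between the two models, keyed by eval name."""
    return {r.name: abs(r.sft_only - r.production) for r in results}


def all_within(results, tolerance):
    """True if every eval's gap is at most `tolerance`."""
    return all(gap <= tolerance for gap in gaps(results).values())


# Placeholder numbers, NOT taken from the source.
results = [
    EvalResult("odcv", 0.12, 0.11),
    EvalResult("over_refusal", 0.05, 0.06),
    EvalResult("reward_hacking", 0.30, 0.30),
]

print(gaps(results))
print(all_within(results, tolerance=0.02))
```

What counts as "nearly identical" depends on the tolerance chosen and on the run-to-run noise of each eval, neither of which this summary reports.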

SFT, Not RL, Carries the Safety Load

This runs counter to the common assumption that RLHF or other RL-based post-training stages are responsible for instilling safety behaviors. For Gemini, the safety properties seen after post-training are already baked in by SFT. The team is careful not to generalize to other model families or future Gemini versions, but the result is clear enough to change where they focus their intervention efforts.

Free-tier user logs (50,000 anonymized prompts from AI Studio) confirmed the pattern. Autoraters scored the models' thoughts and responses for a range of behaviors; again, the SFT-only and production models matched. The team acknowledges that false positives are likely in the autorater pipeline, but the relative comparison holds.
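One way to see why a relative comparison can survive false positives, sketched under an assumption the source does not state: the autorater has the same true-positive rate and false-positive rate on both models' outputs. In that case the expected flag rate is a linear function of the true behavior rate, so equal true rates give equal observed rates. The function names and numbers below are illustrative only.

```python
def observed_flag_rate(true_rate: float, tpr: float, fpr: float) -> float:
    """Expected fraction of transcripts flagged by an imperfect autorater.

    true_rate: fraction of transcripts that really show the behavior
    tpr: probability the autorater flags a transcript that shows it
    fpr: probability the autorater flags a transcript that does not
    """
    return tpr * true_rate + fpr * (1.0 - true_rate)


def observed_gap(true_a: float, true_b: float, tpr: float, fpr: float) -> float:
    """Difference in expected flag rates; equals (tpr - fpr) * (true_a - true_b)."""
    return observed_flag_rate(true_a, tpr, fpr) - observed_flag_rate(true_b, tpr, fpr)


# Illustrative numbers, NOT from the source.
tpr, fpr = 0.9, 0.1

# Absolute rates are inflated by false positives...
print(observed_flag_rate(0.02, tpr, fpr))

# ...but two models with the same true rate still get the same expected flag rate.
print(observed_gap(0.02, 0.02, tpr, fpr))
```

If the autorater errs differently on the two models' outputs, for example because their writing styles differ, this cancellation no longer holds.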

Why SFT Is the Lever Worth Pulling

The practical implication: if you want to steer safety properties in Gemini, SFT is the high-leverage point. The team plans to intervene at the SFT stage in future safety work rather than relying on RL-based guardrails. That's a concrete shift from the typical playbook, and a reminder that training pipelines can have different internal dynamics than the high-level diagram suggests. Expect more work from them on precisely how to modify the SFT mixture to shape behavior without touching the rest of the pipeline.


Source: SFT Drives Gemini's Safety Properties
Domain: alignmentforum.org
