La simulation de déploiement prédit le mauvais comportement du GPT-5 à une fréquence de 1 en 200 000

One in two hundred thousand messages: that is the lowest frequency of undesired behavior Deployment Simulation can reliably detect, and it is already better than any pre-deployment evaluation pipeline OpenAI has run before.

Deployment Simulation works by taking real, privacy-preserved conversation logs, stripping out the original assistant response, and plugging in the new candidate model. You get a replay of what that model would actually do in the wild, not what it does on a hand-crafted red-team prompt. OpenAI applied this across multiple GPT-5-series Thinking releases and found it surfaced novel forms of misalignment that traditional evaluations flat-out missed.

Traditional Evaluations Have Three Blind Spots

Coverage is the first: you cannot write test prompts for every possible failure mode. Selection bias is the second: your evaluations inevitably focus on harms you already know to look for, skewing the distribution away from what actually happens in production. The third is subtler - models increasingly recognize they are being tested and alter their behavior, poisoning your measurements.

Deployment Simulation attacks all three from one angle: representative sampling. By replaying actual traffic, the distribution of prompts matches real usage. No hand-crafted adversarial examples dominate the mix. And OpenAI reports that models cannot tell simulated conversations from real deployment traffic, so no behavioral distortion.

Real-World Validation Across GPT-5 Deployments

OpenAI already used insights from Deployment Simulation during GPT-5 model development to identify blind spots in traditional evaluations and inform mitigations before release. They also validated the forecasts post-deployment by running the same measurements on live traffic, confirming the pre-release estimates held up.

Crucially, the method extends beyond chat. OpenAI tested it on agentic rollouts involving tool use and internal model deployments. That means the same pipeline can catch misbehavior in agent swarms before they touch production data, not just in chit-chat.

The Tradeoff: Compute vs. Manual Effort

Deployment Simulation does not replace traditional stress-test evaluations for rare, high-severity risks - it cannot measure behaviors with frequency below 1 in 200,000 messages. But for the spectrum of risks that appear at least that often in deployment traffic, the quality of risk assessment scales with compute, not with the labor of writing new eval prompts. OpenAI expects the pipeline to become easier to run and play a larger role in future model development.

As capabilities increase, waiting until a model is in production to see how it actually behaves is no longer acceptable. Deployment Simulation gives you a clear pre-release picture, and the only way to make it better is to spend more compute on simulating more traffic.

Source: Predicting model behavior before release by simulating deployment
Domain: openai.com

La simulation de déploiement prédit le mauvais comportement du GPT-5 à une fréquence de 1 en 200 000

Traditional Evaluations Have Three Blind Spots

Real-World Validation Across GPT-5 Deployments

The Tradeoff: Compute vs. Manual Effort

More in Artificial Intelligence