GPT-5.5 Мгновенные ответы на вопросы здоровья имеют меньше ошибок, чем врачи

GPT-5.5 Instant's health responses now have fewer documented failure modes than responses written by physicians with unlimited time and internet access. That's not a claim from a lab demo, it's a result from a global network of physicians who wrote and then blind-reviewed 3,500 real-world health conversations.

OpenAI reported the findings today alongside the announcement that GPT-5.5 Instant is rolling out to all free ChatGPT users. More than 230 million people turn to ChatGPT for health every week; that's 230 million people who just got a material upgrade in the quality of advice they receive.

Fewer Failure Modes Than Human Doctors

OpenAI asked physicians to write responses to representative health conversations with unrestricted time and full internet access but no AI assistance. A separate panel of physicians then compared those human-written responses to model outputs from GPT-5.5 Instant and older models.

GPT-5.5 Instant had fewer instances of failing to tailor advice to local healthcare context, missing red flags or referral to care, and neglecting to ask for additional context from the user. The model outperformed both older models and physicians on each of those dimensions.

How OpenAI Measures Health Intelligence

Progress in health requires more than accuracy. The company built two evaluation suites: HealthBench and HealthBench Professional, which use realistic health conversations and physician-written rubrics covering accuracy, safety, communication, context awareness, completeness, and appropriate escalation.

Another measurement comes from production traffic. OpenAI runs privacy-preserving monitors on billions of health messages per week to track factuality issues. Over the last two months, the rate of responses flagged for at least one factuality issue dropped by 71%.

What This Means for 230 Million Weekly Users

GPT-5.5 Instant is now the default model for free tier users. The improvements in recognizing urgency, handling uncertainty, and explaining complex medical information apply to every interaction.

OpenAI's example response to "Why might a doctor recommend an MRI before a steroid injection for sciatica?" shows the model providing a structured, evidence-based answer without overstating certainty. It cites standard medical references and explains when an MRI might not be necessary.

Scaling health intelligence to hundreds of millions of users while reducing factual errors by 71% in two months sets a new bar for what general-purpose language models can do in high-stakes domains. The physician-led evaluation loop gives a replicable template for measuring that progress, not just for OpenAI but for anyone building health-facing AI.

Expect the next round of improvements to come from tightening the feedback cycle between production monitors and physician reviewers, not from scaling model size alone.

Source: Improving health intelligence in ChatGPT
Domain: openai.com

GPT-5.5 Мгновенные ответы на вопросы здоровья имеют меньше ошибок, чем врачи

Fewer Failure Modes Than Human Doctors

How OpenAI Measures Health Intelligence

What This Means for 230 Million Weekly Users

More in Artificial Intelligence