In a sign of recursive capability improvement, OpenAI found that its model-based grader for the HealthBench evaluation benchmark was more accurate and consistent than the average human physician performing the same grading task. This demonstrates that models can not only perform a task but also evaluate that performance at a superhuman level, a key component of scalable oversight.
As AI models become adept at identifying novel or experimental treatments for individuals, pressure on the medical regulatory system will mount. Patients, armed with compelling, AI-generated arguments for a specific therapy, will increasingly challenge establishment gatekeeping, potentially forcing an evolution of the social contract around access to unproven medicines ('right to try').
To ensure model robustness, OpenAI uses a "worst at N" evaluation metric. They sample a model's output multiple times (e.g., 20) on a given problem and measure the performance of the single worst response. This focuses development on eliminating low-quality outliers and ensuring a high floor for safety and consistency, rather than just optimizing for average performance.
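The worst-at-N idea can be sketched in a few lines. This is a minimal illustration, not OpenAI's actual harness: the `generate` and `score` functions below are hypothetical stand-ins for a model call and a grader, and the metric simply averages the per-problem minimum score.

```python
import random

def worst_at_n(problems, generate, score, n=20):
    """Worst-at-N: sample the model n times per problem and keep only
    the score of the single worst response. Averaging these per-problem
    minima measures the quality floor, not the mean performance."""
    floors = []
    for problem in problems:
        responses = [generate(problem) for _ in range(n)]
        floors.append(min(score(problem, r) for r in responses))
    return sum(floors) / len(floors)

# Toy demo with a stubbed "model" whose response quality varies.
random.seed(0)
problems = ["q1", "q2", "q3"]
generate = lambda p: random.uniform(0.5, 1.0)  # stand-in for a model call
score = lambda p, r: r                         # stand-in for a grader
floor = worst_at_n(problems, generate, score, n=20)
assert 0.5 <= floor <= 1.0
```

Because the metric takes a minimum rather than a mean, a single bad sample among twenty drags the score down, which is exactly the pressure toward eliminating low-quality outliers that the passage describes.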
To lower the activation energy for user adoption, OpenAI has committed not to use data connected to ChatGPT Health to train its foundation models. This strategic choice is designed to remove any tension between privacy and utility, assuring users their sensitive information is not being used for other purposes and building the trust necessary for scaled impact in the healthcare domain.
Contrary to fears that reinforcement learning would push models' internal reasoning (chain-of-thought) into an unexplainable shorthand, OpenAI has not seen significant evidence of this "neuralese." Models still predominantly use plain English for their internal monologue, a pleasantly surprising empirical finding that preserves a crucial method for safety research and interpretability.
The scale of AI adoption in healthcare is not a future projection but a current reality, with over 230 million people using ChatGPT for health and wellness queries every week. This massive, existing user base establishes it as one of the fastest-growing use cases and reframes the challenge from driving initial adoption to scaling impact and ensuring safety for a global audience.
The 'Overton window' of trust in AI for health is shifting much faster for consumers than for doctors. Patients are rapidly adopting tools like ChatGPT, often introducing the technology to their physicians. This dynamic creates a bottom-up adoption pressure and means the initial challenge is not convincing health systems, but managing the interactions between AI-empowered patients and not-yet-AI-empowered clinicians.
The utility of collecting personal health data from wearables (like a WHOOP band) is not static; it compounds over time as AI model intelligence increases. Data that yields minor insights today could unlock profound health predictions in the future, creating a new incentive for consumers to start gathering longitudinal data on themselves now, even if the immediate benefit seems marginal.
OpenAI's health division serves a dual purpose: delivering societal benefits and providing a real-world, high-stakes environment for AI safety research. Problems like scalable oversight (supervising superhuman AI) move from theoretical exercises to practical necessities when models outperform physicians on narrow tasks, creating concrete feedback loops that accelerate safety progress.
In a partnership with Kenya's Penda Health, OpenAI conducted the first randomized controlled trial of an LLM co-pilot for physicians. The study demonstrated a statistically significant improvement in diagnosis and treatment outcomes for patients whose doctors used the AI assistant. This provides crucial real-world evidence that AI can move beyond lab benchmarks to tangibly improve care.
Frontier AI models excel in medicine less because of their encyclopedic knowledge and more because of their ability to integrate huge amounts of context. They can synthesize a patient's entire medical history with the latest research—a task difficult for any single human. This highlights that the key to unlocking AI's value is feeding it comprehensive data, as context is the primary driver of superhuman performance.
Rather than relying on a small group of experts, OpenAI has built a three-tiered system involving over 260 physicians. This includes high-level strategic advisors, a large cohort for data operations like red-teaming and comparison tasks (communicating via Slack), and a core group of close advisors who translate this collective expertise into concrete evals and training data for researchers.
In a move prioritizing access over monetization, OpenAI plans to offer its reasoning-level ChatGPT Health product to all users for free, without ads or rate limits. This represents an early form of 'universal basic intelligence' and a deliberate strategy to build trust and maximize societal benefit in a high-stakes domain, separating its health impact work from other company incentives.
