Symphony outperforms OpenAI on HealthBench
TLDR
- Corti Symphony outperforms OpenAI on their own healthcare benchmark, surpassing every major LLM provider across clinical reasoning tests.
- The capability OpenAI shipped as a proprietary clinical product is reproducible today, by any developer, on Corti's infrastructure.
- For the next generation of builders, Corti for Startups is open to pre-seed through Series B teams, with up to $5,000 in API credits. Apply here.
Corti Symphony outperforms OpenAI on their own healthcare benchmark, HealthBench Professional
Last week, OpenAI launched ChatGPT for Clinicians, a free tool for verified physicians, and published HealthBench Professional, an open benchmark built to measure AI performance on real clinician tasks. Their headline result: GPT-5.4 with extended reasoning outperforms all other models tested, including Claude, Gemini, and Grok, and outperforms specialist physicians given unlimited time and web access.
We ran Corti Agents on it. Corti outperformed every major LLM provider by at least 25%.

Corti Symphony: 60.5. GPT-5.4 with extended high reasoning: 48.1. ChatGPT for Clinicians, OpenAI's tailored clinical product, reaches 59.0. Corti beats that too, at half the cost, on infrastructure any developer can access today.
Last month, we outperformed Anthropic on medical coding. This week, OpenAI on clinical agents. The pattern is not a coincidence.
General-purpose models are trained to be broadly capable. That is their design. They perform well across a wide distribution of tasks because they are optimized for exactly that. Healthcare is not a wide distribution. It is a narrow, high-stakes domain with specific failure modes, specific documentation requirements, specific ways that ambiguity becomes dangerous, and essentially zero tolerance for confident errors.
Corti has spent years building inside that domain, not around it. Symphony is trained on clinical data, evaluated by physicians, and tested in real clinical workflows. The Agentic Framework was designed specifically for the constraints of clinical environments: governed orchestration, domain-specific reasoning, full auditability at every step. These are not features added to a general model. They are the architecture of a specialized one.
When a benchmark rewards clinical judgment under pressure, Corti wins because the model was built for clinical judgment under pressure. When a benchmark tests safety on adversarial prompts designed to probe the edges of clinical reasoning, Corti wins because those edges are where we do our best work.
What HealthBench Professional measures, and how Corti outperforms the field
Most healthcare AI benchmarks use multiple-choice exams from medical licensing tests. They measure recall. They do not measure what happens when a clinician is mid-consult, working through ambiguous symptoms, deciding whether to escalate, and needing to document what they find.
HealthBench Professional measures that. It is a rubric-graded benchmark of 525 physician-authored tasks drawn from real clinician conversations, organized across three use cases: care consult, writing and documentation, and medical research. Every response is scored against criteria written and adjudicated by three or more physicians across three rounds of review.
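For readers unfamiliar with rubric grading, here is a minimal sketch of how a HealthBench-style task score can be computed: each physician-authored criterion carries a point value (negative for unsafe behavior), a response earns the points of every criterion it meets, and the total is normalized by the positive points available. The criteria and weights below are hypothetical illustrations, not actual HealthBench rubric items.

```python
# Illustrative rubric-graded scoring in the style of HealthBench.
# The criterion texts and point values are hypothetical examples.

def rubric_score(criteria: list[dict]) -> float:
    """Earned points over total positive points, clipped to [0, 1].
    Negative-point criteria penalize unsafe or overconfident answers."""
    earned = sum(c["points"] for c in criteria if c["met"])
    possible = sum(c["points"] for c in criteria if c["points"] > 0)
    return max(0.0, min(1.0, earned / possible)) if possible else 0.0

# A hypothetical rubric for one care-consult task.
criteria = [
    {"desc": "Identifies red-flag symptoms requiring escalation", "points": 5, "met": True},
    {"desc": "Recommends an appropriate next diagnostic step", "points": 3, "met": True},
    {"desc": "Hedges appropriately given the ambiguous presentation", "points": 2, "met": False},
    {"desc": "Asserts a confident diagnosis without supporting evidence", "points": -4, "met": False},
]

print(f"Task score: {rubric_score(criteria):.1%}")  # -> Task score: 80.0%
```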
One third of the benchmark is red teaming: adversarial prompts designed to surface failure modes, stress-test clinical judgment, and probe safety boundaries. OpenAI deliberately enriched the hardest examples by 3.5x and compared every model against specialist physicians with unlimited time. It was not designed to be easy to beat.
The margin on care consult, the largest category and the one closest to actual clinical decision-making, is wider still. Corti scored 63.5. ChatGPT for Clinicians scored 51.0.
Safety is where the gap becomes hardest to ignore. On the adversarial red teaming slice, Corti scored 87.7. GPT-5.4 scored 30.3. Human physicians scored 30. On a benchmark OpenAI built to test the limits of clinical AI, Corti is the safest model under the most difficult conditions.
What we built in one week
One week after the benchmark dropped, the Corti team had built a prompt and configuration on the Agentic Framework that replicates the full ChatGPT for Clinicians experience at higher accuracy. The capability OpenAI ships as a proprietary application is reproducible by any developer on Corti's infrastructure today.
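To make "reproducible by any developer" concrete, here is a minimal sketch of what invoking a configured clinical agent could look like. The endpoint URL, payload shape, and agent name are illustrative placeholders, not Corti's documented API; see Corti's developer documentation for the real interface.

```python
# Hypothetical sketch of calling a configured agent over HTTP.
# The URL, fields, and agent name are placeholders, not Corti's actual API.
import os
import requests

response = requests.post(
    "https://api.example.com/v1/agents/run",  # placeholder endpoint
    headers={"Authorization": f"Bearer {os.environ['CORTI_API_KEY']}"},
    json={
        "agent": "clinical-consult",  # hypothetical agent configuration
        "input": (
            "58-year-old presenting with new-onset chest pain and diaphoresis. "
            "Triage, decide whether to escalate, and document the encounter."
        ),
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```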
Healthcare is not a generic chatbot market. It requires agents built for clinical reality: messy workflows, operational constraints, safety demands, coding, documentation, triage, and the edge cases that define the domain. The benchmark OpenAI published to showcase their own model is now a public reference point for what Corti makes available to every builder in the ecosystem.
Introducing Corti for Startups for the next generation of healthcare builders
OpenAI's launch of ChatGPT for Clinicians unsettled a lot of builders. The worry is reasonable: if OpenAI ships a proprietary clinical product that developers cannot replicate through the API, the path to building competitive healthcare AI narrows fast.
This result is the answer to that concern. If Corti Agents can beat ChatGPT for Clinicians on OpenAI's own test, so can anything built on Corti's infrastructure. The capability is not locked inside a proprietary application.
That is what Corti for Startups, our startup acceleration program, is for. No equity, grant-funded, open to pre-seed through Series B teams building healthcare AI anywhere in the world. Accepted founders get up to $5,000 in API credits, direct support from Corti's team, and dedicated architecture time before a line of code is written.
The path to building something more capable than ChatGPT for Clinicians is open. We just proved it.
Apply at corti.ai/corti-for-startups.
Join our mission
We believe everyone should have access to medical expertise, no matter where they are.