Independent benchmarks across medical coding, speech recognition, clinical reasoning, and agentic AI. Tested head-to-head against the largest AI labs in the world.
Clinical tasks have no margin for error. A model that scores well on general benchmarks performs very differently when the task is ICD-10 coding, clinical speech recognition, or multi-step reasoning over a patient record. The gap between claimed and measured performance is where most healthcare AI falls apart.
Corti's benchmarks are run on real clinical tasks, against real inputs, head-to-head with frontier models from the largest AI labs. The results are published so you can verify them yourself.
No. The dataset deliberately overrepresents difficult and adversarial examples by roughly 3.5x. A 60% score here can coexist with strong performance in typical clinical use. The benchmark is a stress test, not an average-case measurement.
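For intuition, here is a minimal sketch of how a stress-test score relates to a typical-case estimate once you reweight for that overrepresentation. Every number below (the per-subset accuracies, the real-world share of hard cases) is a hypothetical assumption chosen only to illustrate the arithmetic, not a measured figure:

```python
# Minimal sketch: a stress-test score understates typical-case performance
# when hard cases are overrepresented. All numbers are hypothetical.

# Hypothetical per-subset accuracies for some model.
acc_routine = 0.85      # accuracy on routine cases
acc_adversarial = 0.40  # accuracy on difficult/adversarial cases

# Assume adversarial cases are 10% of real clinical traffic but are
# overrepresented roughly 3.5x in the benchmark mix.
real_world_hard_share = 0.10
benchmark_hard_share = min(1.0, real_world_hard_share * 3.5)  # = 0.35

def weighted_accuracy(hard_share: float) -> float:
    """Blend subset accuracies by the share of hard cases."""
    return hard_share * acc_adversarial + (1 - hard_share) * acc_routine

print(f"benchmark score:   {weighted_accuracy(benchmark_hard_share):.1%}")   # 69.2%
print(f"typical-case est.: {weighted_accuracy(real_world_hard_share):.1%}")  # 80.5%
```

Under these illustrative assumptions, the same model scores roughly eleven points lower on the stress-test mix than it would on a real-world case mix.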
About a third of the benchmark consists of cases written by physicians deliberately trying to break models. Strategies include false premises, role-play framing, and presenting questionable diagnoses as fact. Symphony scores 59.0 on this subset against GPT-5.4's 30.3. That gap reflects robustness under adversarial clinical pressure, not just routine performance.
Three categories reflecting real clinical workflows: care consult (differential diagnosis, treatment reasoning), writing and documentation (note generation, coding, patient messaging), and medical research (synthesizing and finding clinical evidence).
Yes. Symphony is built for healthcare environments, including support for sovereign cloud deployments for organisations with strict data residency requirements.
2% Word Error Rate (WER) on clinical speech across 150,000+ medical terms and 14 languages. In practical terms, that is roughly a third to a quarter of the errors made by the next best alternatives.
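For readers unfamiliar with the metric, WER is the word-level edit distance between the model's transcript and the reference (substitutions, deletions, and insertions) divided by the number of words in the reference. A minimal sketch, using a hypothetical phrase rather than anything from Corti's test set:

```python
# Compute Word Error Rate (WER): word-level edit distance over reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of word-level edit distances.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "patient prescribed 40 mg furosemide daily"
hyp = "patient prescribed 40 mg furosemide twice daily"
print(f"WER: {wer(ref, hyp):.1%}")  # one insertion over six words, about 16.7%
```

At 2% WER, a transcript like the one above would average one word-level error per fifty words, which is what makes the metric meaningful for dosage and drug-name accuracy.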
It runs in real time during the consultation rather than after. By the time the visit ends, structured facts are already extracted, validated, and timestamped.
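As a rough illustration of what "extracted, validated, and timestamped" could mean in practice, here is a hypothetical sketch of such a fact record. The schema, field names, and values are illustrative assumptions, not Corti's actual data model:

```python
# Hypothetical sketch of a structured clinical fact extracted mid-consultation.
# The schema and field names are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class ClinicalFact:
    concept: str        # e.g. a medication, symptom, or finding
    value: str          # the extracted detail
    code: str           # a standard code, e.g. ICD-10, SNOMED CT, or ATC
    timestamp_s: float  # seconds into the consultation audio
    validated: bool     # whether the fact passed validation checks

fact = ClinicalFact(
    concept="medication",
    value="furosemide 40 mg daily",
    code="C03CA01",  # ATC code for furosemide, shown for illustration
    timestamp_s=212.4,
    validated=True,
)
print(fact)
```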