From everyday language to clinical precision: introducing Medical Term Recall

Corti’s French ASR brings a new level of accuracy to healthcare transcription.
That means more reliable AI deployed in healthcare, more capable clinical applications, and more builders who can innovate confidently within regulated environments. In benchmark evaluations, the model demonstrates roughly a 40% reduction in clinically critical term errors compared to leading general-purpose systems, establishing a new standard for clinical speech recognition.
To understand why this gap exists for general-purpose speech recognition systems - and how we built our proprietary, domain-specific pipeline to close it - it helps to look at the language these systems are trained on.
It is estimated that a typical 20-year-old native English speaker knows around 42,000 dictionary words (Brysbaert et al., 2016), though only a fraction of these are used actively in daily conversation. Other studies put the number far lower still, suggesting that much of our communication draws on a relatively small, familiar core vocabulary (Milton & Hopkins, 2013).
This is, to a large extent, the language that general speech recognition systems are trained on. Datasets drawn from platforms like YouTube, podcasts, or call centers reflect the way people typically speak - covering conversational English, pop culture, and common expressions - but rarely delving into the specialized lexicon of expert domains.
When the vocabulary explodes
In contrast, medicine is a linguistic universe of its own.
In the English version of SNOMED CT, for instance, you’ll find over 360,000 active clinical concepts, each potentially associated with multiple synonyms, abbreviations, and spelling variants. When combined with other medical ontologies such as ICD-10, RxNorm, and LOINC, the full English medical lexicon easily exceeds one million distinct terms.
This explosion in vocabulary isn’t just about numbers - it’s about meaning density. A single misheard syllable can flip a diagnosis: infection becomes infarction, ceftriaxone becomes ceftazidime. These are not harmless slips; they can change the course of patient care.
Scaling speech recognition for healthcare
To build a speech recognition system that truly scales within healthcare, the model must do more than capture fluent English - it must excel at recognizing and reproducing medical terminology accurately.
Evaluating such a model requires audio datasets rich in medical vocabulary.
Each dataset should include:
- Speech segments that span the diversity of medical entities (diagnoses, drugs, procedures, dosages, and abbreviations).
- Multiple examples of each term across accents, environments, and speakers.
- Structured data splits for training and evaluation - so performance can be measured through cross-validation and not just memorization.
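The third requirement - splits that measure generalization rather than memorization - typically means keeping speakers disjoint between training and evaluation data. As a minimal sketch (the function name, record shape, and parameters below are illustrative assumptions, not Corti's actual pipeline), a speaker-disjoint split can be built like this:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(segments, eval_fraction=0.2, seed=42):
    """Split segment records (dicts with a 'speaker' key) so that no
    speaker appears in both the training and evaluation sets."""
    by_speaker = defaultdict(list)
    for seg in segments:
        by_speaker[seg["speaker"]].append(seg)

    # Shuffle speakers deterministically, then carve off a held-out group.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_eval = max(1, int(len(speakers) * eval_fraction))
    eval_speakers = set(speakers[:n_eval])

    train = [s for spk in speakers[n_eval:] for s in by_speaker[spk]]
    eval_ = [s for spk in eval_speakers for s in by_speaker[spk]]
    return train, eval_
```

The same grouping logic extends to k-fold cross-validation: rotate which speaker groups are held out, so a model is never scored on a voice it has already memorized.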
Only then can we begin to measure what truly matters in clinical ASR: not just how few words the system misses, but how well it captures the clinically critical ones.
Moving beyond WER
Medical Term Recall (MTR) measures how well a model correctly identifies and transcribes clinically relevant entities, including medications, diagnoses, procedures, and dosages.
Instead of penalizing all substitutions equally, MTR isolates the subset of terms that carry clinical meaning and downstream impact.
Formula:

MTR = (clinically relevant terms transcribed correctly) / (clinically relevant terms in the reference transcript)
This isolates clinical comprehension from general lexical accuracy and produces a more meaningful signal for regulated medical ASR systems.
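As a minimal sketch of the metric described above - assuming a pre-extracted list of reference terms and naive substring matching (Corti's actual pipeline would presumably normalize synonyms and abbreviations, which is not shown here):

```python
def medical_term_recall(reference_terms, hypothesis_text):
    """Fraction of clinically relevant reference terms that appear
    verbatim in the hypothesis transcript (simplified matcher)."""
    hyp = hypothesis_text.lower()
    found = sum(1 for term in reference_terms if term.lower() in hyp)
    return found / len(reference_terms) if reference_terms else 1.0

terms = ["ceftriaxone", "myocardial infarction", "0.5 mg"]
hyp = "patient given ceftazidime for suspected myocardial infarction, dosed at 0.5 mg"
print(round(medical_term_recall(terms, hyp), 2))  # 0.67
```

Note how the drug confusion (ceftriaxone → ceftazidime) costs a full third of the score here, while it would barely register in a transcript-wide word error rate.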
Representative failure modes MTR is designed to surface
Diagnosis term collisions
- “myocardial infarction” → “myocardial infection,” preventing a CDI prompt for acute MI
- “pulmonary embolism” → “PE,” later normalized to “physical exam”
Medication confusions with similar phonetics
- “metoprolol” → “metoclopramide”
- “ceftriaxone” → “ceftazidime”
Dosage and unit errors
- “0.5 mg” → “5 mg”
- “mcg” → “mg”
Procedure and device terms
- “endotracheal intubation” → “intubation”
- “PCI” expanded incorrectly to “pacemaker insertion”
Abbreviations and accents
- “NSTEMI” → “STEMI” in noisy environments
- “sats 88 on room air” → “stat 88”
These are low-frequency relative to function words but high-impact for patient safety and downstream processes. WER underweights them. MTR measures them directly.
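The gap between the two metrics is easy to demonstrate. In the hypothetical order below (a constructed example, not from Corti's evaluation data), two medication and diagnosis substitutions leave WER at a respectable-looking 20% while every clinically critical term is wrong:

```python
def wer(ref, hyp):
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

ref = "start metoprolol 25 mg twice daily and monitor for infarction"
hyp = "start metoclopramide 25 mg twice daily and monitor for infection"

print(f"WER: {wer(ref, hyp):.2f}")  # WER: 0.20
terms = ["metoprolol", "infarction"]
recall = sum(t in hyp.split() for t in terms) / len(terms)
print(f"MTR: {recall:.2f}")  # MTR: 0.00
```

Two substitutions out of ten words keep WER low, yet both the drug and the diagnosis are lost - exactly the signal MTR is built to expose.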
Results
In controlled evaluations over more than 150,000 validated medical entities, Corti’s domain-trained models achieved:
- 78% Medical Term Recall (MTR)
- 48-59% MTR for leading general-purpose ASR systems evaluated under identical conditions
This corresponds to roughly a 40% reduction in clinically critical term errors, improving documentation accuracy and structured data extraction.
Why it matters
As frameworks such as the EU AI Act and FDA Good Machine Learning Practice emphasize explainability and risk-based evaluation, developers need metrics that demonstrate clinical reliability. WER cannot indicate whether a model is safe for medical use. MTR provides a targeted signal for clinical language understanding.