Building better speech recognition for healthcare: What models get wrong, and how we fix it

The models that power AI in healthcare don’t just need to be accurate. They need to be accountable, consistent, and explainable, especially when they’re handling clinical conversations.
That’s why we invest heavily in infrastructure for speech recognition in healthcare: to ensure the systems we (and others) build are safe for clinical workflows, scalable across languages, and flexible enough to adapt to specialized medical vocabularies.
In a recent conversation at Corti, three members of our ML and research team—Lasse, Jakob, and Lana—sat down to unpack some of the challenges and trade-offs they see when working with open-source models, and what that means for teams building on top of speech recognition APIs in healthcare.
Hallucinations in ASR are real (and in some cases, dangerous)
Unlike LLMs, where the term “hallucination” often just means a factual error, ASR hallucinations are exactly what they sound like: the model invents text that wasn’t said. That might mean transcribing “subtitles by Jane Doe” at the end of a clinical sentence, because it learned that pattern from subtitle training data. But in healthcare, it can be much worse.
Imagine a model hallucinating a dosage instruction or omitting a symptom. Whether it’s adding or removing language, the stakes are high.
The team discussed how hallucinations typically stem from two things:
- Weakly labeled training data, where transcripts don’t match the audio
- Model architecture, particularly AED (Attention-based Encoder-Decoder) models like Whisper, which are flexible but prone to over-generation
Other architectures, like CTC or transducers, tend to hallucinate less. But they also come with trade-offs in flexibility and multilingual support.
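To make that architectural difference concrete, here is a minimal sketch of greedy CTC decoding in Python. It is an illustration of the general technique, not Corti's implementation: because CTC emits at most one label per audio frame and then collapses blanks and repeated labels, every output character is anchored to a frame of audio, which leaves far less room for the model to invent text on its own. An AED decoder, by contrast, generates tokens autoregressively and can keep generating after the audio has effectively been consumed.

```python
# Minimal sketch of greedy CTC decoding (illustrative only).
# Each output symbol is tied to a specific audio frame; blanks and
# repeats are collapsed, so text without an audio anchor cannot appear.
import numpy as np

BLANK_ID = 0  # conventional index for the CTC blank token

def greedy_ctc_decode(log_probs: np.ndarray, id_to_token: dict) -> str:
    """log_probs: (num_frames, vocab_size) frame-level log-probabilities."""
    best_ids = log_probs.argmax(axis=-1)  # one token id per audio frame
    output = []
    prev_id = None
    for token_id in best_ids:
        if token_id != prev_id and token_id != BLANK_ID:
            output.append(id_to_token[int(token_id)])  # keep first of each run, drop blanks
        prev_id = token_id
    return "".join(output)

# Toy example: 6 frames, vocab = {0: blank, 1: "h", 2: "i"}
log_probs = np.log(np.array([
    [0.1, 0.8, 0.1],   # "h"
    [0.1, 0.8, 0.1],   # "h" (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # "i"
    [0.8, 0.1, 0.1],   # blank
    [0.8, 0.1, 0.1],   # blank
]))
print(greedy_ctc_decode(log_probs, {0: "", 1: "h", 2: "i"}))  # -> "hi"
```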
Why open-source models aren’t always real-time, and why that matters
For healthcare applications like dictation, triage assistants, or live documentation, low-latency transcription isn’t optional—it’s essential. But many open-source models like Whisper weren’t designed with that in mind.
Yes, Whisper can transcribe faster than real-time. But what users care about is word-level latency: how quickly does the system display the word they just said? Whisper’s architecture, which depends heavily on context and long audio chunks, isn’t well-suited for this kind of responsiveness.
This matters for workflows with spoken commands (e.g. “insert template for shoulder pain”) or environments like emergency calls, where system response speed is critical. Alternative model types, such as CTC or streaming transducers, handle this much better, and we use them extensively at Corti where latency is non-negotiable.
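As a rough illustration of what a streaming setup looks like, the sketch below feeds audio to a recognizer in short, fixed-size chunks and surfaces words as soon as they stabilize. `StreamingRecognizer`, its method names, and the chunk size are hypothetical placeholders chosen for illustration, not a specific Corti or open-source API.

```python
# Illustrative streaming loop: small chunks in, stabilized words out.
# `recognizer` is a hypothetical object exposing accept_chunk() and finalize().
import time

CHUNK_MS = 80  # small, fixed-size chunks keep word-level latency low

def stream_transcribe(audio_chunks, recognizer):
    """audio_chunks: iterable of raw audio buffers, each roughly CHUNK_MS long."""
    emitted = []
    for chunk in audio_chunks:
        t0 = time.monotonic()
        new_words = recognizer.accept_chunk(chunk)     # newly stabilized words, if any
        latency_ms = (time.monotonic() - t0) * 1000    # processing delay on top of chunk length
        for word in new_words:
            emitted.append(word)
            print(f"{word!r} shown after {latency_ms:.0f} ms of processing")
    emitted += recognizer.finalize()                   # flush the remaining hypothesis
    return " ".join(emitted)
```

The point of the sketch is the shape of the loop: a streaming-friendly model can commit to words chunk by chunk, whereas a model that needs long context windows has to wait before it can show the user anything.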
Medical vocabulary requires tailored models
Generic speech models don’t speak the language of medicine. Out of the box, most will fail to correctly transcribe critical medical terms, acronyms, or specialized phrasing, especially across accents, dialects, and healthcare systems.
To solve this, our team fine-tunes Corti’s models using synthetic data, generated from real-world clinical notes. By using text-to-speech (TTS) systems to create paired training examples, we can teach our models to recognize and spell domain-specific language correctly.
Even better: this doesn’t require costly audio collection. Text-based adaptation (especially when structured and contextual) offers a scalable path to localization across hospitals, regions, and specialties.
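As a sketch of what this can look like in practice, the snippet below turns a handful of domain sentences into paired audio/text examples plus a simple training manifest. The sentences and the `synthesize_speech` helper are hypothetical stand-ins for real clinical text and whichever TTS engine a team uses.

```python
# Sketch of building synthetic (audio, transcript) pairs from clinical text.
# `synthesize_speech` stands in for any TTS engine; the sentences are illustrative.
import json
from pathlib import Path

clinical_sentences = [
    "Patient prescribed metoprolol 50 mg twice daily.",
    "MRI shows a rotator cuff tear in the right shoulder.",
]

def build_synthetic_pairs(sentences, out_dir: Path, synthesize_speech):
    """Create paired audio/text training examples from domain text alone."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for i, text in enumerate(sentences):
        audio_path = out_dir / f"utt_{i:05d}.wav"
        audio_path.write_bytes(synthesize_speech(text))  # TTS: text -> waveform bytes
        manifest.append({"audio": str(audio_path), "text": text})
    # JSON-lines manifest, a common input format for ASR fine-tuning recipes
    with open(out_dir / "manifest.jsonl", "w") as f:
        f.write("\n".join(json.dumps(row) for row in manifest))
    return manifest
```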
Decoder-only models are a promising direction
One of the newer approaches discussed was decoder-only ASR models, which allow real-time prompting with custom vocabulary, acronyms, or output formatting (like writing dates in specific ways). These models treat ASR more like a language modeling problem conditioned on audio and are easier to adapt without full retraining.
While they’re still early-stage, decoder-only ASR could be especially valuable in healthcare, where user-specific terms and behavior are the norm, not the exception.
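A conceptual sketch of how such prompting might look is below. The prompt format and the `model.generate` interface are assumptions made for illustration; the key idea is that vocabulary and formatting preferences travel with each request instead of being baked into the model weights.

```python
# Conceptual sketch of prompting a decoder-only ASR model.
# The model interface and prompt format are illustrative assumptions, not a published API.
custom_vocabulary = ["metoprolol", "ECG", "SOAP note", "rotator cuff"]
formatting_rules = "Write dates as DD-MM-YYYY. Expand spoken abbreviations."

def build_prompt(vocabulary, rules):
    """Text prompt placed ahead of the audio conditioning at inference time."""
    vocab_hint = "Likely terms: " + ", ".join(vocabulary)
    return f"{vocab_hint}\n{rules}\nTranscript:"

def transcribe_with_prompt(model, audio_features, vocabulary, rules):
    prompt = build_prompt(vocabulary, rules)
    # The decoder conditions on both the prompt tokens and the audio features,
    # so vocabulary and formatting can change per request without retraining.
    return model.generate(prompt=prompt, audio=audio_features)
```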
What this means for builders
For teams building AI-powered healthcare tools, this discussion points to a few clear takeaways:
- Architecture choice matters: If your use case depends on low latency or transcription accuracy in high-stakes environments, you’ll need more than a general-purpose ASR model.
- Training data drives risk: Using models trained on weakly labeled or subtitle-style data without adapting it for clinical use increases the risk of hallucinations and omissions.
- Specialization is essential: General-purpose ASR doesn’t cut it for medicine. Fine-tuning with real-world vocabulary and user context is the only way to ensure safe, useful transcription.
- The API is the interface, but not the answer: A strong ASR API is only as good as its underlying architecture and data discipline. Healthcare AI needs infrastructure.
Whether you’re working on documentation, decision support, or voice-first apps, the infrastructure matters. And at Corti, that’s what we build.