Symphony for speech-to-text

Medical speech-to-text API built for clinical language

Symphony for Speech-to-Text is clinical speech recognition for healthcare developers. Real-time medical dictation, ambient documentation, and batch audio, with structured transcripts via REST and WebSocket.

Get API key

Meet with an Expert

98.6%

Word accuracy. 1.4% Word Error Rate in real-time English

Medical dictation benchmarks in English, French, and German

98.5%

Formatted entity recall on dosages, units, measurements

50%

Fewer missed terms with keyterm biasing

Trusted for more than 1 million interactions every week

Clinical speech recognition API

Three endpoints supporting every clinical audio workflow. All three run through the same underlying pipeline: medical recognition, structured formatting, and contextual correction.

/transcribe

Real-time stateless dictation

WebSocket streaming for low-latency dictation. Send audio, receive interim and final transcripts. Built for EHR navigation, structured data capture, and voice UI control.

View docs

/streams

Real-time stateful transcription

WebSocket streaming for conversational clinical audio tied to an ongoing session. Speaker diarization, audio health events, and contextual correction run natively. Built for ambient documentation workflows.

View docs

/transcripts

Async batch processing

REST endpoint for pre-recorded audio and asynchronous processing. Upload files and retrieve transcripts at scale. Same quality, no streaming infrastructure required.

View docs

An orchestrated pipeline, not a single-pass model

Symphony treats speech-to-text as a structured inference problem. Recognition, formatting, and correction run as an orchestrated pipeline rather than collapsing into one step. The result is text that is clinically accurate and operationally ready.

Stream in real time or upload batch

Stateless real-time dictation, stateful conversational transcription, and async batch processing all run through the same underlying pipeline. Build across modes without managing separate vendors or accepting differences in accuracy.

Inject context, bias toward the keywords that matter

Supply a terminology list at inference time, and Symphony biases recognition toward the terms in your context - rare drug names, facility abbreviations, clinician identifiers. No fine-tuning, no retraining cycle. Reduces missed terms by ~50% without affecting precision.
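As a rough illustration of preparing a terminology list before sending it with a request, the sketch below trims, de-duplicates, and caps a list of terms. The field name "keyterms" and the max_terms cap are assumptions made for this example, not documented Symphony parameters; check the API reference for the real request shape.

```python
import json


def build_keyterm_config(terms, max_terms=100):
    """Clean a terminology list: collapse whitespace, drop blanks,
    and remove case-insensitive duplicates while keeping first-seen order."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    # "keyterms" is a hypothetical field name used only for illustration.
    return {"keyterms": cleaned[:max_terms]}


config = build_keyterm_config(
    ["apixaban", " Apixaban ", "St. Olaf ICU", "", "levetiracetam"]
)
print(json.dumps(config))
```

Cleaning the list on the client keeps the context small and avoids sending the same term twice under different casing.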

Catch quality problems before they reach the transcript

Symphony Audio Health Events surface real-time quality signals during the interaction, so your pipeline can prompt a correction rather than surface a bad transcript after the fact.
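One way a client might react to a quality signal is sketched below. The event shape (a kind and a severity between 0 and 1), the event names, and the threshold are all hypothetical; the page does not describe the actual Audio Health Event schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioHealthEvent:
    kind: str        # hypothetical values: "low_volume", "clipping", ...
    severity: float  # hypothetical scale from 0.0 (fine) to 1.0 (unusable)


# Hypothetical mapping from event kind to a prompt shown to the clinician.
PROMPTS = {
    "low_volume": "Please move closer to the microphone.",
    "clipping": "Audio is distorted. Please lower the input gain.",
    "background_noise": "Background noise is high. Please reduce it if possible.",
}
FALLBACK_PROMPT = "Audio quality is degraded. Please check the microphone."


def prompt_for(event: AudioHealthEvent, threshold: float = 0.5) -> Optional[str]:
    """Return a user-facing prompt for serious events, or None to stay silent."""
    if event.severity < threshold:
        return None
    return PROMPTS.get(event.kind, FALLBACK_PROMPT)


print(prompt_for(AudioHealthEvent("low_volume", 0.8)))
```

Prompting during the interaction, rather than flagging the transcript afterwards, is the behaviour the paragraph above describes.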

Structured outputs your agents can act on

Entities are structured, punctuation is rendered, dosages and units are in the right form. Your agents and downstream systems can act on the transcript without a post-processing layer.

Benchmarks

The most accurate speech-to-text for clinical use cases

Best-in-class performance across medical terminology, formatting, and dictation commands.

More Benchmarks

Compare STT APIs

Leading in keyword accuracy

[Chart: keyword accuracy. Extracted values: 0.74, 0.59, 0.58, 0.52, 0.48, 0.43.]

[Chart: word accuracy by system. Symphony (realtime) 98.6%, AWS (offline) 83.6%, OpenAI (realtime) 82.6%, ElevenLabs (realtime) 81.9%, Google (offline) 81.4%, Parakeet (realtime) 81.1%.]

1.4%

Word Error Rate. Outperforming ElevenLabs, OpenAI, AWS, NVIDIA, and Google by >20% in both real-time and offline English benchmarks.

98%

Formatting accuracy, so dosages, units, dates, and measurements render correctly for downstream use.

25%+

Improvement vs. legacy dictation. Symphony 4.1% WER vs. Dragon Medical One 5.7% WER on Medical Dictation benchmarks.
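The headline figures above can be checked with a few lines of arithmetic. This sketch assumes the page's own convention that word accuracy is 100% minus WER, and that "improvement" means relative WER reduction; the function names are illustrative.

```python
def word_accuracy(wer_percent: float) -> float:
    """Word accuracy as used on this page: 100% minus Word Error Rate."""
    return 100.0 - wer_percent


def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    return (baseline_wer - candidate_wer) / baseline_wer * 100.0


# 1.4% WER corresponds to 98.6% word accuracy.
print(round(word_accuracy(1.4), 1))
# 4.1% vs. 5.7% WER is a relative reduction of about 28%, consistent with "25%+".
print(round(relative_wer_reduction(5.7, 4.1), 1))
```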

What builders say

"By adding Corti’s API directly into our platform, we’re giving customers the latest capabilities without forcing them to learn new systems or abandon familiar workflows."

Dr. Thomas Brauner

CEO, Speech Processing Solutions

"In a clinical conversation, every word matters - a missed medication name, a misheard dosage, or a mistranscribed symptom can change the meaning of an encounter. Symphony’s accuracy on clinical terminology gives us the foundation to bring more trusted AI capabilities into clinical workflows with our Voicepoint Xenon® platform."

Pierre Corboz

Head of Solutions & Business Development, Voicepoint

What's included

Commands

Control your application interface by voice, using spoken words to trigger actions and navigate fields.

Formatting

Render dosages, units, dates, and measurements in clinical form automatically. No post-processing needed.

Diarization

Attribute speech to the right speaker across multi-participant conversations.

Interim results

Show transcript output in real time as the clinician speaks, word by word.

Audio health events

Surface input quality signals in the API response, before a bad transcript reaches your users.

Auto/spoken punctuation

Handle punctuation at the recognition layer for a natural dictation experience.

Replacement rules

Control how words, phrases, and acronyms appear in final transcript output.

Custom dictionary

Improve recognition for proper nouns, facility names, and specialty-specific terminology.

Production-ready

Global coverage across 14+ languages

One integration covers live streaming dictation, conversational transcription, and batch audio. No separate vendors. No accuracy tradeoffs between modes.

Read the docs

Build any voice-powered healthcare workflow

Symphony supports dictation, ambient documentation, and agentic workflows through a single API. No separate systems. No accuracy tradeoffs between modes.

Stream a conversation. Receive a structured transcript.

Connect to /streams for stateful, real-time conversational transcription. Audio is associated with an ongoing interaction - diarization, contextual correction, and audio health events run natively in this mode.

Speaker diarization segments the transcript by doctor and patient automatically

Audio health events surface input quality issues in real time, before the encounter ends

Get started with the Corti SDK in JavaScript and C# .NET
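To show what working with a diarized transcript can look like downstream, here is a minimal post-processing sketch that merges consecutive segments from the same speaker into turns. It is not the Corti SDK (which the page lists for JavaScript and C# .NET); the Segment shape and the speaker labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str  # hypothetical labels such as "doctor" or "patient"
    text: str


def merge_turns(segments: List[Segment]) -> List[Segment]:
    """Join adjacent segments that share a speaker into a single turn."""
    turns: List[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            turns[-1] = Segment(seg.speaker, turns[-1].text + " " + seg.text)
        else:
            turns.append(Segment(seg.speaker, seg.text))
    return turns


segments = [
    Segment("doctor", "How long have you had the cough?"),
    Segment("patient", "About two weeks."),
    Segment("patient", "It is worse at night."),
]
for turn in merge_turns(segments):
    print(f"{turn.speaker}: {turn.text}")
```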

Low-latency transcription built for command-and-control.

Connect to /transcribe for stateless, real-time dictation. Built for command-and-control workflows where spoken commands control the interface, edit text, and trigger formatting operations.

Industry-leading accuracy on medical dictation benchmarks, outperforming Dragon Medical One by 20%+

Spoken punctuation, measurements, and abbreviations handled at the recognition layer

Supply a keyterm list at inference time to bias recognition toward your vocabulary

Give your agents clean, structured input. Not raw text.

Symphony delivers structured, validated outputs your agents can act on directly. It handles the gap between what a clinician says and what your software needs to do.

Structured outputs let voice directly drive workflows, orders, and documentation

Context injection and controllable outputs mean your agent starts from verified input

Controllable outputs reduce the surface area for LLM hallucination downstream

Compare

A speech-to-text pipeline built for healthcare

Not adapted from general audio.

General-purpose APIs don't solve for clinical use.

VS. GENERIC ASR

Feature | Symphony | Generic ASR
Medical term accuracy | Native |
Real-time formatting | Native |
Spoken punctuation | Native |
Custom commands | Native |
Speaker diarization | Native | Partial
Audio health events | Native |

Legacy software wasn't designed for builders.

VS. LEGACY DICTATION

Feature | Symphony | Legacy Dictation
Medical term accuracy | Best-in-class |
Real-time formatting | Native |
Flexible developer API | Yes |
Embeddable in your app | Yes |
Speaker diarization | Native |
Structured outputs | Agent-ready |
Cursor input | |

Read the research behind Symphony for Speech-to-Text

Nine years of peer-reviewed research, published at NeurIPS, ICML, ICLR, and ACL. Now shipping as an API.

Read the research

Start building with Symphony for Speech-to-Text

$50 free credits. Full API access. No card required.

Get API key

Meet with an expert

Frequently asked questions

How accurate is Symphony for Speech-to-Text?

Symphony is top-ranked on medical dictation benchmarks across English, French, and German, and is available across 14 languages worldwide. It outperforms OpenAI, ElevenLabs, Whisper, and Parakeet on clinical audio, and matches or exceeds state-of-the-art on general-purpose benchmarks.

What does Symphony for Speech-to-Text cost?

$0.0065 per audio minute, all inclusive. Diarization, contextual correction, keyterm biasing, audio health events, multichannel speaker attribution, and all supported languages are included at no extra cost. No separate charges for transcript output.

10 minutes = $0.07. 1 hour = $0.39.
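The examples above follow from multiplying minutes by the per-minute rate and rounding to cents. A small sketch using the quoted $0.0065 rate; the helper names are illustrative, and half-up rounding is an assumption about how the page rounded $0.065 to $0.07.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_MINUTE = Decimal("0.0065")  # rate quoted on this page


def cost_usd(minutes: int) -> Decimal:
    """Cost of a given number of audio minutes, rounded half-up to cents."""
    raw = PRICE_PER_MINUTE * Decimal(minutes)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def minutes_for_budget(budget_usd: int) -> int:
    """Whole audio minutes that fit inside a budget at the quoted rate."""
    return int(Decimal(budget_usd) / PRICE_PER_MINUTE)


print(cost_usd(10))            # 0.07
print(cost_usd(60))            # 0.39
print(minutes_for_budget(50))  # 7692
```

Decimal is used instead of float so that $0.065 rounds to $0.07 rather than being affected by binary floating-point representation.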

Is Symphony HIPAA compliant?

Yes. Symphony is built for healthcare environments, with sovereign cloud deployment options for organizations with strict data residency requirements.

How does Symphony handle hallucinations?

Symphony is stress-tested on non-speech audio and aggressively segmented inputs. It reports lower spurious insertion rates than every major competing system - an important signal for production clinical deployments.

How do the three Symphony for Speech-to-Text endpoints differ?

/transcribe is a stateless WebSocket endpoint for real-time dictation and voice-controlled interfaces. /streams is a stateful WebSocket endpoint for conversational transcription, where audio is associated with an ongoing interaction. /transcripts is an async REST endpoint for batch processing of pre-recorded audio files. All three share the same underlying pipeline with no performance differences between modes.
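The choice between endpoints comes down to two questions: is the audio live, and is it tied to an ongoing interaction? A minimal decision helper, using the endpoint paths as listed in the endpoint overview on this page; the function and parameter names are made up for the example.

```python
from enum import Enum


class Endpoint(Enum):
    TRANSCRIBE = "/transcribe"    # stateless WebSocket, real-time dictation
    STREAMS = "/streams"          # stateful WebSocket, conversational audio
    TRANSCRIPTS = "/transcripts"  # async REST, pre-recorded audio


def choose_endpoint(live_audio: bool, tied_to_interaction: bool) -> Endpoint:
    """Pick an endpoint from the two properties the FAQ answer distinguishes."""
    if not live_audio:
        return Endpoint.TRANSCRIPTS
    return Endpoint.STREAMS if tied_to_interaction else Endpoint.TRANSCRIBE


print(choose_endpoint(live_audio=True, tied_to_interaction=False).value)
```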

How does Corti Symphony for Speech-to-Text compare to Dragon Medical One?

On MedDictate, a realistic English medical dictation benchmark, Symphony achieves 4.6% WER, compared to 5.7% for Dragon Medical One, with higher medical term recall and a lower false discovery rate (0.79% vs. 1.33%). Unlike front-end dictation applications, Symphony is an API built for developers - no client-side software, no vendor lock-in, and structured outputs that downstream systems can act on directly.
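For readers unfamiliar with the two term-level metrics named above, the sketch below uses their standard set-based definitions: recall is the share of reference terms that were found, and false discovery rate is the share of reported terms that are not in the reference. How MedDictate extracts and matches terms is not described here, so treat this as a simplified illustration rather than the benchmark's method.

```python
def term_recall(reference_terms, hypothesis_terms) -> float:
    """Fraction of reference medical terms present in the hypothesis."""
    ref, hyp = set(reference_terms), set(hypothesis_terms)
    if not ref:
        raise ValueError("reference term set must not be empty")
    return len(ref & hyp) / len(ref)


def false_discovery_rate(reference_terms, hypothesis_terms) -> float:
    """Fraction of hypothesis medical terms absent from the reference."""
    ref, hyp = set(reference_terms), set(hypothesis_terms)
    if not hyp:
        return 0.0
    return len(hyp - ref) / len(hyp)


# Made-up example terms, not benchmark data.
reference = ["metoprolol", "apixaban", "atrial fibrillation", "warfarin"]
hypothesis = ["metoprolol", "apixaban", "atrial fibrillation", "amlodipine"]
print(term_recall(reference, hypothesis))           # 0.75
print(false_discovery_rate(reference, hypothesis))  # 0.25
```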

How do I evaluate Symphony?

Every developer gets $50 of API credits to start - enough to process roughly 7,700 minutes of audio at $0.0065 per minute.

Get an API key, run your own audio through the appropriate endpoint, and measure results against a gold standard transcript using Corti Canal, our open-source evaluation tool that reports Word Error Rate, Character Error Rate, and Medical Term Recall.

If you want help setting up a benchmark or interpreting results, reach out to help@corti.ai.
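For a quick sanity check before setting up a full evaluation, Word Error Rate can be computed with a word-level edit distance. This is a generic textbook sketch, not Corti Canal; it splits on whitespace and does no text normalization, so casing and punctuation differences count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


gold = "take 5 mg of apixaban twice daily"
output = "take 5 mg of a pixaban twice daily"
# One substitution plus one insertion against 7 reference words.
print(round(word_error_rate(gold, output), 3))
```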

Build the next generation of dictation applications. Without the legacy constraints.

More from Corti on Speech-to-Text

Symphony for Speech-to-Text: Behind the research supporting real-time medical voice interfaces

Why voice-first healthcare AI needs medical-grade speech-to-text pipelines

Better alignment, better evaluation: towards a new evaluation paradigm for speech to text
