Better alignment, better evaluation: towards a new evaluation paradigm for speech recognition

Related research
A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems
arXiv
2025

If you’ve ever tried to evaluate a speech recognition model, you’ve likely ended up deep in spreadsheets of word error rates and sample transcripts. You might even have felt that something was missing. Why does a model with the lowest WER still produce transcripts that sound awkward, or worse, unsafe in sensitive domains? That’s the problem this work addresses.

Evaluating speech recognition isn’t just about counting word mistakes. It’s about understanding how those mistakes happen, which ones matter, and how they affect the people relying on the model’s output. A misheard filler word might be harmless, but a misheard medication name can change the course of care.

This piece walks through how evaluation can evolve beyond word error rate, grounded in peer-reviewed research and practical examples. It’s a story about better alignment, reproducible analysis, and what it takes to make ASR evaluation truly meaningful.

Limitations of word error rate

Today’s evaluation of automatic speech recognition (ASR) systems remains largely confined to comparisons based on word error rate (WER). The most frequently cited benchmark for both open-source and proprietary speech recognition models - the Hugging Face ASR leaderboard - ranks models by their average word error rate across eight datasets. A closer examination of the underlying technical reports for the top five entries reveals occasional use of downstream task evaluations, hallucination robustness, and character error rate (CER) as supplementary metrics. Is this rather limited evaluation toolkit a problem?

The word error rate is undoubtedly a useful summary metric, providing a rough estimate of how frequently words are likely to be misrecognized, but it also has significant limitations. Even so, its dominance has shaped how progress in speech recognition is defined, often prioritizing score improvements over comprehension. Word error rate improvements don't always translate into better user experiences. This is especially critical in domains like healthcare, where one misheard phrase can have serious consequences while another may be entirely inconsequential.

Word error rate is calculated by finding the minimum number of single-word edit operations it takes to convert a reference transcript into a model-generated transcript, and then dividing by the total number of words in the reference. The edit operations include insertions, deletions, and substitutions - an edit distance measure known as the Levenshtein distance. The fact that this distance measure treats words as indivisible units is a root cause of many pitfalls associated with the word error rate metric. We’ll explore these in further detail below.

  • Conventions: In English, many common words and phrases have contracted and informal forms that make writing more concise. Human transcribers often favor these shortcuts. For example, a person might write “I’m kinda ok,” while a speech recognition model might transcribe the same sentence as “I am kind of okay.” Although the sentences are semantically equivalent, the word error rate clocks in at 167% (see the worked example after this list).
  • Exaggeration: The example above also highlights another limitation. The contraction “I’m” and the informal “kinda” require multiple edits to correct. Other errors that replace a single word with multiple words might of course be more severe than the ones exemplified here. However, consistently doubling the penalty for these errors is a crude simplification.
  • Importance: The word error rate accounts for neither the severity of individual errors nor the relative importance of specific words. In a medical application, unproblematic errors like the ones discussed above are still penalized heavily, while mistaking hydroxyzine (an antihistamine) for hydralazine (a blood pressure medication) is penalized less, even though the latter could have serious consequences.
  • Interpretation: It’s tempting to interpret the word error rate as the relative frequency of misrecognized words, but this can be misleading. A word error rate of 20% does not necessarily mean that one in five words were misspelled. First off, the exaggeration issue discussed above is a common source of inflation. Second, a model may in fact transcribe all spoken words correctly, yet still be penalized for outputting words that were never spoken - so-called hallucinations.

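To make the arithmetic concrete, here is a minimal word-level WER computation in Python - a plain Wagner–Fischer edit distance, not the released implementation - applied to the contraction example above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Minimum word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1] / len(ref)

# Three reference words, five edits (three substitutions, two insertions).
print(word_error_rate("I'm kinda ok", "I am kind of okay"))  # 5 / 3 ≈ 1.67, i.e. 167%
```
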
These limitations do not mean that we should abandon word error rate altogether.

But they do highlight that, to better understand transcription errors and select the best model for a given application, we need to examine errors more closely. A crucial first step is the ability to accurately capture how a model transcribed individual words.

Unfortunately, the current best-practice for computing such alignments relies on the Levenshtein-based edit operations introduced above. In the following sections, we first outline why this approach is insufficient. We then introduce a novel alignment algorithm and demonstrate how alignment can help us dissect speech recognition performance.

Shortcomings of conventional alignment

Let’s begin by considering what the ideal alignments should look like. Our primary goal is to answer questions such as, “How was each instance of hydralazine spelled?” To achieve this, we need a single alignment operation for every word in the reference transcript. At the same time, we want the flexibility to align a single reference word with multiple or partial words in the model transcript.

As we have already seen, Levenshtein-based alignment fails to capture cases where a single word in the reference corresponds to multiple words in the model transcript. The reverse is also true: multiple words in the reference cannot be aligned with a single word generated by the model.

Another limitation of Levenshtein-based alignment is that it does not account for the quality of individual operations. For example, when a substitution occurs adjacent to an insertion, there is no mechanism to determine which term should be substituted and which should be inserted, leaving the decision essentially to a coin flip.

These limitations arise from the fact that words are treated as atomic units, which does not account for the character composition of individual words. We can get a better understanding of the issue by taking a look at the Wagner–Fischer algorithm, which is used to compute the Levenshtein distance and its corresponding alignments.

The algorithm relies on so-called dynamic programming to compute the optimal distance between all possible prefixes of two texts. The process involves constructing a cost table that represents edit distances, which is filled as illustrated in Figure 1 below.

The first row and first column of the table are initialized with numbers incremented from zero up to the length of the text represented on the corresponding axis. Each subsequent cell is then filled recursively by selecting the optimal transition from the three adjacent cells on the upper-left side of the target cell.

A horizontal transition represents a deletion, a vertical transition represents an insertion, and a diagonal transition represents either a match or a substitution, depending on whether the words on both axes are identical for that cell. In this example, all transitions incur a cost of 1, except for a match, which has a cost of 0.

Constructing this matrix only gives us the minimum edit distance, which is the number in the bottom-right corner. To obtain the corresponding alignments, we need to track how we arrived at each cell by keeping so-called back pointers - the arrows between adjacent cells in Figure 1. When we have computed the minimum distance, we can backtrace the optimal alignment by following the back pointers to the starting point in the top-left corner.

Figure 1: In order to extract Levenshtein-based alignments, one has to first compute the Levenshtein distance while registering back pointers (light blue arrows) as the cost table is filled. Multiple alignments may be equivalent (red arrows) as there is no mechanism to decide which words are most likely to be substituted. The final alignment is found by backtracking along the back pointers (green arrows).
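
For reference, here is a compact Python sketch of the procedure just described - a generic textbook version rather than any particular library's implementation - that fills the cost table, records back pointers, and backtraces to recover the word-level alignment.

```python
def levenshtein_alignment(reference: str, hypothesis: str):
    """Word-level Levenshtein alignment via Wagner-Fischer with back pointers."""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    # cost[i][j]: edit distance between ref[:i] and hyp[:j]; back[i][j]: chosen transition.
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, n + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            candidates = [
                (cost[i - 1][j - 1] + sub, "match" if sub == 0 else "sub"),
                (cost[i - 1][j] + 1, "del"),
                (cost[i][j - 1] + 1, "ins"),
            ]
            # Ties are broken arbitrarily - the "coin flip" issue discussed above.
            cost[i][j], back[i][j] = min(candidates)
    # Backtrace from the bottom-right corner to the top-left corner.
    i, j, alignment = m, n, []
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("match", "sub"):
            alignment.append((op, ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif op == "del":
            alignment.append((op, ref[i - 1], None)); i -= 1
        else:  # "ins"
            alignment.append((op, None, hyp[j - 1])); j -= 1
    return cost[m][n], list(reversed(alignment))

distance, alignment = levenshtein_alignment("I'm kinda ok", "I am kind of okay")
print(distance)  # 5
for op, ref_word, hyp_word in alignment:
    print(op, ref_word, hyp_word)
```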

You might wonder why we don’t simply perform the alignment at the character level instead of the word level. Although this is a step in the right direction, it would not result in word-level alignments, which is what we are interested in. A character-level alignment would not enable us to answer questions like “How was each instance of hydralazine spelled?”, but instead “How often was the letter i replaced with the letter e?” This is not what we want, so we would still need to devise a set of rules to map the character-level alignments back to words.

Furthermore, the principle of simply minimizing the character-level edit distance does not necessarily yield accurate alignments. Because the number of distinct characters is limited, arbitrary matches are likely to occur, often producing highly implausible alignments. This problem is especially pronounced when working with speech recognition models, which are prone to hallucinating entire words or phrases. Consider a simple example where the reference transcript just reads “column”, but the model transcribes “colum hallucination”. The missing n in column would inevitably be matched to one of the n’s in the word hallucination, producing a clearly incorrect alignment.

A new alignment algorithm

Our approach builds on the final two observations from the previous section. We propose an algorithm that operates at the character level but applies a rule set to produce word-level alignments. And unlike traditional methods, it is not restricted to alignments that directly minimize the edit distance.

An entirely unconstrained search is infeasible, which is precisely why dynamic programming has been so widely adopted for this purpose. Instead, we first compute Levenshtein-based alignments at the character level. This provides a foundation for a beam search that remains close to the backtrace alignments but can deviate when necessary.

Before processing, we embed each word in angle brackets. For example, the sentence “an example” becomes <an><example>. These brackets serve as markers for the rule set, indicating where each word-level alignment begins and ends.

Instead of assigning equal cost to all transitions, we introduce a fine-grained cost function informed by basic linguistic properties. Insertions or deletions of unvoiced characters - such as apostrophes or the angle bracket delimiters - are penalized less than those involving voiced characters. Similarly, substituting a vowel with a consonant is considered less plausible than substituting characters of the same type. The full procedure is illustrated in Figure 2 below.

Figure 2: The alignment is obtained by first computing the character-level edit distance and constructing the corresponding backtrace (red arrows). A beam search (blue arrows), using the backtrace as an anchor, then identifies the optimal alignment (green arrows). 
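
To make this more tangible, the sketch below shows the bracket preprocessing together with an illustrative cost function. The specific penalty values and character sets are assumptions chosen for demonstration, not the weights used in the paper; the actual cost function lives in the GitHub implementation.

```python
from typing import Optional

VOWELS = set("aeiouy")
UNVOICED = set("<>'’-")  # word delimiters and other characters that carry no sound

def bracket(text: str) -> str:
    """Embed each word in angle brackets so word boundaries survive character-level alignment."""
    return "".join(f"<{word}>" for word in text.split())

def transition_cost(ref_char: Optional[str], hyp_char: Optional[str]) -> float:
    """Illustrative character-level costs: cheaper for unvoiced characters and same-type substitutions."""
    if ref_char is None or hyp_char is None:  # insertion or deletion
        char = hyp_char if ref_char is None else ref_char
        return 0.5 if char in UNVOICED else 1.0
    if ref_char == hyp_char:  # match
        return 0.0
    same_type = (ref_char in VOWELS) == (hyp_char in VOWELS)
    return 1.0 if same_type else 1.5  # cross-type substitutions are less plausible

print(bracket("an example"))        # <an><example>
print(transition_cost("a", "e"))    # 1.0: vowel-for-vowel substitution
print(transition_cost("a", "t"))    # 1.5: vowel-for-consonant substitution
print(transition_cost(None, ">"))   # 0.5: inserting an unvoiced delimiter
```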

We hope this high-level introduction gave you a basic understanding of our work. If you want to dive deeper, check out our full paper on arXiv, the algorithm implementation on GitHub, or the corresponding PyPI package. 

Improving understanding with alignment

Let us now examine how alignments can aid in error analysis and how improved alignments provide a more informative view of the errors. We use the popular Whisper model and apply it to the PriMock57 dataset, a collection of 57 simulated primary care consultations.

The PriMock57 dataset contains conversational speech, and its reference transcripts reflect the informal nature of these interactions. Terms like “alright,” “gonna,” and “wanna” appear instead of the more formal “all right,” “going to,” and “want to” preferred by the model. In addition, the reference transcripts include vocalized fillers such as “um” and “uh,” which the model rarely transcribes. These two fillers alone account for nearly 12% of all model errors.

The examples above represent relatively common cases - issues that could be detected even without an accurate alignment algorithm. However, rare terms that have little impact on overall word error rate but are critical to transcribe correctly present a different challenge. Consider the alignments for misspelled drug names shown in the table below. The Levenshtein-based alignments are largely uninformative, whereas our method reveals that the transcribed phrases are actually phonetically similar. Integrating a language model more tightly into the speech recognition decoder could help resolve these errors.
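
Setting the table aside for a moment, the snippet below illustrates how such per-term questions become trivial once word-level alignments are available. The (reference word, aligned hypothesis text) pair structure and the toy data are assumptions for illustration, not output from our package.

```python
from collections import Counter

# One (reference_word, aligned_hypothesis_text) pair per reference word; the exact
# structure is an assumption here - any word-level alignment output would do.
alignments = [
    ("patient", "patient"), ("takes", "takes"),
    ("hydroxyzine", "hydroxy zine"), ("daily", "daily"),
    ("hydroxyzine", "hydralazine"),
]

def transcriptions_of(term: str, alignments) -> Counter:
    """Collect every hypothesis span aligned to a given reference term."""
    return Counter(hyp for ref, hyp in alignments if ref == term)

print(transcriptions_of("hydroxyzine", alignments))
# Counter({'hydroxy zine': 1, 'hydralazine': 1})
```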

We previously described how Levenshtein-based alignment does not account for the qualitative characteristics of individual operations, often resulting in somewhat arbitrary substitutions. To see how this manifests in practice, we can examine the one-to-one alignments that differ most between the two approaches. The table below presents a short selection involving medically relevant information, but the list goes on.

Finally, we can use some of the above observations to arrive at a more truthful word error rate. By adjusting our normalization procedure to account for the most common differences in conventions around informal language, vocalized fillers, and abbreviations, and by discounting one-to-many substitutions, the overall word error rate decreases from 15.4% to 10.2%. If a paper reported a similar improvement resulting from a new modeling or optimization technique, it would likely draw considerable attention.
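
The adjustments themselves can remain simple and transparent. The following is an illustrative normalizer - a hypothetical, much shorter rule set than the one actually used - that expands a few informal forms and drops vocalized fillers before the word error rate is computed.

```python
import re

# Illustrative convention mappings - the real list is longer.
EXPANSIONS = {"i'm": "i am", "kinda": "kind of", "gonna": "going to",
              "wanna": "want to", "alright": "all right", "ok": "okay"}
FILLERS = {"um", "uh"}

def normalize(text: str) -> str:
    """Lowercase, expand informal forms, and drop vocalized fillers."""
    words = re.findall(r"[\w']+", text.lower())
    expanded = " ".join(EXPANSIONS.get(w, w) for w in words).split()
    return " ".join(w for w in expanded if w not in FILLERS)

print(normalize("Um, I'm kinda OK"))   # "i am kind of okay"
print(normalize("I am kind of okay"))  # "i am kind of okay"
# With both sides normalized, the contraction example above no longer counts as five errors.
```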

Next steps

This work is part of a larger, ongoing effort to improve and simplify the evaluation of speech recognition models. The examples above highlight how more accurate alignment represents a crucial step in that direction. However, a toolkit for bespoke analysis is not the ultimate goal. Instead, accurate alignment should be integrated into standardized and reproducible workflows that enable practitioners to efficiently break down and interpret model performance.

Reproducibility is a keyword here. A natural question arises: why not use a large language model to address the alignment problem described earlier? After all, some terms are nearly impossible to align correctly without broader world knowledge. Take for instance the word “percent” and its corresponding symbol %. While a large language model might outperform a deterministic algorithm in such cases, it also raises questions about which model to use and how that choice affects the comparability and reproducibility of different evaluation runs. Not to mention the substantial inference cost.

This is not to suggest that large language models have no role to play. Rather than performing the evaluation itself, they can support the process by preparing data for more detailed analysis. For example, a language model might help identify which words are critical versus redundant, or generate alternative forms of informal expressions and abbreviations, as discussed above.

With a standardized framework for releasing evaluation data that includes language-model-enriched transcripts, computing metrics such as medical term recall (MTR) becomes much more accessible. Medical term recall focuses specifically on the most critical terms in the reference transcripts, allowing practitioners to quantify how well a model captures the terms most important for their application.
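
As a rough sketch of what such a metric could look like (the precise definition may differ), medical term recall can be computed directly from word-level alignments and a list of critical terms, reusing the pair structure assumed in the earlier sketch.

```python
def medical_term_recall(alignments, critical_terms) -> float:
    """Fraction of critical reference terms that were transcribed exactly.

    `alignments` is assumed to hold one (reference_word, aligned_hypothesis_text)
    pair per reference word, as in the earlier sketch.
    """
    relevant = [(ref, hyp) for ref, hyp in alignments if ref in critical_terms]
    if not relevant:
        return 1.0  # nothing critical to recall
    correct = sum(ref == hyp for ref, hyp in relevant)
    return correct / len(relevant)

alignments = [("takes", "takes"), ("hydroxyzine", "hydralazine"),
              ("twice", "twice"), ("daily", "daily"), ("hydroxyzine", "hydroxyzine")]
print(medical_term_recall(alignments, {"hydroxyzine", "hydralazine"}))  # 0.5
```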

As we continue this work, our goal is to develop a transparent and reproducible evaluation framework for speech recognition. A framework that combines accurate alignment, flexible analysis, and thoughtful use of language models to improve how model performance is understood and compared. Follow along for more.

Resources