Transcription Accuracy: What WER Really Measures and Why Marketing Numbers Fall Short

"95% accuracy." This number appears in the marketing materials of almost every transcription service. What exactly does it measure, what data was it based on, and what does it say about your specific recording? The answers are less comfortable than they might seem — but far more useful.


How Transcription Accuracy Is Measured

Transcription accuracy is evaluated by comparing the output against a reference text — a manually created transcript of the same recording that serves as the "gold standard." Deviations of the automatic transcript from the reference are then quantified using a metric.

Word Error Rate — The Most Widely Used Metric

WER (Word Error Rate) is the de facto standard for evaluating transcription systems. The formula:

WER = (S + D + I) / N × 100

where S = word substitutions, D = deleted words, I = inserted words, N = total number of words in the reference text.

Example: the reference text is "Mr. Smith will arrive on Tuesday at nine o'clock" (9 words). The model returns "Mr. Smith will arrive on Tuesday at ten o'clock" — one substitution ("nine" transcribed as "ten"). WER = 1 / 9 × 100 ≈ 11.1%.
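The computation above can be sketched as a standard word-level edit distance (Levenshtein) over the two token sequences. A minimal illustration, with no text normalization applied:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "Mr. Smith will arrive on Tuesday at nine o'clock"
hyp = "Mr. Smith will arrive on Tuesday at ten o'clock"
print(f"WER = {word_error_rate(ref, hyp):.1%}")  # prints "WER = 11.1%"
```

Note that this version counts "Tuesday." and "Tuesday" as different words; real evaluation pipelines normalize punctuation and case first.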

What WER captures: word substitutions, missing words, extra words. What WER does not capture: punctuation, capitalization, or diacritical marks. It also treats word order crudely — two transposed words are counted as two separate errors, not as a single transposition.

Character Error Rate — Better for Morphologically Rich Languages

CER (Character Error Rate) counts errors at the character level rather than the word level. For morphologically complex languages, this has an advantage: in WER, a word counts as fully wrong whether the error is one character or the whole word, so in an inflected language like Czech a wrong word ending costs as much as a completely different word. CER distributes the error proportionally to its size.

Example: "hospital" vs. "hospiral" — WER = 100% (entire word wrong), CER = 1/8 = 12.5% (one character wrong). CER better reflects the actual error rate for languages with long, inflected words.
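The same edit-distance recurrence works at the character level. A minimal sketch reproducing the "hospital" example:

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: Levenshtein distance over characters, divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong character out of eight:
print(f"CER = {char_error_rate('hospital', 'hospiral'):.1%}")  # prints "CER = 12.5%"
```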

Less Common but More Precise Metrics

Match Error Rate (MER) and Word Information Lost (WIL) refine WER by emphasizing information loss: MER expresses errors as a fraction of all alignment operations, while WIL estimates how much of the word information in the reference was lost in the hypothesis. Morris et al. (2004) showed that these metrics correlate better with subjective quality assessments of transcriptions. In practice, you will mostly encounter WER — but knowing that better alternatives exist is useful for critically reading benchmarks.
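Given alignment counts from the same edit-distance alignment used for WER — hits H, substitutions S, deletions D, insertions I — both metrics reduce to simple arithmetic. A sketch of the definitions as stated in Morris et al. (2004); verify against the paper before relying on them:

```python
def mer(h: int, s: int, d: int, i: int) -> float:
    """Match Error Rate: errors as a fraction of all alignment operations."""
    return (s + d + i) / (h + s + d + i)

def wil(h: int, s: int, d: int, i: int) -> float:
    """Word Information Lost: 1 - H^2 / (N1 * N2),
    where N1 = reference length, N2 = hypothesis length."""
    n1 = h + s + d
    n2 = h + s + i
    return 1 - (h * h) / (n1 * n2)

# The "nine"/"ten" example: 8 hits, 1 substitution, nothing deleted or inserted.
print(f"MER = {mer(8, 1, 0, 0):.3f}, WIL = {wil(8, 1, 0, 0):.3f}")
```

Note how WIL (≈ 0.21 here) penalizes the same single substitution more heavily than MER or WER (≈ 0.11), because the substitution loses information on both the reference and hypothesis sides.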


Why Marketing Numbers Don't Reflect Your Recording

The accuracy number depends on the test set. Test sets used in marketing materials are chosen to make results look as good as possible. And they are far removed from the conditions of a typical user recording.

Standard Test Sets and Their Limits

LibriSpeech is a foundational corpus of ASR research. It contains 960 hours of English audiobook readings — studio conditions, one speaker per file, clean audio, standard English vocabulary. Whisper large-v3 achieves a WER of approximately 2.7% on the LibriSpeech test-clean split (Radford et al., 2022). That is an exceptional result — but under exceptionally favorable conditions.

TED-LIUM consists of TED talk recordings. Conditions are more realistic than LibriSpeech, but it is still prepared, grammatically correct English speech in a professional recording environment.

What these datasets do not contain: non-English languages, office noise, informal conversational style, spontaneous multi-speaker conversation, domain-specific terminology. Panayotov et al. (2015) described LibriSpeech as a carefully designed dataset — but its conditions are deliberately controlled.

Your Recording Is a Different World

The same Whisper large-v3 model that achieves 2.7% WER on LibriSpeech can reach 20-35% WER on spontaneous conversational speech from a meeting with background noise. This is not a model failure — it is fundamentally different conditions.

What the marketing number does not capture: your language, your acoustic conditions, your speaking style, and your vocabulary — exactly the factors examined in the next section.


What Actually Affects Transcription Accuracy in Practice

Five factors determine transcription accuracy. Four of them are on the recording side — only the last is on the model side.

1. Audio quality: SNR (signal-to-noise ratio) is the strongest predictor of transcription accuracy. The difference between 5% WER in a quiet environment and 20-25% WER in a noisy one is consistently documented in ASR research. For tips on preparing your recording for best results, see A12.

2. Language and dialect: The further from the training data, the worse the result. Non-English languages are disadvantaged by smaller data volumes. Regional accents or dialects complicate things further.

3. Speech rate and fluency: Fast, overlapping, or interrupted speech increases error rates. Spontaneous conversation is significantly harder for models than prepared speech.

4. Vocabulary: The model guesses at terms outside its training data — medical terminology, legal language, internal company jargon. The result can be surprisingly good (the language model infers from context) or poor (the model prefers a phonetically similar common word).

5. Model selection: This matters only after the first four factors are accounted for. Different models have different strengths — but no model can overcome the physical limits of a poor recording.


How to Evaluate Accuracy in Practice — Without a Lab

The best evaluation is your own testing on your own recording. Three approaches require no complex setup.

Manual Comparison on a Sample

Manually transcribe 2-3 minutes of your recording. Compare with the automatic transcript. Count substitutions, deletions, and insertions. The result is the WER for your specific recording type with this specific model.

Advantage: most accurate for your case. Disadvantage: time-consuming. Recommendation: test on a representative sample — not the easiest part of the recording, but a typical one.
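Before counting errors against your manual transcript, normalize both texts, or punctuation and capitalization differences will inflate the count. A minimal sketch of such a normalization step (the exact rules — e.g. whether to strip apostrophes — are a judgment call for your language):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, so that
    'Tuesday.' and 'tuesday' are not counted as a substitution."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Mr. Smith will  arrive on Tuesday."))
# prints "mr smith will arrive on tuesday"
```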

Confidence Score as a Guide

Transcription models assign each transcribed word a confidence score (0-1 or 0-100%). Words with low scores are candidates for error — prioritize checking these.

Limitation: the model can have a high confidence score even for an incorrectly transcribed word. High confidence does not mean correctness — confidence reflects the model's internal belief, not the truth of the result. For more on confidence scores, see A19.
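Prioritizing low-confidence words is a simple filter. The word/probability dictionary shape below is an assumption for illustration — actual output formats vary by tool, and the 0.6 threshold is arbitrary:

```python
# Hypothetical word-level output with per-word confidence scores.
words = [
    {"word": "Smith",  "probability": 0.97},
    {"word": "arrive", "probability": 0.91},
    {"word": "ten",    "probability": 0.42},  # low confidence: check this first
]

THRESHOLD = 0.6  # arbitrary cut-off; tune to your tolerance for re-listening
to_review = [w["word"] for w in words if w["probability"] < THRESHOLD]
print(to_review)  # prints ['ten']
```

Remember the limitation above: this surfaces the model's own doubts, not its mistakes — confidently wrong words pass the filter.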

Comparing Multiple Models on the Same Recording

Run the recording through two or three different models. Where results agree — likely correct. Where they differ — verify by listening.

This method requires no reference transcript. It is a practical alternative for quick quality assessment. It also forms the basis for result fusion A13 and for quality evaluation without a reference text A35.
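Finding the disagreement spans between two model outputs can be done with a standard sequence alignment from the standard library. A sketch using `difflib.SequenceMatcher` over word lists:

```python
import difflib

model_a = "Mr. Smith will arrive on Tuesday at nine o'clock".split()
model_b = "Mr. Smith will arrive on Tuesday at ten o'clock".split()

matcher = difflib.SequenceMatcher(a=model_a, b=model_b)
disagreements = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":  # 'replace', 'delete', or 'insert'
        disagreements.append((model_a[i1:i2], model_b[j1:j2]))

print(disagreements)  # prints [(['nine'], ['ten'])] — the span to verify by listening
```

Everything outside `disagreements` is where the models agree and is likely correct; only the listed spans need to be checked against the audio.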


Conclusion

An accuracy number on a website is a directional indicator — not a guarantee. For the number to be meaningful, you need to know: what language it was measured on, under what acoustic conditions, and with what vocabulary.

Instead of searching for the "most objective" benchmark: invest thirty minutes in your own test. Record your typical type of audio, run it through the model you are evaluating, and count where it fails. That is information relevant to your needs — not to the conditions of an English lab.


References

  1. Morris, A., Maier, V. & Green, P. (2004). From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition. INTERSPEECH 2004.
  2. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. ICASSP 2015. [doi:10.1109/ICASSP.2015.7178964]
  3. Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv. [doi:10.48550/arXiv.2212.04356]