How to Evaluate Transcription Quality Without a Reference Text
WER (Word Error Rate) is the gold standard for evaluating automatic transcription accuracy. But it requires a manually transcribed reference text for every recording you want to evaluate. In production, where the system processes hundreds or thousands of recordings per day, creating reference texts is economically infeasible. Yet you still need to know whether your transcription is working correctly. There are approaches that estimate quality even without a reference text, each with different assumptions, advantages, and limitations.
Why WER Is Not Enough for Production Monitoring
WER is calculated as the ratio of the number of errors (substitutions, deletions, and insertions) to the total number of words in the reference text: WER = (S + D + I) / N. To compute it, you need a manually verified reference transcript for each recording. This makes sense during model development, benchmarking, or evaluating fine-tuning — it is a one-off or planned expense with clear value.
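The formula maps directly onto a dynamic-programming edit distance over word lists. A minimal sketch in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via Levenshtein distance on words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" → "sit") and one deletion ("the") in 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # ≈ 0.333
```

Production tools typically also report S, D, and I separately; this sketch returns only the combined rate.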
For ongoing monitoring of a production system, the situation is different. If your system processes 500 calls per day and 1 minute of audio requires 3 minutes of manual review, you need several full-time positions just for reference texts to achieve 100% coverage. This eliminates the benefit of automation.
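The arithmetic is easy to make concrete. A short sketch, assuming an average call length of 5 minutes (the text does not specify one, so this figure is illustrative):

```python
calls_per_day = 500       # volume from the text
avg_call_minutes = 5.0    # ASSUMPTION: average call length, not stated in the text
review_ratio = 3.0        # 3 minutes of manual review per 1 minute of audio
workday_minutes = 8 * 60

review_minutes = calls_per_day * avg_call_minutes * review_ratio
fte = review_minutes / workday_minutes
print(f"{fte:.1f} full-time reviewers needed for 100% coverage")
```

Under these assumptions the answer is well over a dozen full-time positions, which is exactly the cost that eliminates the benefit of automation.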
When to use WER: when selecting or comparing models, when evaluating fine-tuning, and during regulatory audits that require documented accuracy. For production monitoring, alternatives are necessary.
Approach 1: Confidence Score as an Approximate Metric
What Confidence Score Expresses
Most modern transcription models assign each recognized word or segment a confidence score in the range 0 to 1. Technically this is typically derived from the softmax probabilities of the decoder's output layer: not a calibrated statistical probability, but a relative measure of model certainty. A word with confidence 0.95 is one the model considers significantly more certain than a word with confidence 0.45.
Blind Spots and Limits
The critical limitation of confidence score: the model can be very confident and still be wrong. Typical cases are proper names, neologisms, and terms the model does not know from its training. Example: a model transcribes the surname "Kowalczyk" as "Kowalski" and assigns it confidence 0.92 — the word sounds similar, the model does not question it, yet the transcript is wrong.
The correlation between confidence and actual accuracy depends on the calibration of the specific model. Well-calibrated models have a strong correlation — high confidence genuinely predicts lower WER; less calibrated models do not have this property. When choosing a model, it is worth verifying this calibration on a sample.
How to Use Confidence in Practice
Automatically flag transcripts with low average confidence (typically below 0.75) for manual review. Visual highlighting of low-certainty words directly in the editor helps the reviewer quickly find problematic areas without reading the entire text. Czech Transcription System displays confidence per-word and per-segment in the interface — transcripts with problematic sections are immediately visible.
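A minimal flagging routine along these lines, assuming the engine returns (word, confidence) pairs; the output format is illustrative, not any particular system's API:

```python
REVIEW_THRESHOLD = 0.75  # from the text: flag transcripts below 0.75

def flag_for_review(words):
    """words: list of (word, confidence) pairs from the transcription engine."""
    avg = sum(conf for _, conf in words) / len(words)
    low_words = [w for w, conf in words if conf < REVIEW_THRESHOLD]
    return {
        "avg_confidence": round(avg, 3),
        "needs_review": avg < REVIEW_THRESHOLD,   # whole-transcript flag
        "low_confidence_words": low_words,        # candidates for highlighting
    }

transcript = [("hello", 0.98), ("mr", 0.95), ("kowalski", 0.92), ("um", 0.41)]
print(flag_for_review(transcript))
```

Note that the per-word list matters even when the average passes: a transcript can clear the threshold overall while still containing isolated words worth highlighting in the editor.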
Approach 2: Inter-Model Agreement
The Principle
Send the same recording to multiple transcription models and compare their results. Where models produce the same word, the result is likely correct — if four of five models write "London," that is almost certainly right. Where models differ, this is a sign of uncertainty and a candidate for manual review.
The intuition: different models have different errors. If model A incorrectly recognizes a certain acoustic pattern and model B recognizes it correctly, their agreement on a result is a stronger signal of correctness than the result of a single model alone.
How to Measure Agreement
The technical implementation uses word-level output alignment (ROVER — Recognizer Output Voting Error Reduction). The result is the percentage of words where at least k of n models agree. A low agreement rate on a certain segment is a proxy for probable error.
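Full ROVER builds a word transition network to align outputs of different lengths; for illustration, a simplified sketch that assumes the outputs are already aligned position by position and applies k-of-n voting:

```python
from collections import Counter

def agreement(aligned_outputs, k):
    """aligned_outputs: one word list per model, already aligned position by
    position (real ROVER handles insertions/deletions via a word transition
    network). Returns (voted_word, at_least_k_models_agree) per position."""
    result = []
    for words in zip(*aligned_outputs):
        word, votes = Counter(words).most_common(1)[0]
        result.append((word, votes >= k))
    return result

outputs = [
    ["flights", "to", "london", "tomorrow"],
    ["flights", "to", "london", "tomorrow"],
    ["flights", "two", "london", "tomorrow"],
]
for word, confident in agreement(outputs, k=3):
    print(word, "OK" if confident else "REVIEW")
```

With k=3, the "to"/"two" position falls below the agreement threshold and gets flagged for review, while the unanimous positions pass.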
Czech Transcription System can combine multiple transcription engines and measure agreement between their outputs automatically. The merge layer can then favor segments with higher consensus, making this approach implementable directly in the system architecture.
Advantages and Limitations
Main advantage: inter-model agreement catches the systematic errors of any one model, because the other models are unlikely to repeat the same mistake. It works without a reference text and provides a signal across the entire recording.
Limitation: if all models share the same blind spot — for example due to similar training data or the same phonological pattern — consensus will still be wrong. And it requires transcription from multiple models, which increases processing costs for each recording.
Approach 3: Sampling Review (Human Spot-Checking)
Statistical Basis
You cannot review 100% of transcripts, but you can review a statistically representative sample. Basic statistics tells us that at 95% confidence and a ±5% margin of error, a sample of roughly 385 transcripts is sufficient to estimate overall accuracy, regardless of population size; for a system processing thousands of recordings per week, that is only a few percent of the volume. The result is a statement like: "We estimate the average WER this week at 8 ± 2%."
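The sample size follows from Cochran's formula for estimating a proportion, with an optional finite-population correction. A sketch:

```python
import math

def sample_size(margin=0.05, z=1.96, p=0.5, population=None):
    """Cochran's sample-size formula for estimating a proportion.
    margin: half-width of the confidence interval (±5% by default)
    z: z-score for the confidence level (1.96 for 95%)
    p: assumed proportion; 0.5 is the worst case (largest sample)
    population: if given, apply the finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # ≈ 384.2 for the defaults
    if population is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size())                  # large-population sample size
print(sample_size(population=3500))   # one week's volume at 500 calls/day
```

The finite-population correction only matters when the sample is a noticeable fraction of the total; at production volumes the uncorrected figure is the one to plan around.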
Stratified Sampling
Random sampling from all transcripts gives an average but hides variation across specific categories. A better approach is stratified sampling: take a random sample from each category separately — phone calls, lectures, research interviews, call center recordings. If one category starts showing worse accuracy, the system catches this sooner than an overall average would reveal.
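A stratified draw is a few lines of code, assuming each transcript record carries a category label (the field names here are illustrative):

```python
import random

def stratified_sample(transcripts, per_category=20, seed=42):
    """transcripts: list of dicts with a 'category' key.
    Draws a fixed-size random sample from each category separately,
    so low-volume categories are not drowned out by high-volume ones."""
    rng = random.Random(seed)  # fixed seed for a reproducible audit sample
    by_cat = {}
    for t in transcripts:
        by_cat.setdefault(t["category"], []).append(t)
    return {cat: rng.sample(items, min(per_category, len(items)))
            for cat, items in by_cat.items()}

pool = ([{"id": i, "category": "call_center"} for i in range(500)]
        + [{"id": i, "category": "lecture"} for i in range(40)])
sample = stratified_sample(pool, per_category=20)
print({cat: len(items) for cat, items in sample.items()})
```

With plain random sampling, the 40 lectures would contribute only one or two items to a 20-item sample; the stratified draw guarantees each category enough coverage to show its own trend.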
Organizing the Review
A reviewer compares the machine transcript with playback of the recording and records errors or a subjective readability rating. Indicative timing: one minute of transcript requires two to four minutes of review. A regular weekly or monthly report tracks the accuracy trend over time and can detect model degradation or changes in the quality of incoming recordings.
Comparing the Approaches
| Approach | What it measures | Advantages | Limitations | When to use |
|---|---|---|---|---|
| Confidence score | Per-word model certainty | Automatic, fast, no extra cost | Blind spots, depends on model calibration | Ongoing monitoring, visual review in editor |
| Inter-model agreement | Agreement of multiple models on result | Catches systematic errors, no reference needed | Shared blind spots, higher processing cost | Systems with multiple models, ensemble output validation |
| Human sampling | Actual accuracy on sample | Most accurate proxy for WER, calibrates other metrics | Time-consuming, does not scale to 100% | Regular audit, calibrating automatic metrics |
Combining into a Layered System
The most practical approach for production combines all three layers. Automatic first layer: confidence score and inter-model agreement flag problematic transcripts without human intervention. Second layer: regular human sampling verifies whether automatic metrics genuinely correspond to reality and calibrates thresholds. If average confidence drops and sampling shows the same trend, that is a strong signal, not a coincidence.
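A sketch of the triage logic in the first layer, with illustrative thresholds that should be calibrated against the human sampling layer:

```python
def triage(avg_confidence, agreement_rate,
           conf_threshold=0.75, agree_threshold=0.85):
    """Combine the two automatic signals into a review decision.
    Thresholds are ASSUMPTIONS for illustration; calibrate them against
    the results of regular human spot-checking."""
    flags = []
    if avg_confidence < conf_threshold:
        flags.append("low_confidence")
    if agreement_rate < agree_threshold:
        flags.append("low_agreement")
    # Both signals firing is a strong indicator; one alone is weak.
    priority = {0: "pass", 1: "spot_check", 2: "review"}[len(flags)]
    return priority, flags

print(triage(0.68, 0.79))  # both signals fire → full review
print(triage(0.90, 0.95))  # both signals clear → pass
```

The second layer's job is then to measure how often "pass" transcripts are actually wrong and "review" transcripts are actually fine, and to move the thresholds accordingly.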
When Full WER Is Still Necessary
Selecting or comparing models cannot be done without reference WER on a shared test set. The same applies to evaluating fine-tuning or a regulatory audit that requires documented system accuracy. But these situations are planned and one-off — reference texts for them make sense.
Sources
- NIST STT evaluation methods — https://www.nist.gov/
- Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML 2017.
- Fiscus, J.G. (1997). A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER). IEEE ASRU 1997. doi:10.1109/ASRU.1997.659110