Combining Multiple Transcription Models: Why One Transcript May Not Be Enough
Every transcription model has blind spots — places where it consistently makes errors. A different model will fail elsewhere. Combining their results instead of relying on a single source is an approach borrowed from machine learning that is gradually making its way into speech transcription. This article looks at how the combination works, what determines the final result, and where its limits lie.
Why One Model Is Not Enough
No transcription model is perfect. Each fails differently — and that is the opportunity.
A model trained on lectures handles prepared speech well but transcribes spontaneous conversation less accurately. A model optimized for low latency and streaming may have lower accuracy on specialized terminology. A model with strong support for a given language may still fail on regional accents underrepresented in its training data.
This variety of blind spots has a consequence: the error of model A and the error of model B only partially overlap. Where model A fails, model B may succeed — and vice versa.
Confidence Score as a Guide
Transcription models assign each transcribed word a confidence score — a number from 0 to 1 expressing how certain the model is. A high confidence score does not necessarily mean a correct transcript: the model may be confident in a wrong result because that result matches its statistical expectations, while the correct word lies outside its vocabulary.
Example: faced with a specialized medical term, a model transcribes it as a phonetically similar common word with a confidence score of 0.85 (the model is confident), while the correct word would receive a score of 0.12 (the model barely "sees" it). Combining results from multiple models addresses this problem: a different model may have learned the term from its training data.
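The idea can be sketched in a few lines. This is a minimal illustration, not a real system: the candidate words and confidence scores below are invented for the example, and each candidate is assumed to carry the scores of the models that proposed it.

```python
def pick_candidate(candidates):
    """candidates maps a candidate word to the list of confidence
    scores from the models that proposed it. Return the candidate
    with the highest total confidence across models."""
    return max(candidates, key=lambda word: sum(candidates[word]))

# Model A is confident in the wrong common word; models B and C
# both recognize the specialized term with moderate confidence.
candidates = {
    "hypertension": [0.62, 0.71],  # proposed by models B and C
    "high tension": [0.85],        # proposed by model A
}
print(pick_candidate(candidates))  # -> "hypertension" (1.33 > 0.85)
```

Even this naive aggregation shows the core effect: two moderately confident models can overrule one confidently wrong model.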
The Ensemble Approach — Combining Models
Where the Approach Comes From
Ensemble methods are an established approach in machine learning: combining multiple models produces a better result than the best individual model. Bagging (random forest), boosting (XGBoost), and stacking are examples of methods where different models combine their predictions. The core principle: different models have different errors; aggregating their predictions averages out the errors and increases overall accuracy (Dietterich, 2000).
Applying this to transcription is natural: each transcription model acts as a different classifier over the same audio signal.
Historical Basis: the ROVER Algorithm
ROVER (Recognizer Output Voting Error Reduction) was designed by Fiscus (1997) specifically for combining ASR system outputs. The principle: align the outputs of multiple transcription systems at the word level, then vote on the most probable word for each position. Systems with historically higher accuracy receive greater weight in the vote.
ROVER was the first standardized method for combining transcripts — and still serves as a reference point for more modern approaches.
Three Ways to Combine Results
Majority voting: The word on which the majority of models agree wins. Simple, but does not account for different model quality. If three of four models transcribe incorrectly, the result is wrong — even if the fourth model had it right.
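For a single aligned word position, majority voting is a one-liner. A minimal sketch with illustrative inputs:

```python
from collections import Counter

def majority_vote(words):
    """Return the most common word among aligned model outputs
    for one word position."""
    return Counter(words).most_common(1)[0][0]

# Aligned outputs from four models for the same position:
print(majority_vote(["research", "research", "researcher", "research"]))
# -> "research"
```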
Weighted voting: Models are assigned weights based on their historical accuracy for the given recording type. A model with higher accuracy gets more influence on the result. This approach is more sophisticated and reduces the risk that several weaker models outvote a stronger one.
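Weighted voting is a small extension of the same idea. In this sketch the weights are illustrative placeholders for historical accuracy; a real system would estimate them per recording type.

```python
from collections import defaultdict

def weighted_vote(words, weights):
    """words[i] is model i's word for this position;
    weights[i] is that model's historical accuracy."""
    totals = defaultdict(float)
    for word, weight in zip(words, weights):
        totals[word] += weight
    return max(totals, key=totals.get)

# Three weaker models agree on the wrong word, but the strongest
# model's weight tips the vote (weights are made up for the example).
words = ["tension", "tension", "tension", "hypertension"]
weights = [0.20, 0.20, 0.20, 0.85]
print(weighted_vote(words, weights))  # -> "hypertension" (0.85 > 0.60)
```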
Language model as arbiter: A large language model (LLM) receives the results of all models as input and selects the most probable variant considering linguistic context. It can distinguish "scientific research" from "scientific researcher" by understanding which variant makes more sense in the context of surrounding sentences. This is the most sophisticated approach.
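One way to use an LLM as arbiter is to present all variants in a single prompt. The sketch below only builds the prompt; the actual LLM call is provider-specific and deliberately omitted, and the model names and texts are invented for the example.

```python
def build_arbiter_prompt(variants, context):
    """Format competing transcript variants into a prompt asking
    an LLM to pick the most plausible one. The call to the LLM
    itself is not shown here."""
    lines = [
        "Several speech recognition systems transcribed the same passage.",
        f"Surrounding context: {context}",
        "Candidate transcripts:",
    ]
    for name, text in variants.items():
        lines.append(f"- {name}: {text}")
    lines.append("Return the single most plausible transcript, "
                 "correcting obvious recognition errors.")
    return "\n".join(lines)

prompt = build_arbiter_prompt(
    {"model_a": "scientific research shows",
     "model_b": "scientific researcher shows"},
    "a paragraph summarizing study results",
)
print(prompt)
```

The design point is that the arbiter sees all variants plus surrounding context at once, so it can judge which wording fits the sentence rather than voting word by word.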
How Combining Works in Practice
Aligning Results
Different models return differently-length transcripts with shared but not identical timestamps. "London" may be in model A between 00:01:12.0 and 00:01:12.8, in model B between 00:01:11.9 and 00:01:12.9. Before voting, results must be synchronized at the word level — edit distance algorithms (Levenshtein) or DTW (Dynamic Time Warping) perform this alignment automatically.
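A word-level alignment along these lines can be sketched with Python's standard-library difflib, which implements a related longest-matching-subsequence algorithm (production systems would more likely use Levenshtein alignment or DTW over timestamps, as described above):

```python
import difflib

def align_words(a_words, b_words):
    """Pair up words from two transcripts. Returns (word_a, word_b)
    tuples; None marks an insertion or deletion on one side."""
    sm = difflib.SequenceMatcher(a=a_words, b=b_words)
    pairs = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            pairs.extend(zip(a_words[i1:i2], b_words[j1:j2]))
        elif op == "replace":
            seg_a, seg_b = a_words[i1:i2], b_words[j1:j2]
            n = max(len(seg_a), len(seg_b))
            seg_a += [None] * (n - len(seg_a))  # pad the shorter side
            seg_b += [None] * (n - len(seg_b))
            pairs.extend(zip(seg_a, seg_b))
        elif op == "delete":
            pairs.extend((w, None) for w in a_words[i1:i2])
        else:  # insert
            pairs.extend((None, w) for w in b_words[j1:j2])
    return pairs

a = "we arrived in London yesterday".split()
b = "we arrived in london yesterday evening".split()
print(align_words(a, b))
```

After alignment, each tuple is one "position" that the voting methods above can operate on.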
Language Model as the Merge Layer
A large language model (LLM) can function as an "arbiter" between multiple transcript variants. It receives several versions of the same passage as input and produces final text. It does not just select which word to accept — it works with the context of the entire sentence and entire paragraph.
In the Czech Transcription System, the merge layer takes multiple transcript variants and produces one final text. It considers the linguistic context, terminology supplied by the user, and grammatical rules — and selects the most probable wording for each passage of the recording.
Output of Combining
One transcript as the final result with the added value of combination: the strengths of each model appear in the passages it handles best. Optionally, a confidence score is available as a summary of model agreement — words where models disagreed have lower scores and are candidates for verification.
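A combined confidence score of this kind can be derived directly from model agreement. In this sketch, each row holds the aligned words from all models at one position, and the score is simply the fraction of models agreeing with the majority word; real systems may weight this by per-model confidence.

```python
from collections import Counter

def agreement_scores(aligned_rows):
    """For each aligned position, return (majority_word, score),
    where score is the fraction of models agreeing with it."""
    scores = []
    for row in aligned_rows:  # one word per model at this position
        word, count = Counter(row).most_common(1)[0]
        scores.append((word, count / len(row)))
    return scores

rows = [
    ["scientific"] * 4,                                    # full agreement
    ["research", "research", "researcher", "research"],    # one dissenter
]
print(agreement_scores(rows))
# -> [('scientific', 1.0), ('research', 0.75)]
```

Positions scoring below some threshold (say 0.75) are natural candidates for human verification.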
Advantages and Limits — a Realistic Perspective
Demonstrated Advantages
Research consistently shows a lower WER for the combined system than for the best individual model under comparable conditions; Fiscus (1997) documented a 5–15% reduction in WER compared to the best individual system. Robustness: failure or significantly worse performance of one model does not destroy the result — the others compensate. Specialized terminology: a model that does not know a specific term can be "outvoted" by a model that has it in its training data.
Clear Limits
Shared errors: If all models share the same blind spot — for example all are weak on a specific regional accent — combining will not help. The result will be wrong across all models.
Cost: The more models you involve, the higher the computational costs and API call costs. This is a real constraint for routine or high-volume processing — and the reason why the ensemble approach is typically used selectively (only where accuracy really matters).
Processing time: Even with parallel processing, coordination, collecting results, and combining add latency. For recordings where response speed matters, the ensemble approach is less suitable.
Deployment complexity: Technically far more demanding than a single API endpoint.
When Combining Is Worth It
Worth it: recordings where an error in transcription is costly (legal, medical, archival records); specialized terminology where no single model is reliable; low tolerance for errors — organizations requiring maximum accuracy.
Not worth it: simple monologues in clean environments with standard vocabulary, where one good model achieves acceptable accuracy; situations with a requirement for minimum latency.
Conclusion
Combining models is the logical response to the fact that no model is perfect. The ensemble approach, refined in machine learning, finds its natural application in speech transcription — and results confirm it. Combining multiple transcript variants through a merge layer can achieve accuracy in difficult passages that no individual model would reach.
The cost of higher accuracy is real: higher computational costs, longer processing time, greater complexity. It is a tradeoff each user must weigh against their own priorities and accuracy requirements.
If you need to improve accuracy on proper names and specialized terms, see A06. For a critical look at the accuracy metrics that combining improves, see A07. And if you want to compare services in practice, see A31.
Sources
- Fiscus, J.G. (1997). A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). IEEE ASRU 1997. [doi:10.1109/ASRU.1997.659110]
- Dietterich, T.G. (2000). Ensemble methods in machine learning. LNCS 1857. [doi:10.1007/3-540-45014-9_1]
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv. [doi:10.48550/arXiv.2212.04356]