Machine Learning Behind Transcription: How Models Train on Speech and Why Data Matters
You upload an audio file and within seconds receive an accurate transcript. Behind this result stands a model that learned from hundreds of thousands of hours of human speech — and yet it makes errors on words any child would know. Why this happens, how sound travels from microphone to text, and why some languages present a greater challenge for transcription models than others — that is the subject of this article.
From Microphone to Text: How the Model Processes Sound
Automatic speech recognition (ASR) does not work by the model "listening" to sound the way we do. The entire process begins by converting sound into a mathematical form that a neural network can train on.
Mel Spectrograms — the Model Does Not Hear Sound, It Sees Its Image
A raw audio signal is just a sequence of numbers — amplitudes of pressure waves over time. For the model, this form is too "raw." So the sound is first converted into a mel spectrogram: a two-dimensional image where the horizontal axis represents time and the vertical axis represents frequency on a special mel scale.
The mel scale mimics human hearing perception: it distinguishes low frequencies more finely (which carry more speech information) and high frequencies more coarsely. The analogy with musical notation is apt — a spectrogram shows which "tones" (frequencies) sound at which moment. Whisper, for example, processes 30-second segments converted into an 80-band mel spectrogram [Radford et al. 2022].
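To make the conversion concrete, here is a minimal mel-spectrogram computation in plain NumPy. This is an illustrative sketch: the frame size, hop length, and triangular-filterbank construction are common defaults, not Whisper's exact pipeline.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale:
    # dense at low frequencies, sparse at high ones.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)   # rising slope
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Slice the signal into overlapping windowed frames.
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Project each frame's spectrum onto the mel filters: (time, n_mels).
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
spec = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frames, 80)
```

The output is the two-dimensional "image" described above: one row per time frame, one column per mel band, ready to be fed to a neural network.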
CTC — Alignment Without Explicit Time Labels
Training a model the classical way would require precise time labels for every word: "word X was spoken from 1.23 s to 1.87 s." Manual creation of such annotations is extremely costly. CTC (Connectionist Temporal Classification) bypasses this problem.
A model with CTC predicts, for each time window, a probability for every character, including a special "blank" token. After training, output sequences such as "ccc-aa-ttt" collapse into "cat": repeated characters are merged first, then blank tokens are removed. It is therefore sufficient to pair an entire audio segment with its transcript, with no precise alignment required.
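The collapsing rule is simple enough to sketch directly. The blank character "-" and the toy strings are illustrative:

```python
def ctc_collapse(tokens, blank="-"):
    """Collapse a CTC output: merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:  # skip repeats and blank tokens
            out.append(t)
        prev = t
    return "".join(out)

print(ctc_collapse("ccc-aa-ttt"))     # cat
print(ctc_collapse("hh-e-ll-ll-oo"))  # hello
```

Note that the blank token is what makes genuine double letters possible: "ll-ll" survives as "ll" because the blank separates the two runs.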
The limitation of CTC is that predictions in individual time windows are made conditionally independently of one another, so the model handles long-range dependencies and context poorly.
Transformer Architecture — the Entire Sentence at Once
Modern transcription models use the transformer architecture, which handles context more elegantly. The encoder reads the entire mel spectrogram at once using the self-attention mechanism — each time step "knows about" all others. The decoder then generates text token by token, with access to the entire encoded audio at every step.
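The core of self-attention, every time step mixing information from all others, fits in a few lines of NumPy. This is a single-head sketch with random weights, not a full transformer:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of spectrogram frames. Every output row is a
    # weighted mixture of ALL time steps, which is how context flows.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over time steps
    return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 8                                # 6 time steps, 8-dim features
X = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # (6, 8)
```

The (T, T) score matrix is the key point: unlike CTC, no time step is isolated; each one explicitly weighs every other before producing its output.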
This allows the model to resolve initial ambiguity once it reads the rest of the sentence. The word "bank" in a financial context is different from "bank" in a river context — and a transformer can work with that. Research on wav2vec 2.0 [Baevski et al. 2020] also showed that pre-training on unlabelled data (without transcripts) significantly improves performance on small annotated datasets.
Training Data — the Foundation of Model Performance
A model is only as good as the data it trained on. This rule holds for machine learning in general, but doubly so for ASR.
How Much Data a Model Needs
For a basic model with reasonable accuracy, tens of thousands of hours are needed; the Mozilla Common Voice project, for example, has collected over 20,000 recorded hours across all of its languages combined. For a robust model that handles diverse environments, accents, and topics, the figure is 100,000 hours and more.
Whisper was trained on 680,000 hours of multilingual speech downloaded from the internet [Radford et al. 2022]. A model with less data works well on clean studio audio but fails in noisy environments, on phone calls, or with unusual accents. Data diversity is as important as volume: the model needs to see different age groups, dialects, topics, and environments.
Data Quality Is as Important as Quantity
Poorly transcribed training data is more dangerous than the absence of data — the model learns errors and reproduces them. Whisper was trained on automatically filtered web subtitles of varying quality. The result is occasional word substitutions that sound similar or domain-specific errors.
"Gold standard" data — manually transcribed and verified — is expensive but highly valuable from a training perspective. Automatically generated transcripts are cheaper but propagate errors from the parent model into subsequent generations.
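One standard way to quantify how far a transcript deviates from a gold-standard reference is word error rate (WER): the word-level edit distance divided by the reference length. The article does not name a specific metric, so this formulation is an assumption, though it is the usual one in ASR:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (Levenshtein DP table).
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```

The same measure applied to automatically generated training transcripts gives a rough sense of how much noise a model would inherit from them.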
Under-Resourced Languages
Speakers of widely spoken languages far outnumber speakers of smaller languages, and online content in dominant languages is even more abundant. For large multilingual models, less training data tends to be available for under-resourced languages. Practical impact: for a comparably recorded audio file, under-resourced languages typically perform worse and require more manual review.
Language-specific properties compound the challenge: rich morphology, extensive inflection, and complex word formation mean that tokenization (splitting text into units the model works with) is more complex. Words are on average longer, and colloquial forms common in everyday speech tend to be underrepresented in training data.
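A toy greedy longest-match segmenter illustrates why morphology matters for tokenization. The vocabulary here is invented for illustration; real systems learn their subword units (e.g. via byte-pair encoding) from data:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation into subword units (toy sketch)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining piece first; fall back to a single
        # character if nothing in the vocabulary matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"work", "ing", "er", "s", "un", "able"}
print(subword_tokenize("workings", vocab))  # ['work', 'ing', 's']
```

A morphologically rich word form that the vocabulary does not cover well fragments into many short pieces, which makes the model's prediction task longer and harder; this is exactly the disadvantage under-resourced, highly inflected languages face.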
Transfer Learning — Standing on the Shoulders of Giants
Training a model from scratch is extremely expensive: months of compute time and costs running into millions of dollars. Transfer learning offers a way to reduce these costs radically.
A model pre-trained on a large general dataset (Whisper on 680,000 hours) captures general properties of speech — phonemes, prosody, transitions between sounds. Fine-tuning then teaches this model on a small, specialized dataset: a thousand hours of medical transcripts, a few hundred hours of court proceedings, or recordings from a manufacturing floor.
The result is a model that knows general speech and additionally specialized terminology — without needing to train from scratch. Costs drop from millions to thousands of dollars.
The risk of fine-tuning is overfitting to a narrow domain: the model excels at medical vocabulary but deteriorates on everyday conversation. That is why a gentler approach is sometimes chosen — custom terminology glossaries that guide the model to prefer specific expressions without changing its weights.
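A glossary can be applied as simple rescoring of candidate transcripts. This is an illustrative sketch: production systems usually bias the decoder's beam search directly rather than rescoring final strings, and the candidate sentences, scores, and boost value below are invented:

```python
def rescore_with_glossary(candidates, glossary, boost=0.5):
    """Pick the candidate whose model score, plus a bonus per matched
    glossary term, is highest. Model weights are never touched."""
    def score(cand):
        text, base = cand
        bonus = boost * sum(term in text.lower() for term in glossary)
        return base + bonus
    return max(candidates, key=score)[0]

# Hypothetical decoder outputs with log-probability scores.
candidates = [("the patient has nephritis", -4.1),
              ("the patient has the fritz", -3.9)]
print(rescore_with_glossary(candidates, {"nephritis"}))
# the patient has nephritis
```

The acoustically slightly better hypothesis loses to the one containing the domain term, which is the gentler alternative to fine-tuning described above.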
Why Combining Models Outperforms Individuals
Even the best individual model has blind spots. A model trained on phone calls stumbles over background fan noise; a model trained on podcasts fails on compressed telephone audio. Crucially, their errors are largely uncorrelated: where model A makes a mistake, model B often does not.
The ensemble approach exploits this: if most models agree on a transcription, that transcription is probably correct. The classic ensemble algorithm is ROVER (Recognizer Output Voting Error Reduction), where models "vote" word by word. A more modern approach uses a large language model as an arbiter — selecting or synthesizing the most meaningful variant with regard to the context of the entire sentence.
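A heavily simplified version of the word-by-word vote can be sketched as follows. Real ROVER first aligns hypotheses of different lengths into a word transition network; this toy version assumes they are already aligned to equal length:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Word-by-word majority vote over pre-aligned hypotheses
    (simplified ROVER; the alignment step is omitted)."""
    split = [h.split() for h in hypotheses]
    assert len({len(s) for s in split}) == 1, "toy version needs equal lengths"
    # At each word position, keep whatever most models agree on.
    voted = [Counter(words).most_common(1)[0][0] for words in zip(*split)]
    return " ".join(voted)

print(rover_vote(["deep learning is fun",
                  "deep yearning is fun",
                  "deep learning his fun"]))  # deep learning is fun
```

Because each engine errs at a different position, the majority recovers the correct sentence even though no single hypothesis was error-free.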
This is precisely how a multi-model transcription system can work: combining outputs from multiple transcription engines. The merging layer then selects the most likely correct variant for each segment from all available transcripts. The resulting accuracy can be higher than any individual model — especially on difficult material such as noisy recordings, crosstalk, or specialized terminology.
Conclusion
Automatic transcription is not magic; it is the result of hundreds of thousands of hours of training data, decades of research, and engineering effort. For some languages there is still clear room for improvement: morphological complexity and smaller training data volumes lead to higher error rates than for well-resourced languages. Multilingual models and ensemble approaches improve the situation significantly, but even for them, recording quality and the domain relevance of training data remain decisive. How to compare services in practice is described in A31; what specifically influences error rate is analysed in A07; and working with custom terminology is covered in A06.
Sources:
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv. https://doi.org/10.48550/arXiv.2212.04356
- Baevski, A. et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv. https://doi.org/10.48550/arXiv.2006.11477