How Speech-to-Text Works: From Sound Waves to Written Words
Speech-to-text is simultaneously a physical, mathematical, and linguistic process. A microphone captures air pressure waves, a converter turns them into numbers, a neural network searches for patterns, and a language model estimates the most probable sequence of words. Why this sometimes works flawlessly and other times produces surprising results is the subject of this article — no prior technical knowledge required.
What Happens When You Press the Record Button
Speaking is a physical phenomenon. Vocal cord vibrations cause pressure changes in the air that propagate through space as waves. A microphone captures these pressure changes and converts them into an electrical signal. An analog-to-digital converter then samples the signal — typically 16,000 or 44,100 times per second — recording each moment as a number.
The result is a numerical sequence: a timeline of pressure values. That is all the transcription algorithm works with. No words, no language, just numbers in time.
Why does the sampling rate matter? According to the Nyquist theorem, the sampling rate must be at least twice the highest frequency in the signal we want to capture. Nearly all of the information that makes speech intelligible lies below roughly 8 kHz — which is why 16 kHz sampling is sufficient for speech transcription. Higher rates (44.1 kHz for music) add no accuracy to the transcript, only file size.
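The Nyquist relationship can be sketched in a few lines of Python, using the sampling rates from the paragraph above:

```python
# Sketch: why 16 kHz sampling suffices for speech.
# The Nyquist theorem: a sampling rate fs can represent
# frequencies only up to fs / 2.

def nyquist_limit(sample_rate_hz: int) -> float:
    """Highest frequency (Hz) representable at the given sampling rate."""
    return sample_rate_hz / 2

# 16 kHz covers the ~8 kHz band that carries speech information;
# 44.1 kHz (CD quality) reaches 22.05 kHz, useful for music, not speech.
for fs in (16_000, 44_100):
    print(f"{fs} Hz sampling -> frequencies up to {nyquist_limit(fs):.0f} Hz")
```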
The Spectrogram — How the Algorithm Sees Speech
A raw numerical sequence is not enough. Before a neural network can look for words, it must "translate" the audio into a form it can understand.
A Fourier transform breaks each short segment of audio (typically 25 milliseconds) into its frequency components. The result tells us: at this moment, this combination of frequencies is present at this intensity. The visual representation of this decomposition over time is called a spectrogram — and for the transcription algorithm, it is the image of speech.
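A minimal spectrogram can be computed with nothing more than NumPy's FFT. This sketch uses the 25 ms frames mentioned above plus a 10 ms hop between frames; the hop length is a common convention, not a fixed standard:

```python
import numpy as np

def spectrogram(signal: np.ndarray, sample_rate: int = 16_000,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Magnitude spectrogram: |FFT| of overlapping windowed frames."""
    frame = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)       # 160 samples
    window = np.hanning(frame)                   # tapers frame edges
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    # rfft keeps only the non-negative frequencies (0 .. Nyquist)
    return np.abs(np.fft.rfft(frames, axis=1))

# A pure 440 Hz tone should light up the frequency bin at 440 Hz.
t = np.arange(16_000) / 16_000               # one second of audio
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
freqs = np.fft.rfftfreq(400, d=1 / 16_000)   # bin -> frequency in Hz
print("peak at", freqs[spec[0].argmax()], "Hz")  # -> peak at 440.0 Hz
```

Each row of the result is one 25 ms "snapshot" of the frequency content — stacked together, these rows are the spectrogram the model sees.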
Speech sounds have characteristic spectral fingerprints: "a" sounds different from "e," "s" different from "m." Classic pipelines feed the model preprocessed features called MFCC (mel-frequency cepstral coefficients), while newer end-to-end models often use log-mel spectrograms directly; both mimic how the human ear perceives frequencies — non-linearly, with greater sensitivity to lower frequencies (Gold, Morgan & Ellis, 2011).
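The non-linear mel scale at the heart of these features has a simple closed form. The HTK-style formula below is one widely used convention, not the only one:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Common (HTK-style) Hz-to-mel conversion used in MFCC pipelines."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The scale is nearly linear at low frequencies and compresses high ones:
# 1000 Hz maps to ~1000 mel, but 8000 Hz maps to only ~2840 mel.
for f in (100, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

The compression at the top of the range mirrors the ear: the step from 7 to 8 kHz sounds far smaller to us than the step from 100 to 1100 Hz, even though both are 1000 Hz wide.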
An analogy: if audio is like text, a spectrogram is like breaking text down into individual letters with information about their size and weight. From letters you can build words — but only if you know how to read them.
From Sound Patterns to Words — Neural Networks in Action
The neural network receives a sequence of spectral features and searches for the word sequence that best matches. This process is not deterministic — it is a statistical estimate.
The CTC Principle — How the Model Finds Words Without Precise Alignment
The key algorithm that enabled modern transcription is called CTC (Connectionist Temporal Classification). It was proposed in 2006 and allows training a transcription model without explicit alignment between audio and text — the model derives the alignment itself from the data (Graves et al., 2006).
In practice, the model "walks through" the audio recording and assigns probabilities to all possible phonemes or tokens for each short segment. The sequence with the highest overall probability becomes the final transcript. Correct transcriptions are those the model "believes" to be most probable — not necessarily those that were actually spoken.
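The collapsing rule at the core of CTC decoding is simple enough to show directly. This toy sketch assumes a single best frame-level path has already been picked (greedy decoding), with `_` standing for CTC's special blank symbol:

```python
def ctc_collapse(path: list, blank: str = "_") -> str:
    """Collapse a frame-level CTC path: merge repeated symbols, drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many different frame-level paths collapse to the same word:
print(ctc_collapse(list("hh_e_lll_lo")))  # -> "hello"
print(ctc_collapse(list("h_eel_llloo")))  # -> "hello"
```

This many-to-one mapping is the trick: the model never needs to know exactly when each letter was spoken, only that summing the probabilities of all paths that collapse to "hello" makes "hello" the most probable transcript.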
Training — Why the Data Behind the Model Matters
A model's capability depends on its training: the model learns from pairs of (audio, text) — millions of hours of recordings with corresponding transcripts. The more data and the more diverse it is, the better the model learns to generalize to new recordings.
Whisper, the transcription model released by OpenAI, was trained on 680,000 hours of multilingual audio from the web. The result is a model with significantly higher accuracy for English (where the most data was available) and acceptable accuracy for languages like Czech — where less data was available (Radford et al., 2022).
Non-English Languages as a Challenging Test Case
Transcription models were developed primarily in English-speaking environments and trained on English speech. Applying them to other languages is not a matter of translation — it is a transition to a different type of language.
Morphology — One Root, Dozens of Forms
English is an analytic language: grammatical relationships are expressed through word order and auxiliary words. Word forms barely change (dog / dogs — that is essentially all).
Many languages, such as Czech, are highly inflected: grammatical relationships are encoded in word endings. A single noun may appear in seven or more distinct forms depending on grammatical case. For the transcription algorithm, each form is a different sound pattern that must be recognized separately. The limited morphological variability of English means that models trained on it carry assumptions that do not transfer well to morphologically rich languages.
Data Volume as a Critical Factor
English speech is represented in training datasets by thousands of hours of validated recordings. Languages like Czech in the Mozilla Common Voice database — one of the main open sources for ASR research — have significantly smaller volumes. A smaller training set means a weaker foundation for generalizing to new speakers, dialects, and vocabulary.
Commercial providers have their own proprietary training sets — their volume for non-English languages is typically not publicly disclosed.
Why Different Systems Hear the Same Thing Differently
WER (Word Error Rate) is the fundamental accuracy metric: it counts word-level errors relative to a reference transcript. Formula: WER = (substitutions + deletions + insertions) / words in the reference × 100. Because insertions also count as errors, WER can exceed 100%.
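The formula translates directly into a word-level edit distance. A minimal implementation, not tied to any particular toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(ref) * 100

# One substitution out of four reference words -> 25% WER.
print(wer("the cat sat down", "the cat sat town"))  # -> 25.0
```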
Two different systems will transcribe the same sentence differently. Not because one is necessarily worse — but because each was trained on different data and in different ways. Their blind spots differ: model A struggles with specialized terminology, model B with spontaneous conversation.
This insight led to an approach that combines results from multiple models simultaneously: outputs are aligned, and where models disagree, a language layer decides based on context. This way, an error from one model may not affect the final transcript if another model succeeded at the same point. Model fusion is discussed in more detail in A13.
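As an illustration of the voting idea only — real fusion systems also perform the alignment step, which is more involved — a toy majority vote over outputs that are already aligned word by word looks like this:

```python
from collections import Counter

def vote(aligned_hypotheses: list) -> list:
    """Pick, at each aligned position, the word most models agree on.
    Assumes hypotheses were already aligned to equal length; real
    fusion systems compute that alignment explicitly."""
    result = []
    for words in zip(*aligned_hypotheses):
        result.append(Counter(words).most_common(1)[0][0])
    return result

# Model B's error at position 2 is outvoted by models A and C.
hyps = [["speech", "to", "text"],
        ["speech", "two", "text"],
        ["speech", "to", "text"]]
print(" ".join(vote(hyps)))  # -> "speech to text"
```

The practical point survives the simplification: a mistake made by one model at one position disappears from the final transcript as long as the other models got that position right.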
The transcription result also depends on post-processing: punctuation, diacritics, and capitalization are added after the initial transcription — and differences between systems emerge here as well. How to work with punctuation is covered in A08.
Conclusion
Speech-to-text is an elegant chain of transformations: pressure waves to numbers to spectral fingerprints to word probabilities to text. At every step an error can occur, and each step depends on the quality of the previous one.
Understanding this chain helps set realistic expectations: a good recording in a quiet environment will produce a better result than a poor recording processed by even the best model. And different systems hear the same thing differently — not because one is bad, but because each was trained differently.
For those who want to go deeper: how to improve your recording before transcription is covered in A12; speaker diarization — assigning utterances to specific voices — is explained in A04; and how to set up a custom terminology list for better accuracy is discussed in A06.
References
- Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint. [doi:10.48550/arXiv.2212.04356]
- Graves, A., Fernández, S., Gomez, F. & Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of ICML 2006. [doi:10.1145/1143844.1143891]
- Gold, B., Morgan, N. & Ellis, D. (2011). Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley. [ISBN 978-0-470-19536-9]
- Mozilla Common Voice — Czech language dataset. https://commonvoice.mozilla.org/cs
- NIST Speech Recognition benchmarks. https://nist.gov/speech-recognition