
The Future of Speech Transcription: Where Today's Technology Hits Its Limits

Speech transcription has changed more in the last five years than in the previous thirty. Transformer architectures have replaced two-stage systems, real-time latency has dropped to hundreds of milliseconds, and a single foundation model can now transcribe hundreds of languages. Yet there remain areas where even the best current systems do not work reliably — and these gaps show most clearly where development is heading next.


From Two-Stage Systems to End-to-End Transcription

Until around 2020, virtually every commercial transcription system was built on a two-stage principle. The first stage — an acoustic model — processed the audio signal and converted it into phonemes or other acoustic units. The second stage — a language model — assembled these units into meaningful words and sentences.

The problem with this architecture is obvious: errors accumulate. If the acoustic model misidentifies a phoneme, the language model receives flawed input and must reconstruct a word that was never actually there. Moreover, each component required separate training and tuning — adapting the system to a new language or domain meant working on two separate parts.

End-to-end transformer models solved this problem elegantly. A single model maps directly from Mel spectrograms (a frequency-domain representation of audio) to text tokens — without intermediate steps. OpenAI's Whisper, described by Radford et al. in 2022, was trained on 680,000 hours of multilingual audio data from the internet. The result is a system significantly more robust to noise, different accents, and varying recording conditions than previous generations of tools.
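The front end of such a model can be sketched in plain NumPy: frame the waveform, take a short-time Fourier transform, and project the power spectrum through a triangular Mel filterbank. The parameter values below (400-sample FFT, 160-sample hop, 80 Mel bands at 16 kHz) mirror common ASR defaults but are illustrative, not any particular model's exact pipeline:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Minimal sketch: windowed STFT power -> triangular Mel filterbank -> log."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2 + 1)

    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log compression, as is typical before feeding the encoder.
    return np.log(power @ fbank.T + 1e-10)             # (frames, n_mels)

# One second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000
spec = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (98, 80): 98 frames, 80 Mel bands
```

The resulting matrix — time on one axis, Mel bands on the other — is the entire input the encoder sees; there is no phoneme layer anywhere in between.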


Foundation Models — A New Class of Systems

Whisper opened a path that others followed. Google's Universal Speech Model (USM), described by Zhang's team in 2023, was trained on 12 million hours of audio data and covers over 300 languages. The key finding was that massive multilingual training leads to knowledge transfer between languages — outperforming specialized monolingual models even in languages where the model has fewer examples.

Meta went further still. Their Massively Multilingual Speech (MMS) project, published by Pratap's team in 2023, covers over 1,100 languages. It uses, among other sources, Bible translations as training data — one of the few texts available in so many languages simultaneously. For languages with very few digital resources, this is practically the only path to any automatic transcription at all.

What does this mean in practice? Languages with moderate representation in these models — neither as dominant as English nor as scarce as endangered languages — benefit from shared representations across related language families. A model that knows related languages well understands a given language better than if trained on that language alone. Between 2020 and 2023, the estimated gain amounts to a double-digit relative reduction in word error rate — though specific numbers vary by test set and conditions.


Latency and Real-Time Transcription

The second major shift occurred not in accuracy but in speed. Streaming transcription — displaying text as speech is being delivered — was practically usable only for English just three years ago, and with non-negligible delay. Today, production systems like Deepgram Nova-2 or AssemblyAI achieve streaming latency of 100–200 milliseconds.

This number matters: human perception of time delay sits at approximately 200–300 ms. Below this threshold, the transcription feels synchronous with the speaker; above it, it starts to feel like a delayed response. Modern systems manage to cross this threshold — but at a cost. Streaming models typically work with a smaller context window than batch models, because they cannot wait for the end of a sentence. The result is slightly lower accuracy, especially for sentences with unexpected endings.
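The trade-off can be illustrated with a minimal streaming windower: the recognizer only ever decodes the current window, never the full utterance. The 300 ms window and 100 ms step below are invented for the example, not any vendor's real configuration:

```python
from collections import deque
from typing import Iterable, Iterator, List

def stream_windows(samples: Iterable[int],
                   window: int = 4800,     # 300 ms at 16 kHz (illustrative)
                   step: int = 1600        # 100 ms at 16 kHz (illustrative)
                   ) -> Iterator[List[int]]:
    """Yield overlapping audio windows as samples arrive.

    A streaming recognizer decodes each window with only this much
    context, which is why its accuracy trails a batch model that sees
    the whole utterance before emitting anything.
    """
    buf: deque = deque(maxlen=window)
    since_emit = 0
    for s in samples:
        buf.append(s)
        since_emit += 1
        if len(buf) == window and since_emit >= step:
            yield list(buf)
            since_emit = 0

# Feed one second of fake 16 kHz audio, sample by sample.
chunks = list(stream_windows(range(16000)))
print(len(chunks))  # → 8 windows emitted, one every 100 ms after warm-up
```

Each yielded window would go to the decoder immediately; a batch system would instead wait for all 16,000 samples and decode once with full context.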

Parallel development is occurring in edge computing — transcription directly on the device without a cloud connection. Whisper tiny (39 million parameters) and base (74 million parameters) can run in real time on modern mobile processors. Apple's Neural Engine in current iPhones processes Whisper base models significantly faster than real time. The implications are threefold: data privacy, offline availability, and zero network latency.


Multimodality — Context Beyond Audio

Transcription systems have long worked exclusively with the audio signal. That is changing. Multimodal models are beginning to natively process audio as one of their inputs — without a separate ASR step. This opens possibilities that the two-stage approach could not offer: the model can take into account visual context (presentation slides in the background, the speaker's lip movements) or the textual context of a document being discussed.

In practice, the most widely used application today is personalization through voice enrollment. The system receives 30–60 seconds of a specific speaker's voice sample and adapts the acoustic model to their characteristics. The result is notably lower error rates for atypical accents or unusual voice qualities. This technique finds practical use particularly in medical dictation or environments where the same user transcribes regularly.
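A minimal sketch of the enrollment idea, assuming frame-level feature vectors are already available: average them into a speaker embedding and compare embeddings by cosine similarity. A real system would use a dedicated speaker-embedding network rather than raw features, and the random "voices" here are synthetic stand-ins:

```python
import numpy as np

def enroll(frames: np.ndarray) -> np.ndarray:
    """Average frame-level feature vectors (n_frames, dim) into one
    unit-length speaker embedding."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two enrolled embeddings."""
    return float(emb_a @ emb_b)

rng = np.random.default_rng(0)
voice = rng.normal(size=256)                          # a speaker's "signature"
session1 = voice + 0.1 * rng.normal(size=(50, 256))   # two noisy sessions
session2 = voice + 0.1 * rng.normal(size=(40, 256))   # of the same speaker
other = rng.normal(size=(40, 256))                    # a different speaker

same = similarity(enroll(session1), enroll(session2))
diff = similarity(enroll(session1), enroll(other))
```

With enrollment, `same` lands near 1.0 while `diff` stays near 0 — the signal an adaptive system uses to decide which personalized model to apply.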


LLM Post-Processing — Benefits and Hidden Risks

Large language models have become a standard component of transcription pipelines. The raw ASR model output lacks punctuation, capitalization, and formatting — an LLM corrects this reliably and quickly. It can also remove filler words, normalize numerical values, dates, and abbreviations, and extract action items or named entities.

But the hidden risk is real: an LLM "correcting" a transcript can insert plausible but factually incorrect content. The model does not know the recording — it works with text and tries to make it coherent. A speaker's name, company name, or technical term that sounds similar to another expression may be different in the corrected version from what was actually said. For citation-sensitive content — court hearing transcripts, research interviews, meeting records — it is therefore safer to work with the raw ASR output without LLM post-processing and to make corrections manually.
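Where LLM post-processing is used anyway, a cheap safeguard is to diff the raw and edited transcripts and flag every substituted word for human review, while letting harmless case and punctuation changes pass. A sketch using Python's standard difflib; the example sentences are invented:

```python
import difflib

def _norm(word: str) -> str:
    """Ignore case and surrounding punctuation when comparing words."""
    return word.strip(".,;:!?\"'").casefold()

def flag_substitutions(raw: str, edited: str):
    """Return (raw, edited) word pairs the post-editor substituted.

    Punctuation and capitalization changes are not flagged, so only
    real word swaps — the dangerous kind for names and technical
    terms — surface for review against the audio.
    """
    raw_words, edited_words = raw.split(), edited.split()
    sm = difflib.SequenceMatcher(a=[_norm(w) for w in raw_words],
                                 b=[_norm(w) for w in edited_words])
    return [(" ".join(raw_words[i1:i2]), " ".join(edited_words[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op == "replace"]

raw = "meeting with jan novak about the kvasar project next tuesday"
edited = "Meeting with Jan Novák about the Quasar project next Tuesday."
print(flag_substitutions(raw, edited))
# → [('novak', 'Novák'), ('kvasar', 'Quasar')]
```

Both flagged pairs are exactly the kind of substitution a reviewer must confirm against the recording — one is a legitimate diacritic fix, the other may be a hallucinated "correction."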


What Remains Unsolved

Despite all progress, three areas exist where current systems still fail:

Strong accents and non-standard pronunciation. Non-native speakers — especially those with pronounced accents from distant language families — cause a notable increase in error rates. Annotated training data for these combinations is scarce, and the problem therefore improves more slowly than for accents well represented in the training data.

Domain-specific terminology. Medical abbreviations, legal terms, proprietary product names, and technical terminology remain challenging. Solutions exist today — custom vocabulary, dictionary hints, model fine-tuning — but they require additional work. A new term without model adaptation will be transcribed phonetically, creating nonsensical expressions in a specialized text.
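The dictionary-hint idea can be approximated in a few lines: snap near-miss words in the raw output to a known term list with fuzzy matching. This is a crude stand-in for the custom-vocabulary features real ASR APIs expose; the medical terms and the 0.75 cutoff are illustrative:

```python
import difflib

# Illustrative domain vocabulary a user might supply.
DOMAIN_TERMS = ["tachycardia", "metoprolol", "echocardiogram"]

def apply_vocabulary(transcript: str, terms=DOMAIN_TERMS, cutoff=0.75) -> str:
    """Replace words that closely resemble a known domain term.

    `cutoff` controls aggressiveness: too low and ordinary words get
    rewritten, too high and real misrecognitions slip through.
    """
    out = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), terms, n=1, cutoff=cutoff)
        out.append(match[0] if match else word)
    return " ".join(out)

fixed = apply_vocabulary("patient shows tachicardia after metoprolal dose")
print(fixed)  # → "patient shows tachycardia after metoprolol dose"
```

The phonetic misspellings are snapped to the dictionary while ordinary words pass through untouched — the same effect, in miniature, that vendor "dictionary hint" parameters achieve inside the decoder itself.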

Spontaneous speech. Incomplete sentences, false starts, speaker overlaps, laughter, coughing, and informal language constructions remain difficult for ASR systems. An important distinction applies here: formal transcription (clean, edited sentences) and verbatim transcription (capturing everything that was said) require different approaches — and automatic transcription currently sits somewhere in between.


Outlook for the Next 5–10 Years

Where will speech transcription be in 2031? Personalized models available as a cloud service are a realistic goal within 2–3 years: a user uploads a voice sample and receives a model adapted to their accent and vocal characteristics. Fine-tuning on domain terminology (a doctor uploads 10 hours of their recordings) will likely be available as a standard offering.

The boundary between "transcription" and "speech understanding" will continue to blur. End-to-end models will extract structured information directly from audio — without the intermediate step of text transcription. Edge models will reach the quality of today's cloud solutions probably within 3–5 years; strongly accented speech and domain terminology will be largely solved. Truly robust transcription of spontaneous speech — with overlaps, false starts, and informal delivery — will likely take longer.

One thing is certain: the pace of change in this field is not slowing. For systems that work with transcription, this means one practical thing: architecture must anticipate that today's best model will not be the best model a year from now. A multi-model ensemble approach addresses this directly — a new model can be added to the set without rewriting the entire system, while existing models remain available for comparison.
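One minimal shape for such an ensemble is a registry of interchangeable backends behind a single callable signature, so a new model is one registration away. The two stub models below stand in for real engines (a cloud API, a local checkpoint); their names and outputs are invented:

```python
from typing import Callable, Dict

# Registry mapping a model name to a transcription callable.
BACKENDS: Dict[str, Callable[[bytes], str]] = {}

def register(name: str):
    """Decorator that adds a backend to the registry under `name`."""
    def wrap(fn: Callable[[bytes], str]):
        BACKENDS[name] = fn
        return fn
    return wrap

@register("model-a")
def model_a(audio: bytes) -> str:
    return "hello world"   # stub result; a real backend would decode `audio`

@register("model-b")
def model_b(audio: bytes) -> str:
    return "hello word"    # stub result

def transcribe_all(audio: bytes) -> Dict[str, str]:
    """Run every registered model so outputs can be compared side by side."""
    return {name: fn(audio) for name, fn in BACKENDS.items()}

print(transcribe_all(b"\x00\x01"))
# → {'model-a': 'hello world', 'model-b': 'hello word'}
```

Swapping in next year's best model is then a single new `@register` entry; nothing downstream of `transcribe_all` changes, and older models stay available for comparison.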


Sources: