Multilingual Transcription: How to Handle Recordings Where Languages Alternate
An international meeting, a podcast with a guest from another country, or a bilingual workplace conversation: all these situations produce recordings where languages alternate. Modern ASR models handle multilingual content far better than models did five years ago, but results depend heavily on exactly how the languages alternate in the recording. Sequential switching between blocks is handled well; code-switching within a single sentence is challenging for any model. This article explains when transcription works reliably, where the blind spots are, and how to approach mixed-language recordings in practice.
Types of Multilingual Recordings — What Matters
Not all multilingual recordings present the same technical problem. The way languages alternate determines how reliable a transcript the user will get. Distinguishing three basic types helps set realistic expectations before transcription even begins.
Sequential Switching — Blocks in Different Languages
The simplest case: entire sections of a recording proceed in one language, then a clear transition to another occurs. A typical example is an international conference where each speaker uses their own language, or a call centre where an agent switches between languages between calls. The ASR model detects the language at the beginning of each segment and transcribes it consistently. Accuracy roughly matches monolingual recording accuracy for the given language — sequential switching therefore does not present a special technical problem.
Code-Switching — Alternating Within a Sentence
Bilingual or multilingual speakers naturally switch languages mid-utterance or mid-sentence, often without realizing it. In a bilingual workplace, this might sound like: "I was thinking today, dass es keinen Sinn macht, das so zu machen." Academic environments produce the reverse pattern, sentences built in the primary language around English technical terms: "Die Ergebnisse zeigen ein significant improvement gegenüber der baseline."
This is technically the most difficult case for ASR models. The model must detect language at the sub-segment level — ideally at the individual word level — but most models work with segments lasting seconds or tens of seconds. In practice, the model "bets" on the dominant language of the segment and transcribes minority-language portions phonetically, translates them, or skips them. A 2023 study published at the ACL conference (Shi et al.) showed that Whisper's accuracy on code-switched data drops by 15 to 40 percent compared to monolingual benchmarks — the range depends on the specific language combination and degree of switching.
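The accuracy drop reported for code-switched speech can be checked on one's own data by computing word error rate (WER) separately for monolingual and code-switched test segments. A minimal sketch of the standard Levenshtein-based WER; the hypothesis strings are illustrative toy data, not real model output:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Compare a monolingual segment against a code-switched one (toy data):
mono = wer("the results look good", "the results look good")
mixed = wer("I was thinking dass es keinen Sinn macht",
            "I was thinking das es kann und zien macht")
print(mono, mixed)
```

Running per-segment WER like this across a labeled sample quickly shows whether the degradation on a given language pair is closer to the 15 percent or the 40 percent end of the reported range.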
Closely Related Language Pairs
Closely related languages with high mutual intelligibility, such as Czech and Slovak or Danish and Norwegian, present a specific challenge. Natural alternation between them in workplace teams or families is common. For transcription models, this creates a particular problem: the languages are similar enough that the model may confuse them.
Whisper distinguishes closely related languages, but for mixed segments it tends to transcribe words from one language phonetically as if they were from the other, or vice versa. Practical recommendation: for recordings with a significant component of a related language, explicitly set the model's language to that language for the relevant segments, rather than relying on automatic detection.
How Models Detect and Switch Languages
Whisper — A Multilingual Model Trained on 100+ Languages
OpenAI Whisper was trained on 680,000 hours of labeled audio covering more than a hundred languages (Radford et al., 2022, arxiv.org/abs/2212.04356). Automatic language detection works by estimating the language from the first few seconds of a segment and prepending the corresponding language token to the decoder's input. Detection occurs at the segment level (windows of up to roughly thirty seconds), not at the individual word level.
This has implications for multilingual recordings. For sequential switching, detection works well: each block is long enough for the model to correctly identify the language. For code-switching within a segment, the model works with the dominant language and processes remaining portions with lower accuracy.
Automatic detection is triggered by leaving the language parameter unset (language=None in the reference openai-whisper implementation; some hosted APIs spell it language=auto). For recordings where the user knows in advance which languages are present, it is better to set the language explicitly, or separately for each segment if the processing system allows it.
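The choice between detection and a pinned language can be captured in a small helper. A minimal sketch, assuming the reference openai-whisper API, where transcribe() accepts an ISO 639-1 language code and omitting the parameter leaves detection to the model:

```python
from typing import Optional


def transcribe_kwargs(language: Optional[str]) -> dict:
    """Build keyword arguments for a whisper transcribe() call.

    language=None leaves detection to the model (the "auto" behaviour);
    an explicit ISO 639-1 code like "de" or "en" pins the decoder to
    that language for the whole run.
    """
    kwargs = {"task": "transcribe"}
    if language is not None:
        kwargs["language"] = language  # skip automatic detection entirely
    return kwargs


# Known bilingual recording with German-dominant segments: pin the language.
print(transcribe_kwargs("de"))
# Unknown content: let the model decide from the first seconds of audio.
print(transcribe_kwargs(None))
```

In the reference implementation the call would then be model.transcribe("audio.wav", **transcribe_kwargs("de")); parameter names for hosted APIs differ and should be checked against their documentation.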
Deepgram, AssemblyAI, and Utterance-Level Detection
Deepgram Nova-2 and AssemblyAI offer language detection at the utterance level — a shorter speech unit bounded by pauses. This is a step beyond Whisper's segment-level detection, but still does not work reliably for switching mid-utterance. For production systems processing international recordings, these models are a good choice for sequential switching; code-switching within utterances remains a weakness.
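For orientation, the detection switches for both providers are plain request parameters. The names below follow the providers' public documentation at the time of writing and should be verified against current docs before use:

```python
# Hypothetical request payloads; parameter names are assumptions taken
# from the providers' public docs and may change between API versions.
deepgram_params = {
    "model": "nova-2",
    "detect_language": "true",   # enable Deepgram language detection
    "punctuate": "true",
}

assemblyai_config = {
    "language_detection": True,  # enable AssemblyAI automatic detection
}

print(deepgram_params, assemblyai_config)
```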
Practical Approaches for Mixed-Language Recordings
Preparation and Model Configuration
The first step is to know the character of the recording before starting transcription. Is it sequential switching or code-switching? What languages does the recording contain? This basic information determines the strategy.
For sequential switching between two languages, running transcription with automatic language detection is sufficient — the result will be of good quality. For recordings with code-switching or mixed content with unclear distribution, explicit language setting or manual splitting of the recording into segments helps.
A multi-model transcription system may allow setting one language for the entire transcription or running multiple models in parallel, each with a different language setting. A merging layer then selects the best transcript for each segment.
Parallel Transcription and Intelligent Merging
A more advanced approach involves running transcription two or more times in parallel: once set to one language, once to another, and potentially a third. Each model then transcribes the recording from the perspective of "its" language, and a merger model decides which transcript is more accurate for a given segment.
This approach improves accuracy particularly for technical vocabulary, where one model consistently makes errors. The cost is higher computational demand and longer processing time — a reasonable investment for recordings where accuracy matters.
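The merging step can be sketched simply. A minimal merger, assuming each run returns segments with a text and an average log-probability (mirroring the avg_logprob field Whisper reports per segment) and that both runs produced the same segmentation; real systems align segments by timestamps instead of by index:

```python
def merge_transcripts(run_a, run_b):
    """Pick, per segment, the transcript with higher decoder confidence.

    Each run is a list of dicts with 'text' and 'avg_logprob' keys.
    Assumes identical segmentation across runs (a simplification;
    production mergers align segments by start/end times).
    """
    merged = []
    for seg_a, seg_b in zip(run_a, run_b):
        best = seg_a if seg_a["avg_logprob"] >= seg_b["avg_logprob"] else seg_b
        merged.append(best["text"])
    return merged


# Toy output of two parallel runs, one pinned to English, one to German:
english_run = [
    {"text": "I was thinking", "avg_logprob": -0.2},
    {"text": "does as keen incense mark", "avg_logprob": -1.4},
]
german_run = [
    {"text": "Ei was sinking", "avg_logprob": -1.1},
    {"text": "dass es keinen Sinn macht", "avg_logprob": -0.3},
]
print(merge_transcripts(english_run, german_run))
```

Each run is confident on "its" language and uncertain on the other, so a per-segment confidence comparison recovers the correct transcript on both sides of the switch.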
Post-Processing and Manual Correction
Mixed-language recordings require more thorough manual review than monolingual ones. Priorities for review are transitions between languages, proper names transcribed phonetically, and numbers and abbreviations. For recordings that serve as the basis for legal or medical documents, manual verification of critical passages is a necessity regardless of the automatic accuracy achieved.
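The review priorities above can be partially automated by flagging segments that contain digits, abbreviations, or capitalized tokens mid-sentence (candidate proper names). A rough sketch; the patterns are illustrative, not exhaustive:

```python
import re


def review_flags(text: str) -> list:
    """Return reasons a transcript segment deserves manual review."""
    flags = []
    if re.search(r"\d", text):
        flags.append("contains digits")
    if re.search(r"\b[A-Z]{2,}\b", text):
        flags.append("contains abbreviation")
    # Capitalized word that is neither at segment start nor right after
    # sentence punctuation: a candidate proper name.
    if re.search(r"(?<![.!?]\s)(?<!^)\b[A-Z][a-z]+", text):
        flags.append("possible proper name")
    return flags


print(review_flags("Meeting with Dr. Novak at 15:30 about the GDPR audit"))
```

Flagged segments can be queued for human review first, which concentrates correction effort on the passages where mixed-language recognition fails most often.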
Technical Limits and Realistic Expectations
What No Model Handles Reliably
There are situations where current models fail regardless of how the transcription is configured. Language switching in the middle of a word (hybrid compounds like "IT-solutions" or anglicisms integrated into sentence structure) causes confusion in phonetic decoding. A speaker's strong accent in their second language can lead the model to assign a segment to the speaker's native language rather than the language they are actually speaking.
Languages with low representation in training data are another limitation. For combinations of major European languages, Whisper delivers relatively good results; for combinations involving under-represented languages, results are significantly worse, because mixed speech in those languages was barely present in the training data.
When It Is Better to Split the Recording Before Transcription
For recordings where language blocks are clearly separated and their length exceeds two minutes, manual or automatic splitting and transcribing each segment separately with the correct language setting pays off. The result will be more accurate than transcribing the entire recording at once.
Tools like FFmpeg or Audacity allow quick segment trimming without quality loss. An alternative is speaker diarization — identifying speakers in the track — and transcribing each speaker separately with the correct language setting. This approach works well in situations where each speaker predominantly uses one language, even if they occasionally switch.
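Splitting by known language blocks is easy to script. A sketch that builds one FFmpeg command per block, using stream copy (-c copy) so no re-encoding or quality loss occurs; the times, filenames, and language codes are illustrative values:

```python
def ffmpeg_split_commands(source: str, blocks):
    """One lossless-trim FFmpeg command per (start, end, language) block."""
    commands = []
    for i, (start, end, lang) in enumerate(blocks):
        out = f"part{i:02d}_{lang}.wav"
        commands.append(
            f"ffmpeg -i {source} -ss {start} -to {end} -c copy {out}"
        )
    return commands


# Illustrative language blocks for a bilingual meeting recording:
blocks = [("00:00:00", "00:04:30", "en"),   # English opening block
          ("00:04:30", "00:11:00", "de")]   # German discussion block
commands = ffmpeg_split_commands("meeting.wav", blocks)
for cmd in commands:
    print(cmd)
```

Each resulting file can then be transcribed with the correct explicit language setting, which is exactly the situation where accuracy approaches the monolingual baseline.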
Conclusion
Multilingual transcription works reliably for sequential language switching and is acceptable for recordings with a moderate degree of code-switching. For intensive switching within sentences or for specific language combinations, the limits remain real. Knowing these boundaries in advance enables choosing the right strategy — and saving time on subsequent corrections.
Sources:
- Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arxiv.org/abs/2212.04356
- Shi, Z. et al. (2023). Findings of the NOTSOFAR-1 Challenge. ACL 2023, Association for Computational Linguistics.
- Deepgram. (2024). Language Detection Documentation. deepgram.com/docs/language-detection
- AssemblyAI. (2024). Automatic Language Detection. assemblyai.com/docs/speech-to-text/language-detection