Audio Quality and Transcription: What Noise, Echo, and Compression Do to Your Results
Automatic transcription models are trained primarily on clean speech, but real-world recordings are not clean. HVAC noise, conference room echo, or compression artefacts from a video call increase error rates in ways that even the best available models cannot fully compensate for. Transcription models improve year by year, but the physics of acoustics does not change: noise 40 dB above the speech signal (an SNR of -40 dB) cannot be removed by any neural network. This article explains exactly what happens on the path from microphone to text, which numbers measure the degradation, and where in the chain it makes sense to intervene.
SNR — the Foundation of Everything
What Signal-to-Noise Ratio Is and How It Is Measured
Signal-to-Noise Ratio (SNR) expresses the ratio of useful signal power — speech — to the power of all ambient noise. It is measured in decibels using the formula SNR = 10 * log10(P_signal / P_noise). In practice: SNR 20 dB means speech power is a hundred times greater than noise power; SNR 0 dB means they are equal; negative SNR values indicate noise is stronger than speech.
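The formula is straightforward to apply to raw samples. A minimal sketch in Python, assuming the speech and noise portions of a recording are available as separate NumPy arrays:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)  # mean power of the speech signal
    p_noise = np.mean(noise ** 2)    # mean power of the noise
    return 10 * np.log10(p_signal / p_noise)

# Speech with 100x the noise power -> SNR = 20 dB, matching the example above
speech = np.ones(1000) * 10.0  # amplitude 10 -> power 100
noise = np.ones(1000)          # amplitude 1  -> power 1
print(round(snr_db(speech, noise), 1))  # -> 20.0
```

In practice the noise power is estimated from a silent passage of the same recording, since a separate noise-only track is rarely available.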
Typical values in practice: a professional recording studio achieves SNR above 40 dB. A quiet office without ventilation runs around 25-30 dB. A conference room with air conditioning running drops to 10-20 dB. An open-plan office with background activity or a recording from a cafe environment falls below 10 dB.
The relationship between SNR and Word Error Rate (WER) is not linear; degradation accelerates as SNR falls. Research on the CHiME benchmarks (Barker et al., 2015), which test transcription in real noisy environments, shows that at SNR above 20 dB, WER stays in the single-digit percentage range for modern models. When SNR drops to 10-15 dB, WER rises to 10 to 25 percent. Below 5 dB SNR, WER exceeds 40% even for models trained on noisy data. For reference: 40% WER means roughly every second or third word is wrong, and the resulting text is barely usable.
Types of Noise and Their Impact on Transcription
Not all noise is equally harmful. Stationary noise — air conditioning, CPU fan, electrical hum at 50 Hz — has a stable and predictable frequency profile. Spectral subtraction algorithms handle it relatively reliably: they measure the noise profile during a silent passage and then subtract it from the entire recording. Background air conditioning therefore troubles the transcription model less than an uninformed observer might expect.
Non-stationary noise is a different category. Keyboards, traffic, voices of other people in the room, or restaurant chatter change over time and have no stable frequency character. Filters designed for stationary noise fail on them. Transcription models struggling with this type of noise make more errors especially on short monosyllabic words and at the beginnings and ends of words. Impulse noise — a tap on the table, a cable pop, a click — is typically brief but causes a local gap in the transcription and can confuse diarization (speaker identification).
Echo and Reverb — Room Acoustics
The Physical Principle and Why Echo Hurts Transcription
Echo occurs when sound reaches the microphone via multiple paths — directly from the mouth and indirectly after reflecting off walls, ceiling, and floor. The microphone then hears each word multiple times with varying delays, where each copy has a different frequency profile altered by surface absorption.
The standard measure of room reverberation is RT60 — the time it takes for sound level to drop by 60 dB after the source stops producing sound. The ANSI S12.60-2002 standard for educational spaces recommends RT60 below 0.6 seconds for an empty room; for optimal speech intelligibility, the ideal is below 0.4 seconds. Typical values: a carpeted office with furniture achieves 0.3-0.5 s; a bare concrete conference room 1-1.5 s; an empty lecture hall 1.5-3 s; an empty warehouse or church over 3 seconds.
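RT60 can be estimated from a measured impulse response (for example, a clap or balloon pop recorded in the room) using Schroeder backward integration: integrate the squared impulse response from the tail backwards, fit the -5 to -25 dB portion of the decay, and extrapolate to 60 dB. A sketch, validated here on a synthetic exponential decay with a known RT60:

```python
import numpy as np

def rt60_from_ir(ir: np.ndarray, fs: int) -> float:
    """Estimate RT60 from an impulse response via Schroeder backward
    integration, using the -5 to -25 dB decay range (T20, extrapolated x3)."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]       # energy decay curve (tail sums)
    edc_db = 10 * np.log10(edc / edc[0])       # normalize to 0 dB at t = 0
    t = np.arange(len(ir)) / fs
    i5 = np.argmax(edc_db <= -5)               # first sample below -5 dB
    i25 = np.argmax(edc_db <= -25)             # first sample below -25 dB
    return 3 * (t[i25] - t[i5])                # extrapolate 20 dB span to 60 dB

# Synthetic impulse response: exponential decay tuned so RT60 = 0.5 s
fs = 16000
t = np.arange(int(0.8 * fs)) / fs
ir = np.exp(-t * (6.91 / 0.5))  # 6.91 = ln(10^(60/20)), i.e. a 60 dB amplitude drop
print(round(rt60_from_ir(ir, fs), 2))  # ~0.5
```

Real impulse responses are noisy, so production tools fit a regression line over the decay range rather than reading off two samples; the sketch shows the principle only.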
Transcription models struggle with echo for two reasons. First, the model must distinguish word boundaries — but at RT60 1.5 s, reverberation extends across two or three subsequent syllables, so boundaries are not acoustically clean. Second, similar-sounding phonemes (b/p, d/t, g/k) differ in the duration of their burst phase — and this short acoustic event is the first thing blurred by echo. The result is confusion of voiced and unvoiced consonants, which significantly increases WER.
Practical Measures Without Structural Modifications
The most effective measure against echo costs the least: shorten the distance between microphone and mouth. The inverse square law applies — doubling the distance reduces direct sound intensity fourfold, while reflected sound remains approximately the same. A microphone 20 cm from the mouth captures a significantly stronger direct signal than a microphone on a table a metre from the speaker, even in the same room.
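The effect of distance is easy to quantify. A small sketch applying the inverse square law to the direct-sound level (the distances are illustrative, not prescriptive):

```python
import math

def direct_level_drop_db(d1: float, d2: float) -> float:
    """Drop in direct-sound level (dB) when moving a microphone from
    distance d1 to d2; intensity falls as 1/d^2 (inverse square law)."""
    return 10 * math.log10((d2 / d1) ** 2)

# Doubling the distance: fourfold intensity drop = ~6 dB
print(round(direct_level_drop_db(0.2, 0.4), 1))  # -> 6.0
# 20 cm from the mouth vs. a table mic at 1 m: ~14 dB weaker direct signal
print(round(direct_level_drop_db(0.2, 1.0), 1))  # -> 14.0
```

Since the reverberant field stays roughly constant, each of those decibels comes straight out of the direct-to-reverberant ratio the transcription model receives.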
Soft surfaces in the room absorb sound and shorten RT60. Carpet, curtains, upholstered chairs, and fully stocked bookshelves function as passive acoustic dampeners — usually available without any room modifications. A practical tip for improvised recording: record in a room with a wardrobe full of clothes, which absorbs reflected sound better than almost any other available surface.
Cardioid (directional) microphones have a heart-shaped pickup pattern — they capture sound primarily from the front and suppress input from the sides and rear. In practice, this means reflected sounds arriving from walls are captured significantly weaker than the speaker's direct voice. Omnidirectional microphones are suitable for group recording around a table but in reverberant environments inevitably capture more echo.
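The ideal cardioid polar pattern has sensitivity g(theta) = (1 + cos theta) / 2 relative to the on-axis direction, which makes the off-axis suppression easy to quantify:

```python
import math

def cardioid_gain_db(angle_deg: float) -> float:
    """Ideal cardioid sensitivity relative to on-axis (0 degrees), in dB.
    Polar pattern: g(theta) = (1 + cos(theta)) / 2."""
    g = (1 + math.cos(math.radians(angle_deg))) / 2
    return 20 * math.log10(g) if g > 0 else float("-inf")

print(round(cardioid_gain_db(0), 1))   # -> 0.0  (front: full sensitivity)
print(round(cardioid_gain_db(90), 1))  # -> -6.0 (sides)
print(cardioid_gain_db(180))           # -> -inf (rear null, ideal pattern only)
```

Real microphones do not achieve the perfect rear null, and the pattern varies with frequency, but the 6 dB side suppression is a reasonable rule of thumb for wall reflections arriving off-axis.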
Compression Artefacts and File Formats
How Lossy Compression Damages Speech
MP3 and AAC use psychoacoustic masking — they trim frequency components that the human ear theoretically cannot perceive in a given context because they are masked by stronger components in adjacent frequency bands. The algorithm assumes that what a human cannot hear does not need to be stored. But a transcription model does not work like human hearing — its internal representation of sound differs from the human auditory system and may be sensitive to frequency components that the psychoacoustic model declares expendable.
The most vulnerable sounds are sibilants and affricates — sounds like s, sh, z, ch. Their energy lies in the 4-8 kHz band, which lossy compression at low bitrates trims first. These sounds are important for distinguishing word forms in many languages. At 32 kbps MP3 mono, losses are noticeable even to human listeners; 64 kbps is borderline; 128 kbps and above cause minimal degradation for transcription.
Repeated compression is a separate problem. A recording saved as MP3, transferred via WhatsApp or Messenger (which performs its own compression), and then saved again as MP3 has gone through compression two or three times. Each pass adds new artefacts because the decompressed file never exactly equals the original and the compressor again seeks frequency components to trim. The resulting recording sounds seemingly acceptable to human listeners, but the transcription model achieves significantly worse accuracy on it.
Recommended Formats and Settings for Transcription
WAV with uncompressed PCM (16-bit, 16 kHz, mono) is the ideal format for transcription. It is lossless, compatible with every transcription interface, and 16 kHz matches the speech intelligibility band — higher sampling rates add unnecessary data for transcription. The downside is large files: approximately 1.9 MB per minute.
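The file size follows directly from the PCM parameters. A quick check of the figure above:

```python
def wav_mb_per_minute(sample_rate: int = 16000, bits: int = 16, channels: int = 1) -> float:
    """Uncompressed PCM data rate in MB per minute (ignoring the small WAV header)."""
    bytes_per_sec = sample_rate * (bits // 8) * channels
    return bytes_per_sec * 60 / 1_000_000

print(round(wav_mb_per_minute(), 2))            # -> 1.92 (the ~1.9 MB/min above)
print(round(wav_mb_per_minute(44100, 16, 2), 1))  # -> 10.6 (CD-quality stereo, for comparison)
```

The comparison also shows why 44.1 kHz stereo is wasteful for transcription: more than five times the data for no gain in recognized words.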
FLAC offers lossless compression to roughly 50-60% of WAV size without any information loss. It is a good compromise for archiving and batch processing. MP3 128 kbps is acceptable for one-time transcription, unsuitable for archiving intended for repeated processing. Opus, the codec used by Zoom, Teams, and Discord, achieves quality comparable to MP3 128 kbps at 64 kbps and is acceptable when the video conferencing platform is configured appropriately. At settings of 32 kbps or less — which happens with weak connections — losses are noticeable even for the transcription model.
Audio Pre-Processing Before Transcription
Noise Reduction
Audacity Noise Reduction is a free tool with a low barrier to entry. It works in two steps: first you select a silent passage in the recording and the system saves the noise profile; then the algorithm subtracts this profile from the entire recording. It works reliably on stationary noise — air conditioning, transformer hum, fan rumble. On non-stationary noise it fails, and with aggressive settings it produces so-called musical noise — metallic, artefact-laden sound that can confuse transcription models more than the original noise.
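The two-step procedure can be sketched directly: estimate a per-frequency noise profile from a silent passage, then subtract it from every frame. A minimal spectral-subtraction sketch using SciPy's STFT; the `floor` parameter is an assumption standing in for the "reduction amount" setting, and pushing it too low produces exactly the musical-noise artefacts described above:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(audio, noise_clip, fs, floor=0.05):
    """Basic spectral subtraction: build a noise magnitude profile from a
    silent passage, subtract it per frame, keep the noisy phase."""
    f, t, S = stft(audio, fs, nperseg=512)
    _, _, N = stft(noise_clip, fs, nperseg=512)
    noise_profile = np.mean(np.abs(N), axis=1, keepdims=True)  # per-bin average
    mag = np.abs(S) - noise_profile                            # subtract the profile
    mag = np.maximum(mag, floor * np.abs(S))                   # spectral floor against musical noise
    _, clean = istft(mag * np.exp(1j * np.angle(S)), fs, nperseg=512)
    return clean
```

This works only because stationary noise has a stable profile; for keyboard clatter or background voices, the profile measured during silence no longer matches the noise during speech, which is exactly why the method fails on non-stationary noise.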
DeepFilterNet (Schröter et al., 2022) is an open-source neural network for noise reduction. Unlike traditional spectral algorithms, DeepFilterNet learned to distinguish speech from noise from a vast number of examples and handles non-stationary noise as well. Results achieve quality close to studio-grade reduction with minimal artefacts. The tool is freely available on GitHub and runs locally without sending data to external servers. RNNoise is a lighter alternative for real-time applications with low computational requirements but lower effectiveness on complex noise.
Normalization and Basic Signal Processing
Volume normalization brings the recording to a standard level before submission for transcription. The transcription API then receives neither an overly quiet signal (where noise dominates) nor an overdriven recording (where clipping occurs). A common target is -16 LUFS (standard for voice content) or -20 dBFS as peak normalization.
A high-pass filter set to 80-100 Hz removes low-frequency content that speech does not carry — distant traffic, building vibrations, electrical mains hum. A limiter protects against impulse peaks that could distort normalization or cause clipping during conversion. It is important to follow the correct order: noise reduction first, then normalization — if you reverse the order, normalization amplifies noise along with the signal.
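The chain above can be sketched in a few lines with SciPy: high-pass at 80 Hz, then peak normalization. The cutoff and target level are the figures from the text, not universal constants:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(audio: np.ndarray, fs: int, peak_dbfs: float = -20.0) -> np.ndarray:
    """High-pass at ~80 Hz, then peak-normalize. Noise reduction, if used,
    belongs before this normalization step so noise is not amplified with
    the signal."""
    sos = butter(4, 80, btype="highpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, audio)        # remove rumble below the speech band
    target = 10 ** (peak_dbfs / 20)           # -20 dBFS -> 0.1 linear amplitude
    return filtered * (target / np.max(np.abs(filtered)))
```

Note that this performs peak normalization; loudness normalization to -16 LUFS requires a perceptual gating measurement (ITU-R BS.1770) and a dedicated library.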
How the Ensemble Approach Helps with Degraded Audio
Different transcription models have different architectures and training datasets. This means each reacts to a specific type of degradation differently. One model may be more robust to echo due to data augmentation during training; another may handle compression artefacts better due to a different pre-processing layer.
A multi-model transcription system can combine results from multiple models, and the merging layer then selects the most likely variant for each segment. With degraded audio, this means in practice that one model's error need not be every model's error — the merged result is then more accurate than any single model's output. This effect is not unlimited, however: at very low SNR, error rates of all models rise in parallel and merging results cannot compensate for the physical loss of information in the recording. The ensemble approach increases resilience to degradation but does not replace good recording conditions.
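The merging layer can be illustrated with a deliberately simplified sketch. Everything here is hypothetical: real systems align segments by timestamps and use more elaborate voting, but the principle (one model's low-confidence segment is replaced by another's) is the same:

```python
def merge_segments(model_outputs):
    """model_outputs: one transcript per model; each transcript is a list of
    (text, confidence) tuples, assumed pre-aligned by segment index.
    Keeps the highest-confidence variant per segment."""
    merged = []
    for variants in zip(*model_outputs):         # one tuple of variants per segment
        best = max(variants, key=lambda v: v[1])  # pick by confidence score
        merged.append(best[0])
    return " ".join(merged)

a = [("the quick", 0.9), ("brown sox", 0.4)]  # model A: weak on segment 2
b = [("the quip", 0.5), ("brown fox", 0.8)]   # model B: weak on segment 1
print(merge_segments([a, b]))  # -> "the quick brown fox"
```

The sketch also shows the limit stated above: if every model returns a low-confidence wrong variant for the same segment, picking the "best" of them cannot recover the lost information.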
Practical Recording Recommendations
Choosing a Microphone for the Situation
USB condenser microphones (Blue Yeti, HyperX QuadCast, Rode NT-USB Mini) are suitable for home offices and podcasts. Cardioid pickup patterns effectively suppress ambient noise. Lavalier (lapel) microphones bring the pickup element close to the speaker's mouth without a stand — ideal for lectures, interviews, or field recording. Dynamic microphones (Shure SM7B, Electro-Voice RE20) are less sensitive than condensers, so they capture less ambient noise — they are the standard for broadcasting in noisier environments.
A headset with an integrated microphone is an acceptable compromise for video conferences because the microphone stays close to the mouth. Recording quality typically lags behind dedicated microphones — condenser elements in headsets are smaller and frequency range is limited.
Environment and Microphone Placement
The optimal microphone-to-mouth distance is 15 to 30 cm for condenser microphones and 5 to 15 cm for dynamic ones. At shorter distances, the proximity effect appears: a bass boost that directional microphones exhibit close to the source, which changes the voice's timbre. At greater distances, the proportion of reflected sound increases.
The worst environments for recording are large empty rooms, corridors, and spaces with hard walls and floors: bare concrete, glass, or ceramic tiles absorb minimal sound, and RT60 easily exceeds one second. The wardrobe trick mentioned above remains the best improvised treatment available almost anywhere; densely hung fabric absorbs mid- and high-frequency reflections remarkably well.
Conclusion
Automatic transcription accuracy depends on three main physical factors: signal-to-noise ratio (SNR), room acoustics (RT60), and the recording's compression method. Each of these factors can be influenced — and the resulting impact on WER is measurable and predictable.
Priority order for practice: first, it pays to focus on room acoustics and microphone distance — these are changes with the greatest impact and lowest cost. Only then does it make sense to address format and bitrate choices, or pre-processing the recording before transcription. Transcription models are constantly improving and the ensemble approach combining multiple models increases resilience to imperfect recording quality. But the physical loss of information in the recording is irreversible — audio data that the microphone did not capture cannot be filled in by any model.
Sources:
- Kinoshita, K. et al. (2016). A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing, 2016.
- Schröter, H. et al. (2022). DeepFilterNet: A low complexity speech enhancement framework for full-band audio based on deep filtering. ICASSP 2022. https://arxiv.org/abs/2110.05588
- ANSI S12.60-2002. Acoustical Performance Criteria for Schools.
- Barker, J. et al. (2015). The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. IEEE ASRU 2015.
- Audacity Team (2024). Noise Reduction effect documentation.
- Valin, J.-M. RNNoise: Learning noise suppression. https://jmvalin.ca/demo/rnnoise/