Recording, File, Stream: How to Deliver Audio for Transcription and What Matters
Transcription does not begin with clicking a button. The recording format, sample rate, number of channels, and the way audio is delivered — each of these parameters affects how quickly and accurately processing occurs. This guide explains what drives format choices and what to recommend for the most common situations.
Three Ways to Deliver Audio — and What Each Means
Uploading a Finished File
The most common case: the user has a finished recording stored on their computer or in the cloud and wants to transcribe it. The advantages are significant: you can check audio quality before submitting, add custom terminology, and, if needed, repeat processing with different settings or a different model.
Suitable for: interviews, lectures, meetings, podcasts — anything recorded in advance.
Recording Directly in the Tool
The user records audio directly through a web interface or mobile application and submits it for processing without an intermediate step. Advantage: a direct path with no waiting for upload. Risk: poor acoustics or a low-quality microphone, combined with immediate submission without a quality check. Because the recording is not reviewed before submission, a recording error only shows up in the transcription result.
Recommendation: before any important recording, always make a test recording and play it back.
Streaming — Live Transcription in Real Time
Audio arrives continuously (microphone, VoIP call, live broadcast) and the transcript appears progressively — with latency in the range of fractions of a second to a few seconds. Technically this is a different approach from batch processing: the model does not wait for a complete file; it works with short audio frames without knowledge of future words.
The consequence for accuracy: streaming transcription is typically less accurate than batch processing of a complete file, because the model lacks the context of future words for its current decisions. A comparison of real-time versus batch processing is covered in detail in A15.
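To make the difference from batch processing concrete, the sketch below cuts raw PCM audio into fixed-length frames, roughly the way a streaming client would before sending them one by one. The 20 ms frame length, the 16-bit mono format, and the name `iter_frames` are assumptions chosen for illustration, not the behaviour of any particular service.

```python
from typing import Iterator


def iter_frames(pcm: bytes, sample_rate: int = 16_000,
                frame_ms: int = 20, bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive fixed-length frames from raw mono PCM audio."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), frame_bytes):
        # The last frame may be shorter if the audio does not divide evenly.
        yield pcm[start:start + frame_bytes]


# One second of silence at 16 kHz, 16-bit mono (hypothetical input).
one_second = bytes(16_000 * 2)
frames = list(iter_frames(one_second))
print(len(frames), len(frames[0]))  # 50 frames of 640 bytes each
```

Each frame covers only 20 ms of speech, which is why a streaming model has to commit to words before it has heard what follows.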
Audio File Formats — What They Are and Why They Matter
Lossless Formats: WAV and FLAC
WAV is an uncompressed format. The file contains an exact numerical representation of the audio without any compression. For transcription this is the ideal foundation — no information was lost during saving.
FLAC is lossless compression. The file is smaller than WAV, but the audio is mathematically identical — decoding fully reconstructs the WAV. Ideal for archiving or transfer with limited bandwidth.
For transcription: both formats give the algorithm the maximum available information.
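Before submitting a WAV file, its basic parameters can be read with Python's standard `wave` module. This is a minimal sketch: the helper name `describe_wav` is hypothetical, and the example builds a tiny in-memory file so that it runs without any external recording. The `wave` module reads uncompressed PCM WAV only; FLAC needs a separate decoder.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic parameters of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


# Build a one-second silent WAV in memory so the example is self-contained.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes = 16-bit samples
    out.setframerate(16_000)
    out.writeframes(bytes(16_000 * 2))
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```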
Lossy Formats: MP3, AAC, M4A, OGG
Lossy compression algorithms remove parts of the sound spectrum that the human ear perceives less acutely. The key parameter is bitrate:
- 320 kbps MP3: virtually indistinguishable from WAV on listening; fully adequate for transcription.
- 128 kbps MP3: standard quality; typically acceptable for transcription of spoken language.
- Below 128 kbps: noticeable degradation; sibilants and sounds with high frequencies are the first to be affected — and these phonemes are important for word discrimination.
M4A (Apple format, AAC codec) from a smartphone: typically good quality, 128–192 kbps, sample rate 44.1 kHz. Fully sufficient for spoken language transcription.
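The practical trade-off behind these bitrates is file size. As a rough sketch, assuming a constant bitrate and ignoring container overhead, the size follows directly from bitrate times duration; the function name `compressed_size_mb` is illustrative only.

```python
def compressed_size_mb(bitrate_kbps: int, duration_s: float) -> float:
    """Approximate size of a constant-bitrate audio stream in megabytes (10^6 bytes)."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000


one_hour = 3600
for kbps in (320, 128, 64):
    print(f"{kbps} kbps: {compressed_size_mb(kbps, one_hour):.1f} MB per hour")
```

For a one-hour recording this gives about 144 MB at 320 kbps, 57.6 MB at 128 kbps, and 28.8 MB at 64 kbps.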
Video Files: MP4, MOV, MKV
Transcription services extract the audio track automatically. Audio from video is typically adequate as long as the video was recorded with a reasonable microphone. Video conference recordings (Zoom, Teams) depend on platform settings — a free-tier Zoom plan uses a lower bitrate than a Pro plan.
Telephone Recordings (8 kHz) — a Special Case
Telephony works at an 8 kHz sample rate, which captures only frequencies up to 4 kHz. The 4–8 kHz band is physically absent, and much of the energy that distinguishes sibilants and other fricatives ("s," "sh," "z") lies in this band. Models trained on 16 kHz audio therefore receive an incomplete signal from a telephone recording, and transcription accuracy is significantly lower.
Recommendation for telephone recordings: where possible, record via an application with higher quality (for example, a dedicated recording app with quality settings). A standard GSM telephone call is a technical limit for transcription.
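The 4 kHz ceiling is the Nyquist limit: a sampled signal can only represent frequencies up to half its sample rate. A minimal illustration, with `max_frequency_hz` as a hypothetical helper name:

```python
def max_frequency_hz(sample_rate_hz: int) -> float:
    """Highest frequency a sampled signal can represent (Nyquist limit)."""
    return sample_rate_hz / 2


for rate in (8_000, 16_000, 44_100):
    print(f"{rate} Hz sampling -> content up to {max_frequency_hz(rate):.0f} Hz")
```

Upsampling a telephone recording to 16 kHz afterwards does not bring the missing band back; it only changes the container for the same 0–4 kHz content.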
Sample Rate and Channel Count
Sample Rate — 16 kHz Is Sufficient
16 kHz is the standard for ASR (automatic speech recognition). Whisper, Google STT, Deepgram — all are primarily trained on this standard.
44.1 kHz or 48 kHz is the music standard — redundant for transcription. The model internally downsamples the data. The resulting accuracy is no higher; the file is unnecessarily larger.
8 kHz (telephone) is below standard — see above.
Practical recommendation: record at the device default of 44.1 kHz (the standard for smartphones and recording applications) and let the transcription service handle the resampling.
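How much larger a higher sample rate makes an uncompressed file can be estimated directly. The sketch below assumes 16-bit PCM and ignores the small WAV header; `pcm_size_mb` is an illustrative name.

```python
def pcm_size_mb(sample_rate_hz: int, duration_s: float,
                channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Size of uncompressed PCM audio in megabytes (10^6 bytes)."""
    return sample_rate_hz * channels * bytes_per_sample * duration_s / 1_000_000


one_hour = 3600
for rate in (16_000, 44_100, 48_000):
    print(f"{rate} Hz mono, 16-bit: {pcm_size_mb(rate, one_hour):.1f} MB per hour")
```

One hour of mono audio comes to roughly 115 MB at 16 kHz and about 318 MB at 44.1 kHz, with no gain in accuracy from the extra data.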
Mono vs. Stereo — Situation Dependent
A stereo recording has two channels. These may be genuinely different sources on each channel — or simply a duplication of the same mono signal.
When stereo is an advantage: An interview between two people where each speaks into their own microphone, connected to its own channel of the stereo recording. The diarization algorithm then receives each voice separately, without cross-talk from the other speaker, and diarization results are significantly more reliable (see A04).
When stereo does not help: One microphone recording to a stereo file. Both channels are identical — no advantage, double the file size.
Recommendation: Two speakers on two microphones → stereo (each on their own channel). One microphone → mono is sufficient.
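Whether a stereo file really carries two sources or just a duplicated mono signal can be checked by comparing the channels. The sketch below uses only the standard library and assumes 16-bit PCM WAV; the names `channels_identical` and `make_stereo_wav` are hypothetical, and the test files are generated in memory.

```python
import array
import io
import wave


def channels_identical(source) -> bool:
    """Return True if a 16-bit stereo WAV carries the same signal on both channels."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    # Samples are interleaved: left, right, left, right, ...
    return samples[0::2] == samples[1::2]


def make_stereo_wav(left, right) -> io.BytesIO:
    """Build an in-memory 16-bit stereo WAV from two equal-length sample lists."""
    interleaved = array.array("h", [s for pair in zip(left, right) for s in pair])
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(16_000)
        out.writeframes(interleaved.tobytes())
    buffer.seek(0)
    return buffer


same = channels_identical(make_stereo_wav([0, 100, -100], [0, 100, -100]))
different = channels_identical(make_stereo_wav([0, 100, -100], [5, -20, 40]))
print(same, different)  # True False
```

If the check returns True, converting the file to mono halves its size without losing anything.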
Practical Recommendations for Common Situations
Smartphone: M4A (iOS) or MP3/AAC (Android), 44.1 kHz, 128–192 kbps — acceptable for spoken language transcription. Key parameter: distance from the mouth (optimally 15–25 cm) and quiet recording conditions.
Zoom or Teams: MP4 (video) or M4A (audio). Tip for Zoom Cloud Recording: the "Record each participant separately" feature saves each speaker to their own audio file, which is ideal for diarization (see the Zoom Cloud Recording entry under Sources).
Professional podcast: WAV 44.1 kHz, each speaker on their own channel or file. Of the typical situations, this one gives the best transcription result.
Archival or low-quality recording: First consider basic audio cleaning (Audacity, Adobe Podcast Enhance) before submitting for processing. Detailed procedure in A12.
Quick reference:
| Situation | Format | Sample rate | Channels |
|---|---|---|---|
| Smartphone | M4A / MP3 | 44.1 kHz | mono |
| Zoom / Teams | MP4 → M4A | 44.1 kHz | stereo (if each on own channel) |
| Podcast studio | WAV | 44.1 kHz | stereo / mono |
| Telephone call | MP3 | 8 kHz (limit) | mono |
| Archival recording | whatever is available | whatever is available | depends |
Conclusion
File format is not the primary question — but sample rate, bitrate, and channel count can noticeably affect transcription accuracy. WAV or M4A from a decent microphone will reliably give a good foundation. A telephone recording at 8 kHz is a physical limit that no model can overcome.
The most important steps before submitting for processing: verify the audio is not too quiet or distorted; prefer a lossless or high-quality lossy format; for multi-speaker recordings, consider stereo with each speaker on their own channel.
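The first check, whether the audio is too quiet or distorted, can be approximated by measuring peak and RMS level. This is a rough sketch for 16-bit samples; the name `level_report` and the clipping criterion (peak at full scale) are assumptions, and what counts as "too quiet" depends on the service, so no threshold is hard-coded.

```python
import array
import math


def to_dbfs(value: float, full_scale: int = 32768) -> float:
    """Convert a linear amplitude to decibels relative to full scale."""
    return 20 * math.log10(value / full_scale) if value else float("-inf")


def level_report(samples: array.array) -> dict:
    """Peak and RMS level of 16-bit samples in dBFS, plus a clipping flag."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        "clipped": peak >= 32767,  # samples pinned at full scale suggest distortion
    }


# Synthetic test signal: one second of a 440 Hz tone at half of full scale.
tone = array.array("h", [int(16384 * math.sin(2 * math.pi * 440 * n / 16_000))
                         for n in range(16_000)])
report = level_report(tone)
print({key: round(val, 1) if isinstance(val, float) else val
       for key, val in report.items()})
```

A half-scale tone peaks at about -6 dBFS. A speech recording whose peak sits far below that is likely too quiet, while a peak pinned at 0 dBFS suggests clipping.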
Sources
- Google Cloud Speech-to-Text — supported audio codecs and formats. https://cloud.google.com/speech-to-text/docs/encoding
- Deepgram — recommended formats and sample rates. https://developers.deepgram.com/docs/audio-formats
- OpenAI Whisper — supported formats. https://platform.openai.com/docs/guides/speech-to-text
- Zoom Cloud Recording — recording each participant separately. https://support.zoom.us/hc/en-us/articles/recording