Transcription

Real-Time Transcription vs Batch Processing: When to Use Which

Real-time sounds appealing — the transcript appears almost as you speak. But you pay for that immediacy with lower accuracy, more complex architecture, and higher processing costs. Batch processing waits for a completed file but returns a more accurate result. This article compares the two approaches across four key dimensions and shows how to choose the right one.


Two Different Technical Architectures

Real-time and batch transcription are not just different speeds of the same process — they are distinct technical approaches with fundamentally different trade-offs.

Batch processing operates on the principle of complete input. You upload a finished file or provide a URL to an existing recording. The API processes the file as a whole — or in chunks that are all known before transcription begins. The model therefore "sees" the entire sentence both forward and backward when transcribing each word. This bidirectional context is the source of its accuracy: an ambiguous word in the middle of a sentence can be correctly transcribed based on what came after it.

Real-time transcription receives audio as a continuous data stream. The model receives small segments — typically 100 to 500 milliseconds — and must transcribe them immediately, without knowing what comes next. It decides with the context of the past but without the future. The result is returned within 200 to 500 ms of the word being spoken — the user sees text almost in real time. But accuracy is systematically lower, especially at sentence ends, with specialist terminology, and with homophones, where future context helps distinguish the correct variant.
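The segmenting described above can be sketched in a few lines. This is an illustration only: the 200 ms frame size and the 16 kHz, 16-bit mono PCM format are assumptions for the example — a real streaming API defines its own audio requirements.

```python
# Sketch: slicing raw PCM audio into fixed-duration frames for streaming.
# Frame size and audio format are assumed values, not any specific API's.

SAMPLE_RATE = 16_000      # samples per second (assumed)
BYTES_PER_SAMPLE = 2      # 16-bit PCM (assumed)
FRAME_MS = 200            # frame duration in milliseconds (assumed)

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000

def frames(pcm: bytes):
    """Yield successive FRAME_MS-long chunks of raw PCM audio."""
    for start in range(0, len(pcm), FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]

# One second of audio splits into five 200 ms frames.
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = list(frames(one_second))
```

Each frame would be sent over the stream as soon as it is filled; the model never sees more than the frames delivered so far.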

Implementation complexity is the third difference. Batch processing is an HTTP POST request — file in, JSON response out. Implementation takes hours. Real-time requires a WebSocket connection, audio buffer management, processing of incremental responses, reconnect logic on dropouts, and state management between segments. Implementation takes days to weeks.


Four Dimensions of Comparison

Accuracy

Batch processing systematically achieves lower word error rates. Bidirectional context resolves ambiguities that a real-time model, deciding without future context, transcribes incorrectly more often.

Example: a doctor dictates "After administering furosemide the patient's condition stabilised." The real-time model at the word "furosemide" does not yet know what comes after — it hesitates and may select a phonetically similar drug. The batch model sees the whole sentence and the context "patient's condition stabilised" helps confirm the pharmacological term.

Relative WER difference: depending on conditions, a 10 to 30% lower error rate with the batch approach. On a thousand-word recording this can mean dozens of differently transcribed words.
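The "dozens of words" claim follows from simple arithmetic. The 15% baseline real-time WER below is an assumed figure chosen for illustration, not a measured value.

```python
# Back-of-envelope check: errors avoided by batch at a given relative
# WER improvement. The 15% real-time WER baseline is an assumption.

def fewer_errors(words: int, realtime_wer: float, relative_gain: float) -> int:
    """How many errors batch avoids versus real-time on `words` words."""
    batch_wer = realtime_wer * (1 - relative_gain)
    return round(words * (realtime_wer - batch_wer))

low = fewer_errors(1000, 0.15, 0.10)   # 10% relative improvement
high = fewer_errors(1000, 0.15, 0.30)  # 30% relative improvement
```

At a 15% baseline, a 10–30% relative improvement works out to roughly 15 to 45 fewer wrong words per thousand — dozens, as stated.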

Latency

Real-time: result within half a second of utterance. For live captions or voice as application input, this is a requirement, not a benefit.

Batch: wait time depends on recording length and processing system performance. Indicative value: 10 to 30% of recording length. A five-minute recording → result in 30 to 90 seconds. An hour-long recording → result in 6 to 18 minutes.
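The indicative 10–30% figure translates directly into a wait-time range; the function below just encodes that rule of thumb from the text, nothing provider-specific.

```python
# Indicative batch wait range: 10-30% of recording length, per the
# rule of thumb above. Actual times depend on the processing system.

def batch_wait_estimate(recording_seconds: float) -> tuple[float, float]:
    """Return the (optimistic, pessimistic) wait in seconds."""
    return recording_seconds * 0.10, recording_seconds * 0.30

five_min = batch_wait_estimate(5 * 60)    # five-minute recording
one_hour = batch_wait_estimate(60 * 60)   # hour-long recording
```

A five-minute recording yields a 30–90 second range, an hour-long one 6–18 minutes, matching the examples above.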

Cost

A real-time API maintains an open WebSocket connection for the entire duration of the recording — including silent pauses and breaks. You pay for connection time, not just active speech. The result: real-time processing is typically 2 to 5× more expensive per processed minute of speech than batch.

With batch processing you pay for actual minutes of audio content. Silence, pauses and breaks are not counted in the price.
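The billing difference can be made concrete with a small model. Both per-minute prices below are invented for illustration; real pricing varies by provider.

```python
# Billing model sketch: batch bills minutes of audio content,
# real-time bills minutes of open connection, pauses included.
# Both prices are assumed, illustrative values.

BATCH_PRICE_PER_MIN = 0.01      # per minute of audio content (assumed)
REALTIME_PRICE_PER_MIN = 0.03   # per minute of open connection (assumed)

def session_costs(speech_min: float, pause_min: float) -> tuple[float, float]:
    """Return (batch cost, real-time cost) for one session."""
    batch = speech_min * BATCH_PRICE_PER_MIN
    realtime = (speech_min + pause_min) * REALTIME_PRICE_PER_MIN
    return batch, realtime

batch_cost, realtime_cost = session_costs(speech_min=40, pause_min=20)
```

With 40 minutes of speech and 20 minutes of pauses, the connection-time billing makes real-time 4.5× more expensive per spoken minute here — inside the 2–5× range cited above.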

Technical Complexity

Batch processing: HTTP POST → wait → JSON. Any developer who has ever called a REST API can handle it.

Real-time: WebSocket is a different protocol from HTTP. Audio must arrive in the correct format and tempo. Partial results update progressively — the application must display incomplete text and replace it with final versions. On connection loss, reconnection is required without losing context. This is a technically demanding discipline.
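The partial-result handling described above is the part that trips up most implementations. The sketch below shows the state management in isolation, without the network layer; the message shape ({"type": ..., "text": ...}) is an assumption — real streaming APIs define their own schemas.

```python
# Sketch of incremental-response handling: partial hypotheses for the
# current segment overwrite each other until a final arrives, which is
# then frozen. The message schema here is an illustrative assumption.

class TranscriptView:
    def __init__(self):
        self.final_segments: list[str] = []  # text that will not change
        self.partial = ""                    # current revisable hypothesis

    def on_message(self, msg: dict) -> None:
        if msg["type"] == "partial":
            self.partial = msg["text"]       # replace, never append
        elif msg["type"] == "final":
            self.final_segments.append(msg["text"])
            self.partial = ""                # segment is now frozen

    def render(self) -> str:
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)

view = TranscriptView()
for m in [{"type": "partial", "text": "after admin"},
          {"type": "partial", "text": "after administering furo"},
          {"type": "final", "text": "After administering furosemide"},
          {"type": "partial", "text": "the patient"}]:
    view.on_message(m)
```

The display layer re-renders on every message; the user briefly sees the wrong hypothesis before the final replaces it — exactly the flicker live-caption viewers know.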


Comparison Overview

Dimension                  Batch Processing            Real-Time Transcription
-------------------------  --------------------------  --------------------------
Accuracy (WER)             Lower error rate            Higher error rate
Result available           After processing (minutes)  Within 500 ms of utterance
Cost per minute of speech  Lower                       Higher (2–5×)
Implementation complexity  Low                         High
Ensemble approach          Easily achievable           Difficult to achieve
Suitable for               Recordings, documents       Live events, interaction

When to Choose Which Approach

Real-time transcription makes sense when:

- latency is a hard requirement: live captions, voice as application input, interactive use
- an approximate transcript on screen now is worth more than a perfect one later

Batch processing makes sense when:

- accuracy takes priority: recordings, documents, archival records processed after the fact
- cost per minute matters — batch is typically 2 to 5× cheaper
- you want a simple implementation: one HTTP request instead of streaming infrastructure

Hybrid Approach

Meeting tools like Otter.ai or Microsoft Teams use a combination of both: live captions are real-time for orientation during the meeting, the post-meeting transcript is batch for a more accurate archival record. Each approach serves a different purpose — and in combination they complement each other.


Czech Transcription System operates in batch mode. You upload a file and receive a completed transcript processed across multiple transcription engines in parallel, whose results are merged into one final version. This ensemble approach — which is one reason for its higher accuracy compared to individual models — is not practically achievable in real-time processing. The result is a more accurate transcript at the cost of waiting. For live captions a different, real-time solution would be needed.


The technical side of batch processing — how recordings are split into parts for the API — is described in the article on chunking A14. How the ensemble approach and merging of multiple transcript variants works is explained in A13.

