Chunking Long Recordings: How to Split an Hour of Audio Without Losing Sentence Meaning

Transcription APIs have limits. An hour of audio cannot pass through as a single unit; the model cannot process that much at once. The file must be split into parts, and where those splits fall determines whether the final transcript reads smoothly or is full of artefacts. This article covers how chunking works, why boundaries matter, and how to tell when something has gone wrong.


Why Transcription Models Cannot Process an Entire Recording at Once

The limits are not on the user's side — they are structurally built into transcription APIs. The OpenAI Whisper API accepts a maximum of 25 MB per request. One hour of audio in MP3 at 128 kbps is approximately 58 MB — more than double this limit. Deepgram, AssemblyAI and other services work with time limits: typically 60 to 120 minutes of audio per call, depending on the plan.
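The size arithmetic is easy to sketch. The 25 MB figure is the Whisper request limit mentioned above; the helper function and its name are illustrative, not part of any API:

```python
# Back-of-the-envelope check: does a constant-bitrate MP3 fit under a 25 MB limit?
BITRATE_KBPS = 128   # typical constant-bitrate MP3
LIMIT_MB = 25        # e.g. the OpenAI Whisper API request limit

def mp3_size_mb(duration_s: float, bitrate_kbps: int = BITRATE_KBPS) -> float:
    """Approximate size of constant-bitrate audio in megabytes."""
    return duration_s * bitrate_kbps / 8 / 1000  # kbit/s -> kB/s -> MB

one_hour = mp3_size_mb(3600)                # ~57.6 MB, more than double the limit
chunks_needed = -(-one_hour // LIMIT_MB)    # ceiling division -> 3 chunks minimum
```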

The second factor is memory. The transformer models behind modern transcription load the entire input into memory at once to compute attention, whose cost grows quadratically with input length. An input that is too long exhausts memory or slows processing dramatically. The limits set by model makers are not arbitrary; they reflect real computational constraints.

The practical consequence is direct: any recording longer than the limit must be split into smaller parts before being sent for processing. This process is called chunking.


Where to Find the Right Boundaries

Splitting the file into parts is easy. The problem is where exactly the split happens.

Splitting in the middle of a sentence means the model transcribing the second chunk starts without the context of the preceding words. If the first chunk ends at "... and the results show that —" and the second chunk begins "— the methodology was flawed from the start", the model processing the second part does not know it is continuing the previous thought. It may transcribe phonetically correctly, but the sentence context is broken.

Silent Pauses as Natural Boundaries

The most reliable solution is to find boundaries where the speaker naturally interrupted their speech — in silent pauses. VAD (Voice Activity Detection) algorithms monitor the energy of the audio signal over time. Where energy drops below a threshold value for long enough, there is silence — a natural boundary in speech.

A silent pause corresponds to natural breaks: the end of a sentence, a topic transition, an in-breath. A transcript interrupted at a silent point is almost always smooth — no word is cut, context is preserved in the previous chunk.

The minimum silence duration for reliable splitting is typically 0.5 to 2 seconds. Shorter pauses are usually part of a sentence; longer ones signal genuine breaks.
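The energy-based detection described above can be sketched in a few lines. The threshold and frame length here are illustrative defaults, not values taken from any particular VAD library:

```python
import math

def find_silences(samples, rate, threshold=0.01, min_silence_s=0.5, frame_s=0.02):
    """Return (start_s, end_s) spans where frame RMS energy stays below threshold.

    `samples` are floats in [-1, 1]. Frames whose root-mean-square energy
    falls under `threshold` for at least `min_silence_s` count as silence,
    i.e. candidate chunk boundaries.
    """
    frame = max(1, int(rate * frame_s))
    silences, start = [], None
    n_frames = len(samples) // frame
    for i in range(n_frames):
        window = samples[i * frame:(i + 1) * frame]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        t = i * frame / rate
        if rms < threshold:
            if start is None:
                start = t          # silence begins
        elif start is not None:
            if t - start >= min_silence_s:
                silences.append((start, t))   # long enough to split on
            start = None
    if start is not None and n_frames * frame / rate - start >= min_silence_s:
        silences.append((start, n_frames * frame / rate))
    return silences
```

A real system would split the audio at the midpoint of each returned span, so neither chunk begins or ends on a spoken word.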

Overlap as a Safety Net

Fast conversation or a lecture without sufficient pauses may have no suitable silent points for reliable splitting. In that case a fallback approach is used: a fixed chunk length — for example 10 minutes — with overlap.

Overlap means the last N seconds of the previous chunk are repeated at the start of the next. The model thus sees the transition context and can correctly transcribe words that lie on the boundary. A typical overlap length is 5 to 15 seconds. The overlapping portion is removed from the merged result — in the final transcript each word appears only once.
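The fallback scheme, fixed-length chunks stepping back by the overlap, reduces to simple span arithmetic. A minimal sketch, with the 10-minute length and 10-second overlap as example values:

```python
def chunk_spans(total_s: float, chunk_s: float = 600.0, overlap_s: float = 10.0):
    """Return (start_s, end_s) spans: fixed-length chunks whose start steps
    back by `overlap_s`, so each chunk repeats the tail of the previous one."""
    spans, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s   # next chunk re-reads the last overlap_s seconds
    return spans

# One hour with 10-minute chunks and 10 s overlap:
# (0, 600), (590, 1190), (1180, 1780), ... until 3600 is covered
```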


How Chunks Are Processed and Reassembled

Once the chunks are ready, they are not processed one by one — they are processed in parallel. Each chunk is a separate API request that can be sent simultaneously with the others. Transcribing an hour-long recording split into six ten-minute chunks takes roughly the same time as transcribing a single ten-minute chunk — not six times as long.
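The fan-out can be sketched with a thread pool; `transcribe_chunk` is a placeholder for the real API call, not an actual client function:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_chunk(chunk_id: int) -> str:
    """Placeholder for one API request; a real version would upload the
    chunk's audio and return its transcript text."""
    return f"transcript of chunk {chunk_id}"

def transcribe_all(chunk_ids, max_workers=6):
    # executor.map preserves input order, so results come back in chunk
    # order even though the requests run concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_chunk, chunk_ids))
```

Because the calls are network-bound, threads suffice; total wall time is roughly one chunk's latency rather than the sum of all of them.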

After processing, the chunk transcripts are joined in order. Each chunk has a timestamp offset — the time in the original recording at which it begins. Timestamps in the chunk transcript are converted to absolute times corresponding to the original recording by adding this offset. Clicking a timestamp in the transcript then jumps the player to exactly the right point in the recording.
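The offset conversion is a single addition per segment. The dict shape used here ({"start", "end", "text"}) is a common transcription-API layout, assumed for illustration:

```python
def to_absolute(segments, chunk_offset_s: float):
    """Shift per-chunk timestamps so they refer to the original recording.

    `segments` carry times relative to the chunk start; adding the chunk's
    offset in the original recording makes them absolute.
    """
    return [
        {**seg,
         "start": seg["start"] + chunk_offset_s,
         "end": seg["end"] + chunk_offset_s}
        for seg in segments
    ]
```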

At chunk boundaries, the merging algorithm checks the transition: whether the last words of the previous chunk match the first words of the next, an artefact that overlap can produce. Any duplicated portion is removed.
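One simple way to do this check, sketched here as a longest-suffix/prefix word match (real systems may match fuzzily, since the two chunks can transcribe the overlap slightly differently):

```python
def merge_with_dedup(prev_words, next_words, max_overlap=30):
    """Join two word lists, dropping the longest run of words that the end
    of `prev_words` shares with the start of `next_words`."""
    limit = min(max_overlap, len(prev_words), len(next_words))
    for n in range(limit, 0, -1):          # try the longest match first
        if prev_words[-n:] == next_words[:n]:
            return prev_words + next_words[n:]
    return prev_words + next_words         # no overlap found: plain concatenation

merged = merge_with_dedup(
    "the results show that the methodology".split(),
    "the methodology was flawed".split(),
)
# "the methodology" appears only once in the merged result
```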

A language model as a merging layer can further improve transitions. It sees context from both sides of the boundary and selects the naturally flowing variant at points where chunk transcripts slightly diverge.


What Can Go Wrong and How to Spot It

Ideal chunking is invisible — the resulting transcript looks as if it were processed all at once. Poor chunking leaves traces in the result.

Doubled Words at the Boundary

If the overlap was not properly removed during merging, words repeat at the seam: "London London is the capital." Such duplicates are easy to spot by reading the transcript at the points where chunks meet.
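The visual check can also be automated with a trivial scan; a sketch:

```python
def doubled_words(text):
    """Return indices where the same word appears twice in a row,
    a typical symptom of an overlap that was not removed."""
    words = text.lower().split()
    return [i for i in range(1, len(words)) if words[i] == words[i - 1]]

doubled_words("London London is the capital")  # flags index 1
```

Legitimate repetitions exist ("had had", stuttered speech), so hits are candidates for review, not automatic errors.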

Cut-Off Sentence

If the chunk split in the middle of a sentence without overlap: a sentence without an ending, or a fragment without a beginning. The model transcribes what it hears — but context for correct interpretation is missing. The result may be phonetically correct but semantically broken.

Incorrect Timestamps

A time jump at a chunk boundary: the timestamp jumps from 10:00 back to 9:58 or forward to 10:15. This indicates an error in the offset calculation. It shows up when checking the transcript in the player: clicking on text does not lead to the correct point in the recording.
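Backward jumps in particular are machine-checkable: segment start times in a correct transcript never decrease. A minimal sketch:

```python
def timestamp_jumps(starts):
    """Return indices where a segment starts earlier than the one before it,
    which indicates an offset calculation error at a chunk boundary."""
    return [i for i in range(1, len(starts)) if starts[i] < starts[i - 1]]

timestamp_jumps([0.0, 300.0, 600.0, 598.0, 900.0])  # flags index 3
```

Forward gaps are harder to flag automatically, since silence removal can legitimately skip time; those still need the spot-check in the player.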

How to Verify Chunking Quality

Visual check: read the transcript around each chunk boundary, roughly every 10 minutes of recording. Transitions should flow like the rest of the text, without visible seams.

Timestamp check: click a marked time in the transcript and verify that the player jumps to the correct point. The timestamp at the start and end of the recording, and at least one in the middle — that is a sufficient spot-check.


Czech Transcription System handles chunking automatically. The user uploads a file of any length without manual splitting. Chunk boundaries are chosen by silence detection, so the transcript is never cut mid-sentence. Timestamps in the resulting transcript correspond to the original recording; an hour of audio is as easy to navigate as a one-minute clip.


If you are interested in the technical side of audio files before transcription, read about audio input formats in A09. For processing very long recordings across multiple models simultaneously, see the overview of result merging in A13.
