
Transcription Confidence Score: What That Number Actually Tells You

A transcription model assigns each word a number from zero to one. The higher the number, the more certain the model is. But the model can be confident about a mistake, and it can hesitate about the right answer. This article explains how to read a confidence score with appropriate skepticism — and what it is actually useful for.


What Confidence Score Mathematically Expresses

A transcription model works with probabilities. For each time window in the recording, it calculates a probability distribution over its entire vocabulary. A softmax function normalizes these values to the range 0 to 1, so that the sum across all words is exactly 1.

The confidence score is the probability of the word the model selected as the most probable.

Practical example: for an audio segment corresponding to the word "London," the model might assign: London 0.94, Lunden 0.03, Londn 0.02, other 0.01. Confidence score = 0.94. This does not mean the model is right 94% of the time — it means that of all words in its vocabulary, London is currently the most probable variant with this score.
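The selection step can be sketched in a few lines of Python. The candidate words and raw scores below are invented for illustration — a real engine computes a distribution over a vocabulary of tens of thousands of entries:

```python
import math

def softmax(logits):
    """Normalize raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for candidate words in one audio segment.
candidates = ["London", "Lunden", "Londn", "<other>"]
logits = [6.2, 2.8, 2.4, 1.7]

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
word, confidence = candidates[best], probs[best]
# The confidence score is simply probs[best] — the probability mass
# assigned to the winning candidate, not a probability of being correct.
```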

Per-word and per-utterance confidence

Per-word confidence assigns a score to each word individually. It is the most detailed form and the most useful for identifying problematic areas. It is typically available in JSON exports from transcription services, where each word entry carries a confidence field.

Per-utterance (segment) confidence is a single score for an entire utterance or segment, typically the average or minimum of the per-word values within it.

An overall transcript score — the average of per-word confidence — is a misleading metric. A transcript with an average of 0.85 can hide ten words with confidence 0.3, where errors are concentrated. The average is an indicative measure; the per-word distribution is more informative.
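A minimal numeric illustration of why the average misleads, using invented per-word values:

```python
# Hypothetical per-word confidences: 90 solid words and 10 shaky ones.
confidences = [0.95] * 90 + [0.30] * 10

average = sum(confidences) / len(confidences)
suspect = [c for c in confidences if c < 0.65]

print(f"average = {average:.3f}")         # 0.885 — looks healthy on its own
print(f"suspect words = {len(suspect)}")  # the 10 places errors concentrate
```

The transcript-level average clears most reasonable quality bars while every one of the ten low-confidence words remains a likely error.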


Why High Confidence Does Not Mean a Correct Transcript

This point is critical and is often misunderstood. The model can be systematically confident about a wrong answer.

Model Calibration

A perfectly calibrated model would transcribe correctly in exactly 90% of cases for words with confidence 0.9. In practice, transcription models are calibrated to varying degrees — some systematically overestimate their certainty (are overconfident), others underestimate it.

Without knowledge of a specific model's calibration curve, the number 0.9 cannot be interpreted in absolute terms. It is a signal of relative certainty — higher is better — but not a guaranteed probability of correctness.
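Whether a model is overconfident can be estimated empirically: bucket words by confidence and compare the mean confidence with the observed accuracy in each bucket — a reliability diagram in tabular form. The sketch below assumes you have a reference transcript from which each word can be marked correct or not:

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Bucket predictions by confidence and compare mean confidence to
    observed accuracy per bucket. In a well-calibrated model the two
    values are close; mean confidence above accuracy means the model
    is overconfident in that range."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    report = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            report.append((round(mean_conf, 2), round(accuracy, 2), len(bucket)))
    return report
```

For example, ten words scored around 0.9 of which only eight are correct yield one bucket reporting mean confidence 0.9 against accuracy 0.8 — a sign of overconfidence in that range.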

The Blind Spot: Out-of-Vocabulary Words

The model does not know what it does not know. A specialized term or proper name that did not appear in training data is mapped by the model to the phonetically closest word in its vocabulary. And it does this with high confidence — because from its perspective, it selected the most probable option.

Example: the model transcribes the uncommon surname "Przybyszewski" as something phonetically similar with confidence 0.89. The error is factual, but the model does not know it. Within the model's framework, the transcript is the optimal answer to the audio input.

This blind spot is the most insidious cause of deceptively high confidence on incorrect transcripts.

Low Confidence on a Correct Transcript

An unusual name or term the model encounters for the first time receives low confidence — the model hesitates. But even while hesitating, the model may select the correct variant because it is phonetically closest. Low confidence ≠ incorrect transcript. It is a signal of model uncertainty, not a guarantee of error.

A strong accent or noise in the recording increases model uncertainty without direct correlation to error rate — the model may correctly transcribe even at low confidence if the correct word has no close phonetic competitor.


How to Use Confidence Score Practically

Prioritizing Edits

The most practical use: set a threshold (for example confidence < 0.65 or < 0.70) and flag words below this threshold as candidates for manual verification. The editor focuses on these locations instead of reading through the entire transcript.

Result: editing an hour-long recording is reduced from sequential reading of the full text to checking flagged locations — dozens or hundreds of words instead of thousands. Time saved depends on the confidence distribution in the given recording.

Note: the threshold must be set empirically for the specific recording type and specific model. A threshold of 0.65 on a recording from a professional studio will flag a different volume of words than the same threshold on a telephone call recording.
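A sketch of the flagging step, assuming the per-word JSON shape described above (the field names `word`, `start`, and `confidence` are common but vary between services):

```python
THRESHOLD = 0.65  # empirical; tune per recording type and per model

# Hypothetical per-word output in the shape many services export as JSON.
words = [
    {"word": "the",           "start": 0.00, "confidence": 0.98},
    {"word": "surname",       "start": 0.21, "confidence": 0.91},
    {"word": "Przybyszewski", "start": 0.58, "confidence": 0.42},
    {"word": "arrived",       "start": 1.30, "confidence": 0.88},
]

flagged = [w for w in words if w["confidence"] < THRESHOLD]
for w in flagged:
    # Timestamps let the editor jump straight to the audio position.
    print(f'{w["start"]:6.2f}s  {w["word"]}  ({w["confidence"]:.2f})')
```

The editor then plays only the flagged positions instead of re-listening to the full recording.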

Input for Model Combining

In the ensemble approach — where multiple models process the same recording — the confidence score serves as input for the decision algorithm. Weighted voting: the model with a systematically higher average confidence gets greater weight in voting. Per-word selection: for each word, the variant with the highest confidence across all models is selected.

Limitation: incomparability of confidence between models. Different engines scale values differently. Comparison requires normalization or calibration to a common scale — otherwise a model with more aggressive scaling will systematically dominate voting regardless of actual accuracy. See A13.
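A minimal sketch of per-word selection after a crude min-max normalization per engine. The engine outputs are invented, and a real pipeline would calibrate each engine on held-out data rather than normalize within a single short utterance:

```python
def minmax_normalize(scores):
    """Rescale one engine's confidences to [0, 1], removing the
    engine-specific scaling before cross-engine comparison."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Two hypothetical engines transcribing the same three words.
engine_a = {"words": ["London", "was", "foggy"], "conf": [0.94, 0.97, 0.60]}
engine_b = {"words": ["Lunden", "was", "foggy"], "conf": [0.80, 0.82, 0.78]}

na = minmax_normalize(engine_a["conf"])
nb = minmax_normalize(engine_b["conf"])

# Per-word selection: take the variant with the higher normalized score.
merged = [
    (wa if ca >= cb else wb)
    for wa, ca, wb, cb in zip(engine_a["words"], na, engine_b["words"], nb)
]
```

Without the normalization step, engine B's uniformly high raw scores would win "Lunden" over "London" purely because of scaling, not accuracy.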

Rough Quality Metric for the Recording

The average confidence score of the entire transcript serves as a rough proxy for recording quality: a low average suggests a recording with problems — noise, overlaps, accents, or unknown terminology. This information helps decide whether to pre-process the recording (denoise, normalize levels) or accept a higher volume of editing as a given.


What Confidence Score Does Not Tell You

Factual correctness of a statement: "Sea levels have fallen by 300 meters over the last hundred years." The model transcribes this with high confidence even though it is factually wrong. The model transcribes audio; it does not evaluate content.

Grammatical correctness in context: per-word confidence evaluates individual words, not their syntactic relationships. A sentence may be phonetically transcribed correctly yet grammatically nonsensical in a specific context.

Semantic meaning: correctly transcribed words in the wrong order or combination may be nonsensical. Confidence does not catch this.

Incomparability between models is practically important: confidence 0.9 from one engine does not equal confidence 0.9 from another. Each model has its own scaling and calibration. Comparing models by raw confidence values is methodologically flawed.


Sources


How confidence score feeds into combining results from multiple models is described in the ensemble approach overview A13. Overall accuracy metrics — WER and CER — complement confidence score as a system-level view A07.