
Call Centre Transcription: Specifics, Regulation, and What to Expect from Automation

A call centre generates hundreds to thousands of calls daily. Transcribing these recordings is not an optional add-on — for many industries it is a legal obligation, and for others it is a source of data that would otherwise remain locked in audio recordings. Telephone audio has its own technical specifics that influence the choice of transcription model and the entire processing architecture. This article explains the technical conditions, legal requirements, and what to realistically expect from automation.


Telephone Audio and Why It Is Harder to Transcribe

Most of today's transcription models are trained on recordings with a sampling rate of 16 kHz or higher, typically podcasts, lectures, and studio recordings. A telephone call over a classic landline (POTS, Plain Old Telephone Service) carries sound only in the 300-3,400 Hz band at a sampling rate of 8 kHz. The result is intelligible to the human ear but harder for a transcription model: it lacks the higher frequencies that help distinguish certain phonemes, such as the fricatives /s/ and /f/.

The G.711 codec, the standard for classic telephony, adds quantization noise of its own through 8-bit logarithmic companding (μ-law or A-law). VoIP calls tend to be more favourable: wideband VoIP transmits the 50-7,000 Hz band, which significantly improves conditions for automatic transcription.

Practical impact: with standard telephone quality, transcription error rate (WER — Word Error Rate) is typically several percentage points higher than for full-quality recordings. Models specialized for telephone audio or approaches combining multiple models partially compensate for this disadvantage.
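To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The function name and the simple lowercase, whitespace-based tokenization are assumptions for illustration; production evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("your" -> "you") and one deletion ("number") over 5 words.
print(word_error_rate("please confirm your account number",
                      "please confirm you account"))  # 0.4
```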

Stereo Recordings: Agent and Customer on Separate Channels

A major advantage of call centre recordings is their structure: standard practice is to save each speaker on a separate channel — agent on the left, customer on the right. The result is a stereo WAV file or two mono tracks.

For transcription, this makes diarization (determining who spoke when) trivial. Each channel is transcribed separately and the results are merged by timestamp, with no need for algorithmic separation of overlapping voices. If only a mono mix is available, diarization has to be performed algorithmically, which introduces additional speaker-attribution errors.
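The per-channel approach can be sketched in a few lines. The example assumes interleaved 16-bit PCM frames (for instance as returned by `readframes` in Python's standard `wave` module) with the agent on the left channel and the customer on the right; the `Segment` type and the function names are hypothetical, and the transcription step between splitting and merging is left out.

```python
import array
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the call
    end: float
    text: str


def split_stereo(frames: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM into (left, right) mono byte strings."""
    samples = array.array("h")
    samples.frombytes(frames)
    # Even indices are left-channel samples, odd indices are right-channel samples.
    return samples[0::2].tobytes(), samples[1::2].tobytes()


def merge_channels(agent: list[Segment], customer: list[Segment]) -> list[Segment]:
    """Interleave per-channel segments into one conversation ordered by start time."""
    return sorted(agent + customer, key=lambda s: (s.start, s.end))


agent = [Segment("agent", 0.0, 2.1, "Good morning, how can I help?"),
         Segment("agent", 6.0, 7.5, "Let me check that for you.")]
customer = [Segment("customer", 2.4, 5.8, "My invoice seems to be wrong.")]
for seg in merge_channels(agent, customer):
    print(f"[{seg.start:6.1f}] {seg.speaker}: {seg.text}")
```

Because each segment already carries its speaker label from the channel it came from, no speaker clustering is needed; overlapping speech simply appears as segments with overlapping time ranges.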

Volume as the Main Challenge

Large call centres process thousands of calls daily, with average call duration ranging from three to eight minutes. Manual transcription at this volume is economically unfeasible — automation here is not a luxury but a prerequisite for the entire process to function. The pipeline typically looks like this: the recording is completed, the recording platform sends a webhook, the transcription system processes the file, and the result is saved or forwarded to an analytics tool.
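As a rough illustration of that pipeline, the sketch below handles a single "recording completed" notification. The payload field names (`call_id`, `recording_url`) and the two injected callables are assumptions; real recording platforms define their own webhook schemas, and a production handler would also verify signatures, queue the work, and retry on failure.

```python
from typing import Callable


def handle_recording_webhook(
    payload: dict,
    transcribe: Callable[[str], str],
    store: Callable[[str, str], None],
) -> str:
    """Process one 'recording completed' notification and return the call ID."""
    call_id = payload["call_id"]              # hypothetical field name
    recording_url = payload["recording_url"]  # hypothetical field name
    transcript = transcribe(recording_url)    # hand the file to the transcription system
    store(call_id, transcript)                # save or forward to an analytics tool
    return call_id


# Usage with stand-in functions in place of real services.
archive: dict[str, str] = {}
handle_recording_webhook(
    {"call_id": "c-1001", "recording_url": "https://example.invalid/c-1001.wav"},
    transcribe=lambda url: "transcript of " + url,
    store=archive.__setitem__,
)
print(archive)
```

Passing the transcription and storage steps in as callables keeps the handler independent of any particular vendor and makes it easy to test.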


Legal Requirements: What Is Mandatory and Why

MiFID II for the Financial Sector

The MiFID II Directive (Markets in Financial Instruments Directive II) requires investment firms and banks to record and archive telephone calls related to investment advice and trading. Records must be retained for a minimum of five years, and for up to seven years where the competent authority requests it. Transcription itself is not explicitly mandated in every case, but in practice it greatly simplifies audits: searching text is orders of magnitude faster than listening to hundreds of hours of recordings.

Every transcript must be accompanied by metadata: date and time of the call, caller and callee numbers, advisor identification. Without this data, archiving for MiFID II purposes is incomplete.
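One way to enforce this is to store the transcript together with its metadata in a single record and validate it before archiving. The sketch below is illustrative only: the class and field names are hypothetical, the requirement for a timezone-aware timestamp is an added assumption, and the default five-year retention period should be replaced by whatever the applicable regulation and supervisor require.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CallRecord:
    started_at: datetime  # date and time of the call
    caller_number: str
    callee_number: str
    advisor_id: str
    transcript: str

    def missing_metadata(self) -> list[str]:
        """Return the names of required metadata fields that are empty."""
        required = ("caller_number", "callee_number", "advisor_id")
        missing = [name for name in required if not getattr(self, name).strip()]
        if self.started_at.tzinfo is None:  # assumption: archive needs unambiguous times
            missing.append("started_at (no timezone)")
        return missing

    def retain_until(self, years: int = 5) -> datetime:
        """Earliest deletion date under an assumed retention period in years."""
        try:
            return self.started_at.replace(year=self.started_at.year + years)
        except ValueError:  # call recorded on 29 February, target year is not a leap year
            return self.started_at.replace(year=self.started_at.year + years, day=28)


record = CallRecord(
    started_at=datetime(2024, 3, 14, 9, 30, tzinfo=timezone.utc),
    caller_number="+420123456789",
    callee_number="+420987654321",
    advisor_id="",
    transcript="...",
)
print(record.missing_metadata())  # ['advisor_id']
print(record.retain_until())      # 2029-03-14 09:30:00+00:00
```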

GDPR and Call Recordings

A customer's voice recording constitutes personal data. Voice as a biometric identifier may under certain conditions fall into the special category of personal data under Article 9 of GDPR — this interpretation remains subject to expert debate and depends on the specific processing method.

The customer must be informed about the recording before the call begins. If the customer refuses recording but a legal recording obligation exists (for example, with financial services), sector regulation generally takes precedence. Recordings may be retained only for the period necessary for the given purpose — after that they must be deleted or anonymized.

PCI-DSS and Payment Data Masking

If a customer communicates a payment card number or PIN by telephone, PCI-DSS (Payment Card Industry Data Security Standard) requirements apply. In practice, this means the system must not store these numbers in readable form. A common approach is DTMF masking: the customer enters the digits on the keypad instead of speaking them, and the tones are masked so that they do not appear in the call recording.

In the transcript, automatic redaction is required: the system detects card number and identification number patterns and replaces them with placeholder symbols. Some transcription platforms offer this feature natively; an alternative is regex-based filters combined with named entity recognition.
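A minimal sketch of the regex-based variant follows. It looks for runs of 13 to 19 digits (optionally separated by spaces or hyphens) and redacts only those that pass the Luhn checksum, which reduces false positives on order or phone numbers. The pattern, the placeholder, and the function names are assumptions; the sketch does not handle digits that the transcription engine writes out as words ("four one one one"), which is where the named entity recognition mentioned above would come in.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text: str, placeholder: str = "[CARD]") -> str:
    """Replace digit runs that look like card numbers and pass the Luhn check."""
    def replace(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test card number that passes Luhn.
print(redact_card_numbers("Sure, it is 4111 1111 1111 1111, expiring next year."))
# Sure, it is [CARD], expiring next year.
```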


What to Extract from Transcripts

Compliance is one reason for call transcription. Analytical value is the second, and for many call centres the more important one.

QA Audit Without Listening

Traditional quality control involves randomly selecting calls and having a supervisor listen to them. With transcripts, the entire process changes: instead of listening, text can be searched, filtered, and scored automatically.

Specific examples: automatic scoring of whether the agent used the required greeting; detection of calls where words like "complaint" or "lawyer" appeared — candidates for escalation; measuring script compliance in percentages. This analysis is possible for one hundred percent of calls, not just a random sample.
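The three checks above can be sketched with plain string matching. The function names, the escalation term list, and the exact-phrase matching are assumptions for illustration; a real QA system would typically tolerate paraphrases, inflected forms, and transcription errors.

```python
import re

ESCALATION_TERMS = ("complaint", "lawyer")  # the examples from the text


def normalise(text: str) -> str:
    """Lowercase and strip punctuation so matching tolerates transcript formatting."""
    return re.sub(r"[^\w\s]", "", text.lower())


def used_greeting(agent_turns: list[str], greeting: str) -> bool:
    """Check whether the agent's first turn contains the required greeting."""
    return bool(agent_turns) and normalise(greeting) in normalise(agent_turns[0])


def escalation_hits(customer_turns: list[str]) -> list[str]:
    """Return the escalation terms that appear as whole words in customer speech."""
    words = set(normalise(" ".join(customer_turns)).split())
    return [term for term in ESCALATION_TERMS if term in words]


def script_compliance(agent_turns: list[str], required_phrases: list[str]) -> float:
    """Percentage of required script phrases found anywhere in the agent's speech."""
    if not required_phrases:
        return 100.0
    spoken = normalise(" ".join(agent_turns))
    found = sum(1 for phrase in required_phrases if normalise(phrase) in spoken)
    return 100.0 * found / len(required_phrases)


agent = ["Good morning, thank you for calling Acme.",
         "Is there anything else I can help with?"]
customer = ["I want to file a complaint!"]
print(used_greeting(agent, "thank you for calling"))                      # True
print(escalation_hits(customer))                                          # ['complaint']
print(script_compliance(agent, ["thank you for calling", "anything else"]))  # 100.0
```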

Sentiment and Frustration Detection

Every customer sentence can be automatically rated as positive, neutral, or negative. If a customer repeats the same thing three times in a row, or if their sentences contain expressions of dissatisfaction, the system can automatically flag the call. Comparing sentiment at the beginning and end of a call provides information about whether the agent resolved the situation.
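Both signals can be sketched without committing to a particular sentiment model. The example below assumes that per-sentence sentiment scores in the range -1 to 1 are already available from some model (not shown), detects repetition by fuzzy string similarity using Python's standard `difflib`, and compares the average sentiment at the start and end of the call. The function names, the similarity threshold of 0.8, and the window size are arbitrary illustrative choices.

```python
import re
from difflib import SequenceMatcher


def _normalise(sentence: str) -> str:
    return re.sub(r"[^\w\s]", "", sentence.lower()).strip()


def has_repeated_request(sentences: list[str], times: int = 3,
                         similarity: float = 0.8) -> bool:
    """True if `times` consecutive customer sentences are near-identical."""
    run = 1
    for previous, current in zip(sentences, sentences[1:]):
        ratio = SequenceMatcher(None, _normalise(previous), _normalise(current)).ratio()
        run = run + 1 if ratio >= similarity else 1
        if run >= times:
            return True
    return False


def sentiment_shift(scores: list[float], window: int = 3) -> float:
    """Mean sentiment of the last `window` sentences minus that of the first `window`."""
    if not scores:
        return 0.0
    head, tail = scores[:window], scores[-window:]
    return sum(tail) / len(tail) - sum(head) / len(head)


customer = ["I want a refund.", "I want a refund", "I want a refund!"]
print(has_repeated_request(customer))                           # True
print(sentiment_shift([-1.0, -1.0, 0.0, 1.0, 1.0], window=2))   # 2.0
```

A positive shift suggests the call ended on a better note than it began; a negative shift combined with a repetition flag is a reasonable candidate for supervisor review.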

For some languages, available sentiment models are less accurate than for English — this limitation is important to mention when setting expectations.

Call Summary and CRM Integration

After each call, the agent must write a note in the CRM system. This task takes on average two to five minutes. With automatic transcription and LLM post-processing, a call summary can be generated automatically: five sentences describing the call topic, the customer's problem, and the outcome — saved to the contact record without agent intervention.
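The post-processing step amounts to formatting the transcript into a prompt and passing it to a language model. The sketch below keeps the model behind a plain callable, so no specific vendor API is implied; the function names and the prompt wording are assumptions, and the stand-in `fake_llm` exists only to make the example runnable.

```python
from typing import Callable


def build_summary_prompt(turns: list[tuple[str, str]], max_sentences: int = 5) -> str:
    """Format (speaker, text) turns into a summarisation prompt for an LLM."""
    dialogue = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return (
        f"Summarise the following call in at most {max_sentences} sentences. "
        "Cover the call topic, the customer's problem, and the outcome.\n\n"
        f"{dialogue}"
    )


def summarise_call(turns: list[tuple[str, str]], llm: Callable[[str], str]) -> str:
    """Generate a CRM note; `llm` is any callable that maps a prompt to text."""
    return llm(build_summary_prompt(turns)).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only for demonstration.
    return " Customer reported a wrong invoice; agent issued a corrected one. "


turns = [("agent", "How can I help?"), ("customer", "My invoice is wrong.")]
print(summarise_call(turns, fake_llm))
```

In practice the generated note is best treated as a draft: a quick confirmation by the agent costs far less time than writing the note from scratch.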


Implementation: Model Selection and Architecture

Transcription Models for Telephony

Not every model performs equally well on telephone quality. Models trained on wideband audio have higher error rates on 8 kHz telephony. For call centres, suitable candidates include models specifically optimized for telephony audio and multi-model ensemble approaches.
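A simple way to act on this is to inspect the recording's sampling rate and route narrowband audio to a telephony-optimized model. The sketch below reads the rate from a WAV header with Python's standard `wave` module; the model names are hypothetical placeholders, and the code assumes uncompressed PCM WAV input (the `wave` module does not read G.711-encoded WAV files).

```python
import io
import wave


def choose_model(wav_bytes: bytes) -> str:
    """Pick a (hypothetical) model name from the WAV header's sampling rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
    # Below 16 kHz the audio is narrowband telephony; names are placeholders.
    return "telephony-8k" if rate < 16000 else "wideband-16k"


def make_silent_wav(rate: int, seconds: float = 0.01) -> bytes:
    """Build a short silent mono 16-bit WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


print(choose_model(make_silent_wav(8000)))   # telephony-8k
print(choose_model(make_silent_wav(16000)))  # wideband-16k
```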

The ensemble approach combines the outputs of multiple models and then selects the most reliable result, which compensates for the weaknesses of individual models, especially on difficult recordings. In a multi-model transcription system, several engines transcribe the same audio and a merging layer selects the best hypothesis for each segment. For call centres with batch processing, webhook notifications enable automatic forwarding of results to downstream systems.
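The text does not specify how the merging layer decides, so the following is only one possible selection rule: for each segment, pick the hypothesis that agrees most with the others, measured by word-level similarity from Python's standard `difflib`. The function names are hypothetical, and real systems may instead (or additionally) use engine confidence scores or word-level voting.

```python
from difflib import SequenceMatcher


def pick_by_agreement(hypotheses: list[str]) -> str:
    """Return the hypothesis that is, on average, most similar to the others."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if len(hypotheses) == 1:
        return hypotheses[0]

    def agreement(index: int) -> float:
        candidate = hypotheses[index].lower().split()
        others = [h.lower().split() for i, h in enumerate(hypotheses) if i != index]
        return sum(SequenceMatcher(None, candidate, o).ratio() for o in others) / len(others)

    # On ties, max() keeps the first hypothesis in the list.
    best = max(range(len(hypotheses)), key=agreement)
    return hypotheses[best]


def merge_segments(per_segment_hypotheses: list[list[str]]) -> str:
    """Select one hypothesis per segment and join the winners into a transcript."""
    return " ".join(pick_by_agreement(h) for h in per_segment_hypotheses)


segments = [
    ["your account number please", "your account number please",
     "you're a count number please"],
    ["thank you", "thank you", "tank you"],
]
print(merge_segments(segments))  # your account number please thank you
```

The intuition is that independent engines rarely make the same mistake on the same words, so the output that the majority resembles is more likely to be correct.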

Real-Time vs. Post-Call Transcription

For real-time agent assistance (displaying the customer's words as text on the agent's screen, suggesting responses), streaming transcription with sub-second latency is required. The architecture is more complex, and accuracy has to be traded off against speed.

For QA audits, compliance archiving, and analytics, post-call transcription is sufficient: the recording is processed after the call ends, accuracy is higher, and the architecture is simpler. Most call centres start here.


What to Realistically Expect from Automation

Automated call centre transcription handles routine processing of call volumes that would be impossible manually. Accuracy on standard telephone quality ranges from 85 to 93% of words correctly recognized, depending on the model, recording quality, and presence of noise. Proper nouns, product codes, and company-specific technical terminology are areas where accuracy drops — and where investing in model customization with a company glossary makes sense.

Manual review will not disappear entirely: legal material, escalated calls, and transcripts with low confidence scores are worth human review. But well-configured automation significantly reduces this volume and focuses the remaining reviews where they truly matter.
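The routing rule described above can be sketched as a single predicate. The `Transcript` type, its field names, and the 0.85 threshold are assumptions for illustration; confidence scores are not calibrated the same way across engines, so a threshold has to be tuned on a sample of manually checked calls.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    call_id: str
    word_confidences: list[float]  # per-word scores in [0, 1], as reported by the engine
    escalated: bool = False
    legal_hold: bool = False


def needs_human_review(t: Transcript, min_mean_confidence: float = 0.85) -> bool:
    """Route legal material, escalated calls, and low-confidence transcripts to a reviewer."""
    if t.legal_hold or t.escalated:
        return True
    if not t.word_confidences:
        return True  # nothing was recognised, which is itself suspicious
    mean = sum(t.word_confidences) / len(t.word_confidences)
    return mean < min_mean_confidence


batch = [
    Transcript("c-1", [0.97, 0.95, 0.99]),
    Transcript("c-2", [0.60, 0.72, 0.91]),
    Transcript("c-3", [0.98, 0.99], escalated=True),
]
print([t.call_id for t in batch if needs_human_review(t)])  # ['c-2', 'c-3']
```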

