Automatic Transcription in Customer Research: Analysing Interviews at Scale
Customer research is built on interviews — but fifty interviews of forty-five minutes each represent over thirty-seven hours of recordings that no one can manually transcribe and systematically analyse within a realistic timeframe. This arithmetic forces research teams to make compromises: analyse only a subset of data, rely on notes instead of transcripts, or postpone the project indefinitely. Automatic transcription changes this equation. This article describes how transcription fits into the research process from recording to synthesis of findings, what to realistically expect from it, and how to integrate it with tools for coding and qualitative data analysis.
Why Data Volume Is a Problem in Customer Research
Customer and UX research produces audio and video recordings in volumes that grow with each new product iteration or research project. In-depth interviews typically last forty-five to ninety minutes and cover one respondent; usability tests add screen recordings; focus groups involve four to eight speakers at once for an hour or longer. Diary studies generate shorter but more numerous recordings. Customer service conversations arrive in volumes that render any manual approach unrealistic.
An experienced transcriber processes one hour of recording in four to six hours of work. Using conservative estimates: fifty interviews of forty-five minutes yield thirty-seven and a half hours of material. Manual transcription would take one hundred fifty to two hundred twenty-five hours — four to six weeks of full-time work devoted exclusively to transcribing, not analysing. And that assumes the transcriber works efficiently, without interruption, and with sufficient knowledge of the subject context.
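The arithmetic above can be reproduced in a few lines, using the article's own figures (50 interviews, 45 minutes each, 4 to 6 transcriber-hours per hour of recording, a 40-hour work week):

```python
# Back-of-envelope estimate of manual transcription effort,
# using the figures stated in the text.
interviews = 50
minutes_each = 45
hours_of_audio = interviews * minutes_each / 60   # 37.5 hours of recordings
manual_low = hours_of_audio * 4                   # 150 transcriber-hours
manual_high = hours_of_audio * 6                  # 225 transcriber-hours
weeks_low = manual_low / 40                       # full-time weeks, low estimate
weeks_high = manual_high / 40                     # full-time weeks, high estimate

print(f"{hours_of_audio} h of audio -> {manual_low:.0f}-{manual_high:.0f} h "
      f"of transcription ({weeks_low:.1f}-{weeks_high:.1f} full-time weeks)")
```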
The consequences are visible in practice: research teams analyse only a sample of data, hoping the sample is representative. Analysis proceeds from interview notes rather than precise quotations. Results lose verifiability — a quote in the final report is not a verbatim transcript but a paraphrase from memory or a notebook. These compromises are understandable but methodologically problematic.
How Automatic Transcription Changes the Research Process
Batch processing of recordings changes the timeframe from weeks to hours. Fifty recordings sent to a transcription system return as transcribed text within an hour or two — depending on material length and system capacity. A researcher who previously waited for transcription can begin coding the same day the last interview took place.
This change in tempo is not cosmetic. Customer research typically has a concrete deadline: the product team is waiting for results to decide on the next sprint or priorities. Every week spent transcribing is a week when analysis is not happening. Faster transcription shifts the critical path of the research project, so the researcher spends time on what delivers value — coding, interpreting, synthesising — not on mechanical transcription.
However, automatic transcription is data preparation, not data analysis. The researcher still must read the transcript, understand context, code themes, and interpret findings. Transcription shortens the time from recording to coding but does not change the methodology or difficulty of the analytical work itself.
Speaker Diarization — A Prerequisite for Working with Quotations
For research interviews, speaker diarization is practically a prerequisite for a usable transcript. Diarization assigns each utterance to a specific speaker — in an in-depth interview, this means a clear distinction between interviewer and respondent; in a focus group, tracking the dynamics among multiple participants.
Without diarization, the transcript is a block of text without attribution. The researcher can still read the transcript and code themes, but finding a specific quote — "What exactly did respondent P3 say about onboarding?" — becomes a laborious search through the entire document. Coding tools like Atlas.ti or NVivo allow filtering segments by speaker; without diarization, this function loses its purpose.
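With diarized segments, the "what did P3 say about onboarding?" question becomes a simple filter rather than a document-wide search. A minimal sketch, assuming a generic segment schema of speaker, start time, and text (not the export format of any specific tool):

```python
# Illustrative diarized segments; the schema is an assumption.
segments = [
    {"speaker": "interviewer", "start": 12.4, "text": "How was onboarding for you?"},
    {"speaker": "P3", "start": 18.9, "text": "The onboarding emails confused me."},
    {"speaker": "P3", "start": 31.2, "text": "After that it was fine."},
]

def quotes_by(speaker, term, segments):
    """Return (timestamp, text) for a speaker's utterances containing term."""
    return [(s["start"], s["text"])
            for s in segments
            if s["speaker"] == speaker and term.lower() in s["text"].lower()]

print(quotes_by("P3", "onboarding", segments))
# -> [(18.9, 'The onboarding emails confused me.')]
```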
Diarization accuracy decreases with the number of speakers and the degree of overlapping speech. An in-depth interview with two speakers is relatively straightforward for diarization; a focus group with eight participants who interrupt each other is significantly more challenging. Researchers should anticipate this imprecision and plan for a higher rate of correction with focus group material.
Transcription Accuracy as a Condition for Reliable Coding
Transcription accuracy directly affects how reliably respondents can be quoted. A general rule for research use: transcription with accuracy below ninety percent requires extensive correction before coding, because errors distort the meaning of statements. Accuracy above ninety-five percent allows direct coding with occasional correction when cross-referencing against the recording.
A specific problem in research interviews is specialised terminology. Product names, brand terms, internal jargon, or abbreviations from a particular industry belong to a vocabulary that transcription models trained on general language do not know and systematically mangle. The solution is a custom vocabulary imported into the transcription system before processing — a list of terms, names, and expressions specific to the given research context.
A multi-model ensemble approach — transcription through multiple engines with results merged into a single final version — can improve accuracy, and speaker diarization should be an integrated part of that processing rather than a separate step. For sensitive research projects where data must not leave the organisation, local transcription — processing on one's own machine without data transfer to a third party — is the appropriate choice.
Integration with Qualitative Analysis Tools
Research teams work with data in tools like Atlas.ti, NVivo, or Dovetail. All three accept text transcripts as a primary data source for coding.
Atlas.ti and NVivo allow importing a transcript and pairing it with an audio or video file. If the transcript contains timecodes — each sentence or segment marked with its position in the recording — the researcher can click through to the corresponding point in the recording during coding. This is a practical advantage for context verification: the transcript says "that doesn't work for us at all," but the researcher wants to hear the tone in which the respondent said it. Dovetail is designed for UX research and integrates transcription with notes, tags, and analytical templates directly in a single environment.
Transcript format matters. Plain text (TXT) is the most universal but loses structure. A transcript with timecodes in SRT or VTT format is usable for media linking. JSON export with structured segments (each segment containing text, start time, end time, speaker ID) is the richest format for programmatic processing or automated import into an analytical tool with interview metadata.
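The relationship between the JSON and VTT formats can be shown concretely. A minimal converter from structured segments to WebVTT, using the standard `<v speaker>` voice tag for attribution; the segment field names are an assumption mirroring the description above:

```python
import json

def to_vtt(segments):
    """Convert segments (text, start, end, speaker) to a WebVTT string."""
    def ts(seconds):
        # WebVTT timestamp: HH:MM:SS.mmm
        h, rest = divmod(seconds, 3600)
        m, s = divmod(rest, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    cues = ["WEBVTT", ""]
    for seg in segments:
        cues.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        cues.append(f"<v {seg['speaker']}>{seg['text']}")
        cues.append("")
    return "\n".join(cues)

segments = json.loads("""[
  {"speaker": "P3", "start": 18.9, "end": 22.1,
   "text": "The onboarding emails confused me."}
]""")
print(to_vtt(segments))
```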
Coding Themes from Transcripts
Coding is the heart of qualitative analysis: the researcher goes through the transcript and labels passages with relevant themes or codes. Deductive coding starts with codes predefined based on research questions; inductive coding lets codes emerge from reading the material itself.
Both methods require a good transcript. Errors in the transcript interrupt reading, force the researcher to look things up in the recording, and slow down the coding process. Irony or sarcasm captured as a direct statement leads to incorrect coding. Quotations pulled from an imperfect transcript cannot go into the final report without verification.
An accurate transcript with diarization enables searching across the entire interview corpus: how many times respondents mentioned a specific term, how different respondent segments reacted to a given question, where themes recur or where they are unique.
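A corpus-wide term count of the kind described above is straightforward once transcripts are clean text per respondent. A sketch under the assumption of a simple respondent-to-transcript mapping; real projects would load this from exported files:

```python
from collections import Counter
import re

# Illustrative mini-corpus: respondent ID -> transcript text (an assumption).
corpus = {
    "P1": "Pricing was clear. Onboarding took a week.",
    "P2": "We skipped onboarding entirely.",
    "P3": "Onboarding emails confused me. Onboarding felt endless.",
}

def mentions(term, corpus):
    """Count whole-word, case-insensitive mentions of term per respondent."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return Counter({rid: len(pattern.findall(text))
                    for rid, text in corpus.items()})

print(mentions("onboarding", corpus))
# -> Counter({'P3': 2, 'P1': 1, 'P2': 1})
```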
Practical Tips for Better Results
Transcript quality depends on recording quality. A good microphone, a quiet room, and sufficient distance between speakers and noise sources are the foundation. In conditions where this can be controlled — a lab usability test, a moderated interview — investing in recording quality is worthwhile.
Introducing speakers at the beginning of the recording facilitates both diarization and manual navigation: "We are beginning an interview with Martina; the interviewer is James." Structured interview guides with clearly separated topics lead to more predictable vocabulary and easier transcription and coding. A custom terminology glossary imported before transcription significantly reduces errors in product names and brand terms.
A quick review of the transcript before coding is worthwhile. Reading the entire text is not necessary — but a quick scan catches systematic errors such as an incorrectly diarized speaker or a recurring mangling of a specific term, which can then be fixed with find-and-replace.
Limits of Automatic Transcription in Research
Transcription cannot fix what the recording did not capture. The emotional tone of a response, hesitation before a sensitive question, the disconnect between verbal agreement and nonverbal signals — these are absent from the text. A researcher working exclusively with the transcript misses a layer of data that may be interpretively important. Practice therefore combines transcription with timestamps and returning to the recording at interesting passages — especially in sections where the researcher is uncertain during analysis.
Respondent data protection is a concrete question with automatic transcription. Research interviews contain sensitive information: attitudes, behaviours, personal experiences that respondents shared in the context of research consent. Sending this data to a cloud transcription server must comply with GDPR and with the consent conditions respondents signed. For sensitive projects, local transcription without data transfer to a third party is the right choice.
Conclusion
Automatic transcription changes customer research not by replacing the researcher's analytical judgment, but by removing the mechanical bottleneck that stood in the way of analytical work. Thirty-seven and a half hours of recordings processed in an afternoon instead of four weeks of manual transcription shifts the research project in time and enables analysis of the entire dataset, not just a sample. The prerequisite is transcription with sufficient accuracy, speaker diarization, and correct terminology — and acceptance that transcription is data preparation, not data interpretation.
Sources:
- Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101.
- MAXQDA (2023). Working with Transcripts in Qualitative Data Analysis. Methodological guide.
- Dovetail (2024). Research repository documentation — transcript import and tagging.
- NVivo (2023). Importing transcripts and linking to media files. QSR International documentation.
- GDPR — Regulation (EU) 2016/679, Art. 5–9 (processing of sensitive personal data).