
Transcription Export Formats: TXT, SRT, VTT, JSON, CSV — What to Choose When

The transcription is done. Which format should you download? TXT is the simplest, but you lose timestamps and speaker information. JSON carries everything but requires processing. SRT works for subtitles but not for archiving. The choice of format determines what you can do with the transcript next, and what you lose for good.


Why Format Matters More Than It Seems

Formats are not just different wrappers for the same content. Each carries different data, and information missing from the file cannot be reconstructed.

The transcription model produces more than just text. Each word has a timestamp (when it was spoken), a confidence score (how certain the model is), and a speaker identifier (who said it). This information exists in the machine output — but not every export format preserves it.

Example: you download only TXT. Three months later you need to add subtitles to the video, which requires SRT with timestamps. The TXT contains no timestamps, so they cannot be added after the fact. You have to run the transcription again, which downloading SRT or JSON at the outset would have avoided.

Golden rule: download in the richest format your environment can handle. Simplifying is always possible. Adding missing data is not.


Five Formats and Their Uses

TXT — Plain Text

TXT is text and nothing else. No timestamps, no confidence scores, and no structured speaker identification; speakers, if marked at all, appear only as plain-text labels. The file opens in any text editor without special software.

Suitable for: reading the transcript, manual editing in a text editor or CMS, inserting into a document, show notes, simple readable archive.

Not suitable for: subtitles (timestamps missing), programmatic integration (structure missing), archiving with metadata for later processing.

SRT — Standard for Video Subtitles

SubRip Subtitle (SRT) is the most widely used subtitle format, accepted by virtually every video platform and player. Each segment has a sequence number, start and end timestamps, and text. The format is simple and text-based, with no additional metadata.


1
00:00:01,000 --> 00:00:04,200
Good afternoon, welcome to today's lecture.

2
00:00:04,800 --> 00:00:06,100
Thank you for the invitation.

Suitable for: video subtitles on YouTube, VLC, DaVinci Resolve, Premiere Pro; a rough, time-synchronised timeline of the transcript.

Not suitable for: archiving (transcript metadata, confidence score, speaker identification missing), text analysis (sequence numbers and timestamps complicate reading).
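
To make concrete what an SRT file does and does not carry, here is a minimal Python sketch that parses cues like the sample above. The function names (`parse_srt`, `to_seconds`) are illustrative and not part of any export tool; the parser assumes well-formed cues separated by blank lines.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
CUE_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four captured timestamp parts to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> list[dict]:
    """Return one dict per cue with start, end (seconds) and text."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = block.splitlines()
        if len(rows) < 2:
            continue
        # The timing line follows the sequence number when one is present.
        timing_idx = 1 if rows[0].strip().isdigit() else 0
        match = CUE_TIMING.search(rows[timing_idx])
        if not match:
            continue
        parts = match.groups()
        cues.append({
            "start": to_seconds(*parts[:4]),
            "end": to_seconds(*parts[4:]),
            "text": "\n".join(rows[timing_idx + 1:]),
        })
    return cues


SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,200
Good afternoon, welcome to today's lecture.

2
00:00:04,800 --> 00:00:06,100
Thank you for the invitation.
"""

cues = parse_srt(SAMPLE_SRT)
```

Each parsed cue has only three fields: start, end, and text. There is no slot for confidence or a structured speaker label, which is why SRT makes a poor archive.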

VTT — Web Standard

WebVTT (Web Video Text Tracks) is the format defined by W3C for web players. Its structure is similar to SRT, but additionally supports text formatting (bold, italic), comments, speaker identification in some implementations, and segment metadata.


WEBVTT

00:00:01.000 --> 00:00:04.200
Good afternoon, welcome to today's lecture.

Suitable for: web video playback (HTML5 <track> element), meeting web accessibility requirements (WCAG 2.1), use on YouTube as an alternative to SRT, custom web players.

Advantages over SRT: native browser support without plugins, richer formatting options, metadata.
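
Because the two formats are so close, an SRT file can usually be turned into basic WebVTT mechanically. The following Python sketch (the name `srt_to_vtt` is illustrative) adds the `WEBVTT` header, drops the sequence numbers, and switches the millisecond separator from a comma to a dot. It assumes plain SRT input without styling tags.

```python
import re

# Captures "HH:MM:SS" and "mmm" around the comma used by SRT.
SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT text to a basic WebVTT document."""
    lines = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = block.splitlines()
        # Sequence numbers are dropped; cue identifiers are optional in WebVTT.
        if rows and rows[0].strip().isdigit():
            rows = rows[1:]
        if not rows:
            continue
        # Only the timing line is rewritten, so commas in the text survive.
        rows[0] = SRT_TIME.sub(r"\1.\2", rows[0])
        lines.extend(rows)
        lines.append("")
    return "\n".join(lines)


srt_sample = (
    "1\n00:00:01,000 --> 00:00:04,200\n"
    "Good afternoon, welcome to today's lecture.\n\n"
    "2\n00:00:04,800 --> 00:00:06,100\n"
    "Thank you for the invitation.\n"
)
vtt_output = srt_to_vtt(srt_sample)
```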

JSON — Complete Data

JSON carries everything the transcription model produces: words, word-level timestamps, confidence score, speaker identifiers, transcript metadata (language, model, recording length, processing parameters).


{
  "words": [
    {"text": "Good", "start": 1.0, "end": 1.2, "confidence": 0.98, "speaker": "SPEAKER_1"},
    {"text": "afternoon", "start": 1.2, "end": 1.6, "confidence": 0.99, "speaker": "SPEAKER_1"}
  ]
}

Suitable for: programmatic integration into applications, custom analytics tools, archiving with complete data for later processing, conversion to any other format, import into QDA tools (MAXQDA, ATLAS.ti) via custom script.

Not suitable for: manual reading or editing without a specialist tool (JSON is readable but impractical for direct work).
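
As an illustration of the processing that JSON makes possible, here is a short Python sketch working on the structure shown above. It assumes the field names from that sample (`words`, `text`, `start`, `end`, `confidence`, `speaker`); a real export may use a different schema, and the function names are made up for this example.

```python
import json
from collections import defaultdict


def low_confidence_words(transcript: dict, threshold: float = 0.8) -> list[dict]:
    """Return the words whose confidence falls below the threshold."""
    return [w for w in transcript["words"] if w["confidence"] < threshold]


def speaking_time(transcript: dict) -> dict[str, float]:
    """Sum word durations (in seconds) per speaker."""
    totals = defaultdict(float)
    for word in transcript["words"]:
        totals[word["speaker"]] += word["end"] - word["start"]
    return dict(totals)


# The sample from the text, loaded the same way a downloaded file would be.
sample = json.loads("""
{
  "words": [
    {"text": "Good", "start": 1.0, "end": 1.2, "confidence": 0.98, "speaker": "SPEAKER_1"},
    {"text": "afternoon", "start": 1.2, "end": 1.6, "confidence": 0.99, "speaker": "SPEAKER_1"}
  ]
}
""")

to_review = low_confidence_words(sample, threshold=0.985)
per_speaker = speaking_time(sample)
```

Neither question can be answered from TXT or SRT, because the per-word confidence and speaker fields are not in those files.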

CSV — For Spreadsheet Processing

CSV is a tabular format: rows correspond to transcript segments, columns carry individual attributes (start time, end time, text, speaker, confidence). The file opens in Excel or Google Sheets.

Suitable for: tabular analysis across transcripts, statistical processing (confidence distribution, segment length, speaker share), reports, database import.

Not suitable for: subtitles, direct reading of transcript text, archiving with hierarchical metadata.
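
The kind of statistic mentioned above, such as speaker share, can also be computed outside a spreadsheet. This Python sketch assumes a CSV with the column names `start`, `end`, `text`, `speaker`, and `confidence`; those names, the speaker labels, and the confidence values in the sample are invented for illustration, so check the header row of your own export.

```python
import csv
import io
from collections import Counter

# Illustrative data only: column names and values are assumptions.
CSV_TEXT = """start,end,text,speaker,confidence
1.0,4.2,"Good afternoon, welcome to today's lecture.",SPEAKER_1,0.97
4.8,6.1,Thank you for the invitation.,SPEAKER_2,0.97
"""


def speaker_share(csv_text: str) -> dict[str, float]:
    """Return each speaker's fraction of the total segment duration."""
    durations = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        durations[row["speaker"]] += float(row["end"]) - float(row["start"])
    total = sum(durations.values())
    if total == 0:
        return {}
    return {speaker: d / total for speaker, d in durations.items()}


shares = speaker_share(CSV_TEXT)
```

For a file on disk, open it with `newline=""` and pass the file object to `csv.DictReader` instead of the `io.StringIO` wrapper.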


Decision Overview

I need to...                           Format   Why
-------------------------------------  -------  ------------------------------
Read the transcript                    TXT      Simplest, opens everywhere
Video subtitles                        SRT      Most widely supported standard
Web subtitles                          VTT      HTML5 native standard
Integrate into application             JSON     Complete structured data
Analyse in a spreadsheet               CSV      Tabular format
Archive completely                     JSON     Everything in one place
Create other formats from one source   JSON     Source format for conversions

What Can and Cannot Be Converted

Converting from a richer format to a simpler one is always possible. In the other direction the data is missing and cannot be reconstructed.

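Always possible from JSON: JSON → TXT (structure is dropped, the text is kept), JSON → SRT (the timestamps are in the JSON), JSON → VTT (same as SRT), JSON → CSV (a tabular extract of the JSON data). These conversions discard detail by design, but nothing the target format needs is missing.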
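
A JSON → SRT conversion could be sketched in Python as follows. The grouping rule is an assumption made for this example, not how any particular exporter works: a new subtitle starts when the speaker changes, the pause exceeds `max_gap`, or the segment would exceed `max_duration`. Field names follow the JSON sample earlier in the article; the third word in the sample data is invented to show a speaker change.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_segments(words, max_gap=0.6, max_duration=5.0):
    """Group consecutive words into subtitle-sized segments."""
    segments = []
    for word in words:
        last = segments[-1] if segments else None
        if (
            last is not None
            and word["speaker"] == last["speaker"]
            and word["start"] - last["end"] <= max_gap
            and word["end"] - last["start"] <= max_duration
        ):
            # Same speaker, short pause, still short enough: extend the segment.
            last["text"] += " " + word["text"]
            last["end"] = word["end"]
        else:
            segments.append({
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
                "speaker": word["speaker"],
            })
    return segments


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text']}\n")
    return "\n".join(blocks)


words = [
    {"text": "Good", "start": 1.0, "end": 1.2, "confidence": 0.98, "speaker": "SPEAKER_1"},
    {"text": "afternoon", "start": 1.2, "end": 1.6, "confidence": 0.99, "speaker": "SPEAKER_1"},
    # Hypothetical extra word, added only to demonstrate the speaker split.
    {"text": "Thanks", "start": 2.0, "end": 2.4, "confidence": 0.95, "speaker": "SPEAKER_2"},
]
srt_output = segments_to_srt(words_to_segments(words))
```

The confidence values are simply ignored on the way out, which is the sense in which the conversion always works in this direction and never in reverse.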

Not possible without gaps (caution): TXT → JSON (timestamps are missing and cannot be added). SRT → JSON (confidence scores and metadata are missing and cannot be added). CSV → JSON (depends on the columns: if timestamps are present, a segment-level JSON can be built; if not, it cannot).

Practical consequence: JSON is the safest archival format. TXT, SRT, VTT, or CSV can always be generated from JSON. JSON cannot be reconstructed from TXT or SRT.


Czech Transcription System exports to TXT, JSON, SRT, CSV, and VTT. Archiving recommendation: JSON as the primary format, preserving complete results including per-word confidence, timestamps, and speaker information. For subtitles export SRT or VTT. For editing in a text editor export TXT.


How export formats serve subtitle creation is described in the transcription and subtitles overview A11. The web interface and API offer different options for configuring the export format A26.

