WebSocket and Live Transcription in the Browser: How It Works
Live transcription directly in the browser without installing software is not magic — it is the result of combining three technologies: WebSocket for a persistent connection to the server, the MediaRecorder API for capturing audio from the microphone, and transcription models running in the background. This article explains how these parts work together and why live transcription always involves a trade-off between latency and accuracy.
HTTP Polling vs WebSocket — Why It Matters
How Classic HTTP Polling Works
Traditional web architecture operates on a request-response basis: the browser sends a request, the server responds, and the connection closes. For static pages, this approach is perfect. For real-time applications where data changes every few hundred milliseconds, it is problematic.
For a browser to follow transcription progress in a classic HTTP architecture, it must repeatedly ask the server: "Are there new results?" Each such request carries overhead: HTTP headers are typically 200 to 800 bytes, plus latency from TCP handshake and server-side processing. At a polling frequency of once every half second, that means hundreds of unnecessary requests per minute of transcription — and the user still waits for the response.
Long polling partially solves this problem: the server holds the connection open until it has data, then responds and closes the connection. The browser immediately opens a new one. Latency drops, but the architectural inefficiency remains — each new piece of information still requires a new connection.
How WebSocket Changes the Communication Approach
WebSocket (RFC 6455, published 2011) approaches the problem differently. It initiates communication as a standard HTTP request with an Upgrade: websocket header. Once the server responds with 101 Switching Protocols, the HTTP connection transforms into a persistent bidirectional channel.
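The upgrade exchange looks roughly like this (abbreviated; the key and accept values shown are the sample values from RFC 6455 itself):

```http
GET /transcribe HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After the 101 response, no further HTTP requests occur on this connection; both sides exchange framed messages directly.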
From this point, the server can send data to the client at any time — without waiting for a request from the browser. A WebSocket message carries overhead of only 2 to 10 bytes compared to hundreds of bytes for HTTP. The resulting latency drops to the minimum of the transmission path: typically 20 to 100 ms depending on the server's geographic distance. For real-time transcription, where every hundred milliseconds matters in the user experience, this is a fundamental difference.
How the Browser Captures Audio
getUserMedia — Accessing the Microphone
Microphone access in the browser is provided by a standard browser API described in the W3C Media Capture and Streams specification. The method navigator.mediaDevices.getUserMedia() returns a MediaStream object — a stream of audio data from the selected input device.
The browser displays a permission prompt to the user before granting access. Without explicit consent, audio cannot be captured; this behaviour is enforced by the specification and cannot be bypassed. A second technical requirement is HTTPS: browsers block microphone access on insecure pages. Chrome introduced this policy in version 47, and other browsers adopted it gradually. Running a transcription application without a valid TLS certificate means the microphone will not work at all.
If the user has multiple microphones or audio inputs connected, the API returns a list of available devices via enumerateDevices() and the user can choose. The web application can then save this selection for the next session.
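The device-selection flow above can be sketched as follows. The helper pickAudioInput and the saved-device logic are illustrative, not part of the browser API; only enumerateDevices() and getUserMedia() are standard calls.

```javascript
// Hypothetical helper: choose an audio input, preferring a previously
// saved deviceId and falling back to the first available microphone.
function pickAudioInput(devices, savedDeviceId) {
  const inputs = devices.filter((d) => d.kind === "audioinput");
  const saved = inputs.find((d) => d.deviceId === savedDeviceId);
  return saved ?? inputs[0] ?? null;
}

// Browser-side usage (requires HTTPS; prompts the user on first call).
async function openMicrophone(savedDeviceId) {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const device = pickAudioInput(devices, savedDeviceId);
  return navigator.mediaDevices.getUserMedia({
    audio: device ? { deviceId: { exact: device.deviceId } } : true,
  });
}
```

Note that enumerateDevices() returns full device labels only after the user has granted permission at least once, so a first-time user typically sees generic entries until the initial getUserMedia() call succeeds.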
MediaRecorder API — Recording in Chunks
MediaRecorder accepts a MediaStream and records it continuously, emitting the data in segments. The timeslice argument passed to start() determines the interval in milliseconds at which the ondataavailable event fires. Calling recorder.start(500) means a new chunk of audio data is delivered every 500 ms.
Each chunk is a Blob object with audio data in the configured MIME format. Modern browsers most commonly support audio/webm;codecs=opus — Opus is a lossy codec optimized for speech that significantly reduces data volume while maintaining intelligibility. However, transcription APIs do not always accept WebM/Opus; some require PCM or WAV. In that case, conversion is necessary, either on the browser side using AudioContext or on the server side before passing data to the model.
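When an API requires raw PCM, the browser-side conversion starts from the Float32 samples that the Web Audio API exposes (values in the range −1 to 1). A minimal sketch of the Float32-to-16-bit-PCM step:

```javascript
// Convert Float32 samples in [-1, 1] (Web Audio's native format) to
// 16-bit little-endian signed PCM, clamping out-of-range values.
function floatTo16BitPCM(float32Samples) {
  const view = new DataView(new ArrayBuffer(float32Samples.length * 2));
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // The negative range has 32768 steps, the positive range 32767.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return view.buffer;
}
```

The resulting ArrayBuffer can be sent directly as a binary WebSocket message; building a full WAV file would additionally require a 44-byte RIFF header.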
In the ondataavailable handler, the chunk is sent via WebSocket: ws.send(event.data), where event.data is the Blob with the captured audio. This single line connects audio capture with server transmission.
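The capture-to-send wiring can be sketched as a small function. Passing the recorder and socket in as arguments is a choice made here for illustration; in a real page they would be the MediaRecorder and WebSocket objects themselves.

```javascript
const OPEN = 1; // value of WebSocket.OPEN

// Sketch: forward each MediaRecorder chunk over an open WebSocket.
function streamChunks(recorder, socket, timesliceMs = 500) {
  recorder.ondataavailable = (event) => {
    // Skip empty chunks, and drop data if the socket is not open.
    if (event.data && event.data.size > 0 && socket.readyState === OPEN) {
      socket.send(event.data);
    }
  };
  recorder.start(timesliceMs); // fire ondataavailable every timesliceMs
}
```

Checking readyState before sending matters in practice: during a brief reconnection, silently dropping (or locally buffering) chunks is preferable to throwing on every send.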
AudioContext and Audio Pre-Processing
The Web Audio API offers deeper control over the audio signal before sending. AudioContext allows processing the audio buffer directly in JavaScript through nodes connected in a graph.
One of the most important steps is downsampling. Microphones capture audio typically at 44.1 or 48 kHz — frequencies optimized for music. Speech transcription needs only 16 kHz; higher frequencies add no recognizable information for ASR models but increase the volume of transmitted data. Downsampling before sending reduces data volume by 60 to 70% without affecting transcription quality. AudioWorkletNode (the modern replacement for the older ScriptProcessorNode) performs this pre-processing asynchronously without blocking the browser's main thread.
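The downsampling step can be illustrated with a naive sketch that averages groups of samples. This is a simplification: a production AudioWorklet would apply a proper low-pass filter before decimation to avoid aliasing.

```javascript
// Naive downsampler: average fixed-size groups of samples, assuming the
// output rate divides the input rate (48000 -> 16000 averages every 3).
// Real pipelines low-pass filter first; this only shows the mechanics.
function downsample(samples, inputRate, outputRate) {
  if (inputRate % outputRate !== 0) {
    throw new Error("inputRate must be a multiple of outputRate");
  }
  const factor = inputRate / outputRate;
  const out = new Float32Array(Math.floor(samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) sum += samples[i * factor + j];
    out[i] = sum / factor;
  }
  return out;
}
```

Going from 48 kHz to 16 kHz keeps one sample in three — the two-thirds data reduction mentioned above falls directly out of this ratio.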
The Journey of Sound from Microphone to Text
Sending Chunks via WebSocket
A chunk captured by the MediaRecorder API is sent as a binary WebSocket message. The server reads the incoming binary data and stores it in a buffer for the transcription model. Chunks that are too small — under 100 ms — cause problems: the model lacks sufficient context for correct word recognition at segment boundaries.
Alongside audio data, metadata can be transmitted as text WebSocket messages: information about the transcription language, session ID, custom glossary, or user context. The transcription model then uses this metadata for better recognition of specific terminology.
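Such metadata is typically sent as a JSON text message before the first audio chunk. The field names below are illustrative only; every transcription API defines its own schema.

```javascript
// Sketch: a configuration message sent as text before streaming audio.
// Field names ("type", "language", "sessionId", "glossary") are
// illustrative; check the schema of the API you actually use.
function buildConfigMessage({ language, sessionId, glossary = [] }) {
  return JSON.stringify({ type: "config", language, sessionId, glossary });
}
```

Usage: ws.send(buildConfigMessage({ language: "en", sessionId: "abc123", glossary: ["WebSocket", "Opus"] })) — the server can then bias recognition toward the supplied terms.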
The Transcription Model Processes Segments
The server passes the received audio buffer to the transcription model. Modern ASR APIs (Whisper, Deepgram, Soniox, and others) return two types of results: partial (interim, may still change) and final (confirmed after processing the complete segment). This distinction is crucial for correct UI display.
The so-called "word boundary problem" occurs when a word lies exactly on the boundary of two chunks: the first chunk captures the first syllable, the second chunk captures the rest. Without context, the model interprets each part independently and the result is incorrect. The solution is a short overlap: each new chunk begins 100 to 200 ms before the end of the previous one. The server thus always provides the model with context from both sides of the boundary.
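The overlap scheme can be sketched as a splitter over a sample buffer. The function below is an illustration of the idea, not any particular API's implementation.

```javascript
// Sketch: split a sample buffer into chunks where each new chunk starts
// `overlap` samples before the previous one ended, so the model sees
// context on both sides of every chunk boundary.
function chunkWithOverlap(samples, chunkSize, overlap) {
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < samples.length; start += step) {
    chunks.push(samples.slice(start, start + chunkSize));
    if (start + chunkSize >= samples.length) break;
  }
  return chunks;
}
```

At 16 kHz, a 200 ms overlap corresponds to 3,200 samples re-read from the previous chunk — a small price for correct words at the boundaries.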
Results Back to the Browser
The server sends results as JSON messages via WebSocket: {"type": "partial", "text": "hello how are", "timestamp": 3.45}. The browser processes the message in the ws.onmessage handler and updates the UI. Partial results are typically displayed with distinct formatting — italics or lighter text — so the user knows the text may still change. Once a final result arrives, it replaces the partial text and receives standard formatting.
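The partial/final distinction maps naturally onto a small piece of client state. The shape below (a committed string plus a pending string) is one possible design, sketched under the message format shown above.

```javascript
// Sketch: accumulate transcript state from server JSON messages.
// "partial" replaces the pending text; "final" commits it.
function createTranscript() {
  return { committed: "", pending: "" };
}

function applyMessage(state, message) {
  if (message.type === "partial") {
    state.pending = message.text;
  } else if (message.type === "final") {
    state.committed += (state.committed ? " " : "") + message.text;
    state.pending = "";
  }
  return state;
}
```

In ws.onmessage the browser would call applyMessage(state, JSON.parse(event.data)) and re-render, displaying state.pending in italics and state.committed with standard formatting.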
The resulting latency between speaking a word and its appearance on screen ranges from 300 to 800 ms. This is the sum of: audio chunk size + network transmission + model processing + result transmission back. Each of these steps contributes its share to the total response time.
Latency and Accuracy — an Unavoidable Trade-Off
How Chunk Size Affects Results
Audio chunk size is the most important parameter that the developer directly controls. A 250 ms chunk delivers low latency — the user sees text almost immediately — but the model has minimal context for recognition. Error rates rise especially for longer words and expressions at segment boundaries.
A 500 ms chunk is a good compromise for most applications: latency stays under one second and accuracy is significantly better. A 1,000 ms chunk pushes accuracy even higher — the model has sufficient context for robust recognition — but latency of 1 to 2 seconds can be noticeable during text dictation. For live subtitles, such latency is usually unacceptable; for dictating emails or documents, it is perfectly fine. Research shows that real-time ASR accuracy with chunks over 800 ms approaches batch processing results (Han et al., 2020).
Networks, Buffers, and Connection Stability
Unstable WiFi or mobile connections cause jitter: chunks arrive at the server out of order or delayed. The server must implement a reorder buffer — a queue that waits for chunks in the correct order before passing data to the model. Without this buffer, the model would process audio segments out of chronological order and results would be unusable.
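A minimal reorder buffer can be sketched as follows, assuming the client attaches a monotonically increasing sequence number to each chunk (that numbering is an assumption of this sketch, not something WebSocket provides).

```javascript
// Sketch: reorder buffer keyed by a per-chunk sequence number that the
// client attaches. Chunks are released only in order; early arrivals
// wait in `pending` until the gap before them is filled.
class ReorderBuffer {
  constructor() {
    this.next = 0;            // next sequence number to release
    this.pending = new Map(); // out-of-order chunks waiting
  }
  // Returns the chunks now releasable in order (possibly empty).
  push(seq, chunk) {
    this.pending.set(seq, chunk);
    const ready = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next));
      this.pending.delete(this.next);
      this.next++;
    }
    return ready;
  }
}
```

A production server would also evict chunks that wait too long, since an indefinitely missing sequence number would otherwise stall the stream forever.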
The WebSocket client in the browser should implement automatic reconnection on connection loss. The standard approach includes exponential backoff: first reconnection attempt after 1 second, second after 2 seconds, third after 4 seconds, and so on. An outage longer than 2 to 3 seconds causes loss of audio context; a quality implementation saves state and continues from the last confirmed final segment.
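The backoff schedule described above can be sketched directly. The cap of 30 seconds and the recursive reconnect loop are illustrative choices, not a standard.

```javascript
// Exponential backoff delay: 1 s, 2 s, 4 s, ... capped at maxMs so a
// long outage does not produce absurd waits.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Hypothetical browser-side reconnect loop: on close, schedule the next
// attempt after backoffDelayMs(attempt); on open, reset via onOpen.
function connectWithRetry(url, onOpen, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => onOpen(ws);
  ws.onclose = () => {
    setTimeout(
      () => connectWithRetry(url, onOpen, attempt + 1),
      backoffDelayMs(attempt)
    );
  };
  return ws;
}
```

On reconnect, the onOpen callback is the natural place to resend the session metadata and resume from the last confirmed final segment.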
Conclusion
Live transcription in the browser is architecturally clean but full of subtle details. WebSocket provides a low-latency connection, the MediaRecorder API captures audio in chunks, AudioContext pre-processes the signal, and the transcription model returns results in real time. Each of these steps introduces latency — and each can be optimized.
Understanding the technology behind live transcription helps with decision-making: chunk size, microphone choice, network conditions — all of these affect what you ultimately see on screen. Live transcription in the browser is accessible and reliable — if you know what to expect from it.
Sources:
- Fette, I., & Melnikov, A. (2011). The WebSocket Protocol. RFC 6455. IETF. https://datatracker.ietf.org/doc/html/rfc6455
- W3C. (2021). Media Capture and Streams. W3C Recommendation. https://www.w3.org/TR/mediacapture-streams/
- W3C. (2021). MediaStream Recording. W3C Working Draft. https://www.w3.org/TR/mediastream-recording/
- Han, K. J., et al. (2020). ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. Interspeech 2020. https://arxiv.org/abs/2005.03191
- W3C. (2021). Web Audio API. W3C Recommendation. https://www.w3.org/TR/webaudio/