July 3, 2026 · By Piyush Sahoo

Voice transcription is the process of converting spoken audio into written text, and the technology that does it is called ASR (automatic speech recognition). It is the layer that lets a voice AI agent understand what a caller just said, lets a contact centre search a million recorded calls, and lets a meeting tool produce a readable summary. If you are building anything that listens to a phone call, ASR is the part that turns sound into something software can act on.

This guide goes well past the dictionary definition: how an ASR system actually works from raw audio to words, the difference between streaming and batch transcription, how accuracy is measured with Word Error Rate (WER), the features that separate a usable transcript from a wall of text, and the one factor most articles skip, why the quality of the underlying telephony audio decides how accurate ASR can ever be. That last point is where the infrastructure layer lives, and it is why a clean 24 kHz stream beats a muddy 8 kHz one before the model even runs.
Key takeaways
  • Voice transcription (ASR) is the conversion of speech audio into text by an automatic speech recognition model; it is the “ears” of any voice AI system.
  • A modern ASR pipeline is audio → feature extraction → acoustic + language modelling (or one end-to-end transformer) → decoding → text.
  • Streaming transcription returns words within a few hundred milliseconds for live voice AI; batch transcription processes a whole recording afterward for analytics.
  • Accuracy is measured by Word Error Rate (WER) — the percentage of words inserted, deleted, or substituted versus a reference.
  • The biggest accuracy lever is the audio itself: narrowband 8 kHz telephony, noise, and packet loss raise WER long before the model is the bottleneck. Legacy real-world telephony WER often runs 40–50%.
  • Vobiz is the transport and recording layer that delivers clean 24 kHz audio to your ASR — it powers the speech-to-text stack, it does not replace it.

What is voice transcription (ASR)?

Voice transcription is the conversion of spoken language into written text. Automatic speech recognition (ASR) is the machine-learning technology that performs that conversion without a human typist. Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methods enabling the recognition and translation of spoken language into text by computers.

The one distinction that matters: ASR is not the same as voice identification. ASR answers “what was said”; speaker recognition (or voice biometrics) answers “who said it.” A transcription system produces the words; a separate step called diarization labels which speaker produced which words. Conflating the two is the most common mistake in this space.

It also helps to be precise about where ASR sits. A full voice AI turn is a chain: ASR (speech to text) → an LLM that decides what to say → TTS (text to speech) that speaks the reply. Vobiz is the telephony infrastructure carrying audio in and out of that chain: it moves the media and records it cleanly. The ASR model itself is provided by engines like Deepgram, OpenAI Whisper, or Sarvam, and the agent logic by platforms like Vapi or Retell. Understanding ASR means understanding that one link in the chain.

How ASR works

A modern ASR system turns a stream of audio samples into a sequence of words in roughly four stages. Older systems split the job across separate models; newer ones fold most of it into a single neural network, but the conceptual stages are the same.

1. Audio capture and feature extraction

Sound enters as a waveform, a sequence of thousands of amplitude samples per second. Raw samples are noisy and high-dimensional, so the system first extracts features: compact numerical representations of the audio that emphasise the parts of the signal that carry speech. The classic representation is a set of Mel-frequency cepstral coefficients (MFCCs) or a mel spectrogram, computed over short overlapping windows (typically 20–25 ms long, advanced in steps of about 10 ms). This is the step where audio quality bites: if the signal arriving here is band-limited or noisy, the features are degraded, and no downstream model can recover information that was never captured.
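The windowing step can be sketched in a few lines. This is a minimal illustration rather than a full MFCC pipeline: it only slices the waveform into overlapping frames, and the 25 ms window and 10 ms hop are assumed values (common, but not universal). The function name `frame_signal` is hypothetical.

```python
def frame_signal(samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping fixed-length frames."""
    win = int(sample_rate * win_ms / 1000)  # samples per window
    hop = int(sample_rate * hop_ms / 1000)  # samples between window starts
    frames = []
    start = 0
    # Stop once a full window no longer fits (trailing samples are dropped).
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames


# One second of silence at narrowband and wideband rates.
narrow = frame_signal([0.0] * 8000, sample_rate=8000)
wide = frame_signal([0.0] * 16000, sample_rate=16000)

# Same number of frames per second, but each wideband frame holds
# twice as many samples for the feature extractor to work with.
print(len(narrow), len(narrow[0]))
print(len(wide), len(wide[0]))
```

A real front end would then apply a window function and a mel filterbank to each frame; those steps are omitted here.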

2. Acoustic modelling

The acoustic model maps those audio features to the smallest units of sound, called phonemes (a language like English has roughly 40). Acoustic models statistically represent the relationship between an audio signal and the phonemes that make up speech. Historically this was done with Hidden Markov Models (HMMs) combined with Gaussian mixtures; HMMs model speech as a sequence of states with probabilistic transitions, and they dominated ASR for decades. Today the acoustic model is almost always a deep neural network.

3. Language modelling

Acoustics alone are ambiguous: “recognize speech” and “wreck a nice beach” sound nearly identical. The language model resolves that ambiguity by scoring how likely a given word sequence is in the target language and domain. It is the difference between a phonetically plausible transcript and a grammatically sensible one, and it is why a model tuned on medical dictation transcribes drug names that a general model mangles.
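A toy sketch of how a language model breaks an acoustic tie. All probabilities below are invented for illustration, and `best_hypothesis` and `lm_weight` are hypothetical names; real systems score many more candidates, but the principle of adding a weighted language-model score to the acoustic score is the same.

```python
import math

# Hypothetical acoustic log-probabilities: the two candidates sound
# almost alike, and the wrong one is marginally ahead.
acoustic_logp = {
    "recognize speech": math.log(0.48),
    "wreck a nice beach": math.log(0.52),
}

# Hypothetical language-model log-probabilities: one phrase is far
# more likely to occur in ordinary text than the other.
lm_logp = {
    "recognize speech": math.log(1e-4),
    "wreck a nice beach": math.log(1e-7),
}


def best_hypothesis(acoustic, lm, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    return max(acoustic, key=lambda h: acoustic[h] + lm_weight * lm[h])


print(best_hypothesis(acoustic_logp, lm_logp, lm_weight=0.0))  # acoustics only
print(best_hypothesis(acoustic_logp, lm_logp, lm_weight=1.0))  # acoustics + LM
```

With the language model switched off the acoustically stronger but nonsensical phrase wins; with it switched on the sensible phrase wins.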

4. Decoding (and the end-to-end shift)

The decoder searches the combined acoustic and language scores for the most probable word sequence, emitting the final text. In classic pipelines the acoustic model, language model, and decoder are built and tuned separately. The big change since the mid-2010s is end-to-end ASR, where a single neural network learns the whole mapping from audio to text. Two architectures dominate: models trained with Connectionist Temporal Classification (CTC), which aligns input audio frames to output characters without requiring a pre-segmented transcript, and transformer / sequence-to-sequence models. OpenAI’s Whisper is a well-known example: an encoder-decoder transformer trained on 680,000 hours of multilingual, multitask supervised data collected from the web, which made robust multilingual transcription broadly available. End-to-end models fold the acoustic and language modelling into one network, which is why most production ASR engines today are a single model rather than a multi-stage stack.
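To make the CTC idea concrete, here is a minimal sketch of greedy (best-path) CTC decoding. It assumes the network has already produced one most-likely label per audio frame; the blank symbol `_` and the function name `ctc_greedy_decode` are illustrative choices. Production decoders usually use beam search, often with a language model, rather than this greedy rule.

```python
from itertools import groupby

BLANK = "_"  # CTC's special "no output" symbol (representation assumed here)


def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    # Step 1: merge runs of the same label ("hh" -> "h").
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks. A blank between two identical letters is what
    # lets a genuine double letter ("ll") survive step 1.
    return "".join(label for label in collapsed if label != BLANK)


# Twelve audio frames, one predicted label each.
print(ctc_greedy_decode(list("hh_e_ll_lo__")))
```

Because the alignment between frames and characters falls out of the collapse rule, the training data never needs frame-level labels, which is the property the paragraph above describes.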

Streaming vs batch transcription

There are two fundamentally different ways to run ASR, and choosing the wrong one breaks the use case. The split is about when you get the text.

Streaming (real-time) transcription processes audio continuously as it arrives and emits words within a few hundred milliseconds, often as “partial” hypotheses that get refined as more context arrives. This is non-negotiable for voice AI agents: the agent cannot answer a question it has not finished hearing, and a one-second transcription delay added to LLM and TTS latency makes the conversation feel broken. Streaming ASR is fed over a live bidirectional audio stream; for telephony that means a WebSocket media stream carrying the call audio to the ASR engine as it happens.

Batch (offline) transcription processes a complete recording after the fact. Because it can see the entire file, it can use more context and heavier models, so it is typically more accurate per word and is the right tool for post-call analytics, compliance archives, and search over recorded calls. Batch ASR runs against the audio file produced by call recording.
| Dimension | Streaming transcription | Batch transcription |
| --- | --- | --- |
| When text appears | While the person is speaking | After the call/recording ends |
| Latency | Hundreds of milliseconds | Seconds to minutes |
| Context window | Limited (must commit early) | Full recording |
| Typical accuracy | Slightly lower (less context) | Slightly higher |
| Audio source | Live WebSocket stream | Stored recording |
| Best for | Live voice agents, agent assist, captions | Analytics, QA, compliance, search |
| Interruptible (barge-in) | Yes | N/A |
A real stack often uses both: streaming ASR to drive the live agent, then batch ASR over the recording afterward for a more accurate transcript and analytics.
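The practical difference shows up in how audio is handed to the engine. The sketch below slices raw PCM into small frames, as a streaming sender would, whereas a batch job would submit the whole buffer at once. The 20 ms frame size and 16-bit mono format are assumptions for illustration, `pcm_chunks` is a hypothetical helper, and the actual WebSocket send is deliberately left out because it depends on the engine you use.

```python
def pcm_chunks(pcm, sample_rate=24000, bytes_per_sample=2, frame_ms=20):
    """Yield fixed-duration slices of a raw mono PCM byte buffer."""
    chunk_size = sample_rate * bytes_per_sample * frame_ms // 1000
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]


one_second = bytes(24000 * 2)  # 1 s of 24 kHz, 16-bit silence

# Streaming: many small frames, each sent as soon as it is available.
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))

# Batch: the same audio handed over in one piece after the call ends.
print(len(one_second))
```

Smaller frames lower the delay before the first partial hypothesis but increase per-message overhead, so the frame size is a tuning choice rather than a fixed rule.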

Measuring accuracy: Word Error Rate (WER)

The standard metric for ASR accuracy is Word Error Rate (WER). It compares the machine transcript against a human reference transcript and counts the edits needed to fix it. WER is derived from the Levenshtein distance at the word level and is computed as the sum of substitutions, deletions, and insertions divided by the number of words in the reference:
WER = (S + D + I) / N

S = substitutions   D = deletions   I = insertions   N = words in the reference
A WER of 10% means one word in ten is wrong; lower is better, and a perfect transcript is 0%. WER is widely used precisely because it is simple and comparable, but it has known limitations: it weights every word equally, so a missed “not” (which flips meaning) counts the same as a missed “the,” and it penalises harmless paraphrases. It is a proxy for usefulness, not usefulness itself — but it is the proxy everyone reports.
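As a worked example of the formula above, here is a minimal WER calculator built on word-level edit distance. It assumes whitespace tokenisation and exact string matching; real evaluations usually normalise case, punctuation, and number formatting first, which this sketch does not do.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                cur[j - 1] + 1,      # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Dropping "not" is a single deletion out of six reference words,
# even though it reverses the caller's intent.
print(wer("please do not cancel my order", "please do cancel my order"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.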

What drives WER up

Most of the things that wreck transcription accuracy are not the model. They are the input:
  • Narrowband telephony audio. Traditional phone calls are sampled at 8 kHz, capturing only frequencies up to ~3.4 kHz. That throws away the high-frequency energy that distinguishes consonants like s, f, and th — exactly the sounds ASR confuses most. Wideband audio at 16 kHz or 24 kHz keeps that information.
  • Background noise and reverberation. Street noise, cross-talk, and echo bury the speech signal and inflate WER sharply.
  • Accents, dialects, and code-switching. Models trained mostly on one accent degrade on others; mixed-language speech (common in India) is harder still.
  • Packet loss and jitter. On a poor VoIP path, dropped or late audio packets create gaps the model fills with garbage.
  • Domain vocabulary. Names, drug names, SKUs, and jargon absent from training data get substituted with common words.
This is why legacy telephony ASR routinely sees real-world WER of 40–50% even when the same model scores in the single digits on clean studio audio. The model did not get worse — the audio did.

ASR features that matter

Raw transcribed words are rarely enough. The features below turn a transcript into something a human or an LLM can actually use.
| Feature | What it does | Why it matters |
| --- | --- | --- |
| Speaker diarization | Labels who spoke each segment (“Speaker 1 / Speaker 2”) | Separates agent from caller; partitions an audio stream into segments by speaker identity |
| Punctuation & casing | Adds full stops, commas, capitalisation | Makes transcripts readable and parseable by downstream LLMs |
| Word-level timestamps | Marks the start/end time of each word | Enables search, captions, and jumping to a moment in a recording |
| Custom vocabulary / boosting | Biases the model toward your terms | Recovers names, products, and jargon the base model misses |
| Confidence scores | Per-word probability the transcript is right | Lets you flag low-confidence spans for review |
| Number & entity formatting | “two thirty pm” → “2:30 PM” | Cleaner data for analytics and automation |
| Profanity / PII redaction | Masks sensitive content | Compliance for recorded and stored transcripts |
| Language identification | Detects and switches language mid-call | Handles multilingual and code-switched calls |
For voice agents, the input side of this lives in Gather speech input detection, where the platform collects caller speech and hands it to recognition.
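Two rows of the table above, word-level timestamps and confidence scores, combine naturally. The sketch below groups consecutive low-confidence words into time spans for human review. The word dictionary shape (`word`, `start`, `end`, `confidence`) and the 0.6 threshold are assumptions for illustration; each ASR engine has its own response format.

```python
def low_confidence_spans(words, threshold=0.6):
    """Merge runs of consecutive low-confidence words into reviewable spans."""
    spans = []
    current = None
    for w in words:
        if w["confidence"] < threshold:
            if current is None:
                # Start a new span at this word.
                current = {"start": w["start"], "end": w["end"], "text": w["word"]}
            else:
                # Extend the open span to cover this word too.
                current["end"] = w["end"]
                current["text"] += " " + w["word"]
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans


# Hypothetical engine output: a drug name split into two shaky words.
words = [
    {"word": "refill", "start": 0.0, "end": 0.4, "confidence": 0.95},
    {"word": "my", "start": 0.4, "end": 0.5, "confidence": 0.97},
    {"word": "met", "start": 0.5, "end": 0.8, "confidence": 0.41},
    {"word": "forming", "start": 0.8, "end": 1.2, "confidence": 0.38},
    {"word": "today", "start": 1.2, "end": 1.6, "confidence": 0.92},
]
print(low_confidence_spans(words))
```

The resulting spans carry start and end times, so a reviewer can jump straight to the doubtful moment in the recording.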

How telephony affects ASR accuracy

Here is the part most “what is ASR” articles miss, and it is the whole game for phone-based transcription: the ceiling on ASR accuracy is set by the audio, not the model. You can wire up the best speech-to-text engine in the world, and if you feed it a narrowband, compressed, packet-dropping phone stream, it will still produce a bad transcript. Garbage in, garbage out — literally, frame by frame. It is worth being precise about why a phone call is such hostile input for a speech model, because each stage of the telephony path removes or distorts information that ASR depends on, and the losses compound. Below are the six telephony factors that move WER the most, roughly in order of impact.

1. Narrowband sampling discards the consonant band

The legacy phone network is built around 8 kHz sampling. By the Nyquist–Shannon sampling theorem, an 8 kHz sample rate can represent frequencies only up to 4 kHz, and in practice the telephone channel is band-limited even further, to the classic 300–3,400 Hz voice band. That passband was chosen in the analog era to make speech intelligible to humans, not to a machine. The problem is that the acoustic cues distinguishing fricatives and sibilants (s, f, th, sh) carry most of their energy above 4 kHz. Narrowband telephony simply throws that energy away, so the model literally never receives the signal that separates “fifteen” from “sixteen,” or an s plural from a singular. This is the single largest pre-model accuracy lever. Delivering wideband 16 kHz or, better, 24 kHz audio preserves that band and gives the model materially more to work with.
Careful caveat: upsampling 8 kHz audio to 16 kHz before sending it to a wideband model does not recover the lost detail — the information was destroyed at capture. It can actually hurt, because the model now sees a wideband container with an empty top half. The fix is capturing and transporting wideband audio end to end, not resampling narrowband after the fact.
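A small numerical demonstration of the Nyquist limit, under simplified assumptions (pure tones, no anti-aliasing filter). Sampled at 8 kHz, a 6 kHz tone produces exactly the same sample values as a 2 kHz tone, so nothing in the sampled stream can tell them apart. The helper name `sample_tone` is hypothetical.

```python
import math


def sample_tone(freq_hz, sample_rate, n_samples):
    """Return n_samples of a cosine at freq_hz, sampled at sample_rate."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


SAMPLE_RATE = 8000
nyquist = SAMPLE_RATE / 2  # highest representable frequency: 4000 Hz

above = sample_tone(6000, SAMPLE_RATE, 16)  # above the Nyquist limit
alias = sample_tone(2000, SAMPLE_RATE, 16)  # its alias: 8000 - 6000 Hz

# The two sampled sequences match to within floating-point error.
identical = all(abs(a - b) < 1e-9 for a, b in zip(above, alias))
print(nyquist, identical)
```

In a real telephone path a low-pass filter removes content above the band edge before sampling, so the high-frequency energy is discarded rather than aliased. Either way it is absent from the samples, which is why resampling afterwards has nothing to restore.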

2. Lossy codecs and transcoding add quantisation noise

Phone audio is compressed for transport, and the codec leaves fingerprints the model has to see through. The PSTN standard, G.711, uses 8-bit logarithmic μ-law/A-law companding at 64 kbit/s — already a quantised approximation of the waveform. Lower-bitrate codecs (G.729 at 8 kbit/s, GSM) compress far harder and introduce spectral distortion that ASR models, often trained on clean PCM, are not used to. Worse is transcoding: a call that hops Opus → G.711 → Opus across carrier boundaries is re-compressed at each leg, and the losses are generational — every transcode degrades the signal again, like photocopying a photocopy. A path with the fewest codec hops and a high-fidelity codec preserves the most signal for the model.
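To show what logarithmic companding does to a sample, here is a sketch using the continuous μ-law formula with μ = 255 followed by a uniform 8-bit quantiser. This is an illustration of the principle only: actual G.711 uses a segmented, piecewise-linear approximation of this curve, so the code below is not bit-exact with a real codec. All function names are hypothetical.

```python
import math

MU = 255  # companding parameter used by mu-law


def mulaw_compress(x):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def quantise_8bit(y):
    """Round a value in [-1, 1] to one of 256 evenly spaced levels."""
    code = round((y + 1) / 2 * 255)
    return code / 255 * 2 - 1


def roundtrip(x):
    """Compress, quantise to 8 bits, and expand again."""
    return mulaw_expand(quantise_8bit(mulaw_compress(x)))


for sample in (0.01, 0.5):
    decoded = roundtrip(sample)
    print(sample, decoded, abs(decoded - sample))
```

The absolute error is smaller for quiet samples than for loud ones, which is the point of companding, but the error is never zero. Each additional transcode repeats a lossy step of this kind.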

3. Packet loss and jitter create gaps the model fills with garbage

Telephony media rides VoIP, and on a congested path packets arrive late (jitter) or not at all (loss). To avoid audible silence, endpoints run packet loss concealment (PLC), which synthesises plausible audio to bridge the gap. That synthetic fill is convincing enough for a human ear but is not real speech — and ASR transcribes it as phantom words or drops real ones. The jitter buffer that smooths playback also adds delay, which eats into the latency budget a live agent needs. A low-latency, single-hop media path minimises both loss and jitter, so the model sees continuous, in-order audio.
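One simple way to see how much audio a path is dropping is to look at the sequence numbers of the packets that did arrive. The sketch below assumes RTP-style incrementing sequence numbers and, for brevity, ignores 16-bit wraparound; `loss_stats` is a hypothetical helper, not part of any library.

```python
def loss_stats(received_seq):
    """Estimate packet loss from the sequence numbers that arrived.

    Assumes sequence numbers increase by one per packet and do not wrap
    around within the window being measured.
    """
    unique = sorted(set(received_seq))  # tolerate reordering and duplicates
    expected = unique[-1] - unique[0] + 1
    lost = expected - len(unique)
    return lost, lost / expected


# Packets 102, 105 and 106 never arrived; 104 arrived before 103.
lost, rate = loss_stats([100, 101, 104, 103, 107])
print(lost, rate)
```

Every lost packet is a stretch of audio that concealment has to invent, so tracking this number alongside WER helps separate network problems from model problems.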

4. Mono mixing destroys speaker separation

How the call is recorded matters as much as how it is transported. If both parties are mixed into a single mono channel, then whenever they talk over each other the waveforms sum, and the model — plus any diarization step — has to untangle two voices from one signal. Dual-channel (stereo) recording keeps each leg of the call on its own track, so the caller and the agent are transcribed independently and overlap stops being a problem. For phone-based ASR, per-leg audio is one of the cheapest accuracy wins available.
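With a dual-channel recording, separating the two parties is a matter of de-interleaving samples. The sketch below assumes 16-bit PCM in native byte order with samples interleaved left, right, left, right; which channel carries the caller and which the agent depends on the recorder and is an assumption here.

```python
from array import array


def split_stereo(interleaved):
    """Split interleaved 16-bit stereo PCM bytes into two mono sample arrays."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(interleaved)
    return samples[0::2], samples[1::2]  # even indices = left, odd = right


# Three stereo frames: positive values on one leg, negative on the other.
stereo_bytes = array("h", [1, -1, 2, -2, 3, -3]).tobytes()
left, right = split_stereo(stereo_bytes)
print(left.tolist(), right.tolist())
```

Each resulting track can then be transcribed on its own, so overlapping speech never has to be untangled from a summed waveform.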

5. Noise, echo, and signal-processing artifacts

Background noise and reverberation raise WER faster than almost anything else, because they bury the speech the feature extractor is trying to isolate. The telephony stack’s own DSP can help or hurt here: aggressive automatic gain control (AGC), echo cancellation, and silence suppression / comfort noise can clip or mangle quiet consonants. The right answer is native noise cancellation in the media path that cleans the signal before it reaches the model, rather than leaving the ASR engine to fight noise it should never have received.

6. The train/test mismatch nobody controls for

Underneath all of the above sits one principle from acoustic modelling: a model performs best on audio that resembles what it was trained on. Most modern ASR engines are trained heavily on wideband, low-noise data. Feed them narrowband, companded, noisy telephony and you have a textbook acoustic mismatch: the same model that scores single-digit WER on studio audio can degrade to 40–50% WER on a legacy phone path. The model did not get worse; the input drifted away from what it learned. Closing that gap means delivering audio that looks more like the training distribution: wideband, denoised, and intact.

The takeaway: every one of these is a property of the telephony layer, decided before the ASR model ever runs. This is precisely the layer Vobiz owns. We do not build the ASR model; we make sure the audio reaching it is as clean as telephony allows, and we move it with the latency budget a live agent needs.

How Vobiz handles transcription

Vobiz is the telephony infrastructure beneath your transcription stack — the transport, recording, and audio-quality layer that powers whichever ASR engine you choose. It is not a speech-to-text model and not an AI agent; it is the rails that get clean audio to and from them.
  • Clean, AI-grade audio. Bidirectional audio streaming at 24 kHz with native noise cancellation, so your ASR engine receives wideband, de-noised speech instead of muddy 8 kHz narrowband — the difference between single-digit and 40–50% real-world WER.
  • Direct-carrier path, fewer codec hops. Single-hop, direct-carrier connectivity minimises the transcoding that compounds losses across carrier boundaries, so the model sees the highest-fidelity signal the call can carry.
  • Streaming transport for real-time ASR. Live call audio is delivered over WebSocket bidirectional streaming via Stream, so streaming ASR can return partial words in real time and your agent can barge in.
  • Recording for batch transcription. Native call recording produces the stored audio your batch ASR and post-call analytics run against, available from recordings — keeping each call leg clean for accurate diarization.
  • Built for the latency budget. Sub-80 ms single-hop, direct-carrier transport keeps the ASR → LLM → TTS turn under the ~1 second a natural conversation allows (legacy CPaaS paths hit 300–400 ms before the model even starts). It also minimises jitter and packet loss, so PLC rarely has to invent audio the model will mis-transcribe.
  • It powers your ASR, it does not replace it. Bring Deepgram, Whisper, Sarvam, or any engine, and connect agents like Vapi, Retell, ElevenLabs, or Pipecat through the integrations hub. Vobiz moves and records the audio; you own the intelligence.
In short: the accuracy of your transcription is decided long before your ASR model runs, in the media path that carries the call. Vobiz is that media path.

Frequently asked questions

What is the difference between voice transcription and ASR?
They describe the same outcome from two angles. Voice transcription is the result: speech turned into written text. ASR (automatic speech recognition) is the technology that produces that result automatically, without a human typist.

How accurate is ASR?
On clean, wideband audio, modern ASR engines often achieve single-digit WER (under 10%). On real-world telephony (narrowband 8 kHz audio with noise and accents), WER is much higher, historically 40–50% on legacy paths. Improving the audio quality is usually the fastest way to lower WER.

What is the difference between streaming and batch transcription?
Streaming transcription returns text within a few hundred milliseconds as the person speaks, which is required for live voice agents. Batch transcription processes a complete recording afterward, can use more context, is usually slightly more accurate, and is used for analytics and compliance.

Why is ASR less accurate on phone calls?
Because traditional telephony is 8 kHz narrowband audio, which discards the high-frequency detail that distinguishes consonants, and phone calls add background noise, packet loss, and accents. The model is rarely the bottleneck; the audio is. Delivering clean 24 kHz audio to the ASR engine is the biggest fix.

Do audio codecs affect ASR accuracy?
Yes. Lossy codecs like G.711 (μ-law/A-law) and especially low-bitrate codecs add quantisation noise the model has to see through, and transcoding (re-compressing the call across multiple carrier hops) degrades the signal generationally. Fewer codec hops and a higher-fidelity codec preserve more of the speech the ASR engine needs.

Does upsampling 8 kHz audio to 16 kHz improve ASR accuracy?
No. Upsampling cannot recover frequency detail that was never captured; the consonant energy above 4 kHz was lost at the 8 kHz sampling stage. It can even hurt, because a wideband model then sees an empty upper band. The only real fix is capturing and transporting wideband (16–24 kHz) audio end to end.

Which recording format is best for transcription, mono or dual-channel?
Dual-channel (stereo), where the caller and agent are on separate tracks. Mono mixes both voices into one waveform, so overlapping speech is hard to untangle and diarization suffers. Per-leg audio lets each side be transcribed independently and is one of the cheapest accuracy wins for phone-based ASR.

Where does Vobiz fit in a transcription stack?
Vobiz is the telephony infrastructure layer: it transports and records the call audio and delivers a clean 24 kHz stream to whichever ASR engine you choose (Deepgram, Whisper, Sarvam, and others). It powers your speech-to-text and voice-AI stack rather than replacing it.

Build on Vobiz

Provision a number and stream clean 24 kHz audio to your ASR engine in minutes.