Whisper vs Deepgram vs AssemblyAI: WER claims vs production reality

Every transcription provider quotes a single-digit WER on their landing page. The number is real on the benchmark and irrelevant to your audio. Here is how to think about provider choice when the numbers are this slippery.

If you have shopped for transcription services in 2026, you have seen the marketing chart. Every provider has one. Their model wins. The competing models are 30 percent worse. The benchmark is a clean dataset. The audio in your project is not.

This piece is about the gap between the number on the chart and the number you will see in production, and what that gap should change about how you pick a provider.

What the headline numbers actually say

Briefly, here is the 2026 leaderboard, with the measurement conditions noted because the providers move these numbers around.

OpenAI's GPT-4o-Transcribe sits at roughly 2.46 percent Word Error Rate on optimal-condition benchmarks and stays under 5 percent across most clean-audio test sets. It is currently the highest-accuracy proprietary model.

Deepgram Nova-3 reports 5.26 percent WER on batch-mode audio and 6.84 percent on streaming, measured across 2,703 production files spanning nine domains. Streaming latency runs around 300 milliseconds at the median and stays under 450 milliseconds at p95. Best in class for live captioning, voice assistants, and any real-time use.

AssemblyAI Universal-2 lands around 8.4 percent WER across diverse datasets, with a notable claim of 30 percent fewer hallucinations versus Whisper Large-v3.

OpenAI Whisper Large-v3 (the open-weights model) scores 9.0 percent WER on Common Voice 15 multilingual data, 2.7 percent on the LibriSpeech clean test set, and 17.7 percent on call-center recordings. The 99-language coverage is its biggest selling point.

If you only looked at the marketing pages, you would conclude GPT-4o-Transcribe wins, Whisper loses, and the difference is huge. The conclusion is half right.

Why the headline numbers do not predict your results

Three reasons.

Benchmark audio is not your audio. LibriSpeech, the most-cited transcription benchmark, is professional voice actors reading audiobooks in a quiet studio with one speaker. Common Voice is volunteer-recorded sentences, also single-speaker, also relatively clean. Real podcasts have music beds, two or more speakers interrupting each other, room noise, microphone variance, and editing-induced cuts. Real meetings have someone on a phone via VoIP, someone else on a laptop microphone, and a third person on a Bluetooth headset across town. The benchmark conditions and the conditions you submit are barely related.

WER aggregates errors that have very different costs. A 5 percent WER could mean five errors per hundred words that are all misspellings of "the" (annoying but readable) or five mis-rendered proper nouns per hundred words (which destroy your search index). Two systems with identical WER can be radically different in the kinds of errors they make. Each word of a hallucination (the model fabricating a sentence that was never spoken) counts the same in WER as a one-character typo, but the two are catastrophically different in downstream impact.
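
To see why that matters, compute WER by hand on a toy pair. Here is a minimal sketch using the jiwer library (a common open-source WER implementation; the providers' internal scoring pipelines are not public). Both hypotheses below receive the identical score, and only one of them poisons a search index.

```python
# pip install jiwer
import jiwer

reference = "the quarterly report from Acme Robotics shows the margin improved"

# One wrong word each: identical WER, very different downstream damage.
typo_version = "the quarterly report from Acme Robotics shows the margin improvd"
noun_version = "the quarterly report from Acne Robotics shows the margin improved"

print(jiwer.wer(reference, typo_version))  # 0.1 (one substitution in ten words)
print(jiwer.wer(reference, noun_version))  # 0.1, same score, broken search index
```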

Provider claims are generally honest about their best case. When AssemblyAI says 8.4 percent WER, they mean it averaged across their evaluation set. They are not lying. Your specific audio may be 4 percent (better than their benchmark) or 18 percent (worse than their benchmark). Without testing on your own audio, the headline number is a directional indicator at best.

Figure: grouped bar chart of per-provider WER across four audio conditions (clean studio audiobook, single-speaker podcast, multi-speaker meeting, call-center recording). The provider rankings shuffle across conditions; no provider wins every category.

Where each provider actually wins

Removing the marketing layer, here is what the available evidence suggests.

GPT-4o-Transcribe wins on clean, well-recorded audio with one or two speakers and standard English. The accuracy is real. The pricing is higher than alternatives and the latency is not optimized for streaming. Best for: post-production transcription of recorded interviews, podcasts with quality audio, lectures.

Deepgram Nova-3 wins on streaming and real-time use. The latency is genuinely the lowest of the major providers. Accuracy is strong on telephony and noisy audio. The multilingual streaming support (10 languages simultaneously without routing) is a real engineering achievement. Best for: live captioning, voice agents, real-time captions for accessibility.

AssemblyAI Universal-2 wins on noisy real-world audio and on the hallucination-resistance metric. The lower hallucination rate matters more than the headline WER for legal, medical, and journalism use cases where a fabricated sentence is worse than a missed word. Best for: high-stakes accuracy, content where invention is unacceptable.

Whisper Large-v3 (or hosted variants) wins on language coverage and on cost when self-hosted. Open weights, runs on a GPU you own, no per-minute charges, no data leaving your infrastructure. The 99-language list is the broadest available. Best for: multilingual content, privacy-sensitive workloads, high-volume processing where per-minute fees would dominate.
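
To make the self-hosting path concrete, here is a minimal sketch using the open-source openai-whisper package (the file name is a placeholder; faster-whisper is a common drop-in replacement when throughput matters).

```python
# pip install -U openai-whisper   (also requires ffmpeg on the system)
import whisper

# "large-v3" is the model behind the accuracy figures quoted above;
# "medium" or "small" trade accuracy for speed and VRAM on modest GPUs.
model = whisper.load_model("large-v3")

# language=None lets the model auto-detect among its 99 languages.
result = model.transcribe("interview.mp3", language=None)

print(result["text"])               # full transcript
for seg in result["segments"]:      # per-segment timestamps
    print(f"[{seg['start']:7.1f}s] {seg['text']}")
```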

Figure: four-quadrant matrix with axes "audio cleanliness" (low to high) and "real-time need" (low to high). Clean + batch: GPT-4o-Transcribe. Clean + streaming: Deepgram. Noisy + batch: AssemblyAI. Multilingual or self-host: Whisper.

How to actually compare providers for your use case

The right comparison is not "look at our benchmarks" but "process your audio with each option and read the results." That sounds tedious; it takes about an hour and saves months of pain.

A practical protocol (a scoring harness sketch follows the list):

  1. Pick three audio samples that represent your real workload. One easy (clean recording), one medium (typical case), one hard (noisy or multi-speaker).
  2. Submit each to two or three transcription tools.
  3. Read the outputs carefully, paying attention to: proper noun spelling, technical jargon, transitions between speakers, segments where the model went quiet vs. fabricated, timestamp accuracy.
  4. Note total cost per minute for each.
  5. Pick the one that gives you the smallest cleanup workload at an acceptable cost.
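
A minimal harness for scoring step 5, assuming you have hand-corrected reference transcripts for your three samples and a transcribe() wrapper around whichever SDKs you are testing (the wrapper and the provider names here are placeholders, not real client code):

```python
# pip install jiwer
import jiwer

SAMPLES = {
    "easy":   ("clean_interview.mp3", "refs/clean_interview.txt"),
    "medium": ("typical_episode.mp3", "refs/typical_episode.txt"),
    "hard":   ("noisy_meeting.mp3",   "refs/noisy_meeting.txt"),
}
PROVIDERS = ["provider_a", "provider_b", "provider_c"]  # placeholders

def transcribe(provider: str, audio_path: str) -> str:
    """Placeholder: call the provider's API and return plain-text output."""
    raise NotImplementedError

# Normalize case, punctuation, and whitespace so WER reflects real errors,
# not formatting differences between providers.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

for provider in PROVIDERS:
    for difficulty, (audio, ref_path) in SAMPLES.items():
        reference = normalize(open(ref_path).read())
        hypothesis = normalize(transcribe(provider, audio))
        score = jiwer.wer(reference, hypothesis)
        print(f"{provider:12s} {difficulty:6s} WER={score:.3f}")
```

The score from this harness is a tiebreaker, not a verdict. Step 3, actually reading the outputs, is where hallucinations and mangled proper nouns show up, and WER alone will not surface them.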

Most teams that do this end up with a different provider than they would have picked from the marketing pages. The reasons vary; the pattern is consistent.

What this means for the tool on this site

The audio transcription tool here exposes a Whisper-style API surface (the parameter names follow the Whisper conventions). Which provider runs behind that API is not the deciding factor; what matters is whether the output works for your audio. The same evaluation protocol applies: run a real sample through, read the result, decide.

The tool's pricing is per-minute, the file cap is 100 MB, the language list covers the same 99 languages that Whisper supports, and the output formats are the standard set (TXT, JSON, SRT, VTT, detailed JSON). Where it sits in the WER landscape will depend on your audio. The only way to know is to send a sample through.
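
For the "send a sample through" step, the request shape follows the Whisper conventions. Here is a sketch using Python's requests library; the endpoint URL and API key are placeholders, and "verbose_json" is the Whisper-convention name for the detailed JSON format:

```python
import requests

# Placeholders: substitute the tool's actual endpoint and your key.
URL = "https://example.com/v1/audio/transcriptions"
API_KEY = "YOUR_API_KEY"

with open("sample.mp3", "rb") as f:   # stay under the 100 MB cap
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={
            "model": "whisper-1",      # Whisper-convention model name
            "language": "en",          # optional; omit to auto-detect
            "response_format": "srt",  # text | json | srt | vtt | verbose_json
        },
        timeout=600,
    )

resp.raise_for_status()
print(resp.text)  # SRT-formatted subtitles, ready to save as sample.srt
```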

A quick decision shortcut

If you do not have the time for a full comparison and need a starting recommendation:

  • Streaming or real-time use: try Deepgram first.
  • Multilingual or self-hosted: try Whisper first.
  • High-stakes accuracy where invention is unacceptable: try AssemblyAI first.
  • Clean, well-recorded English audio and budget to spare: try GPT-4o-Transcribe first.
  • "I just need a transcript right now and I do not want to think about it": try the most-convenient option (a hosted Whisper-style API like the one on this site), then evaluate properly if the results disappoint.

The big lesson stays the same. The headline WER numbers are real. They are also a starting point, not a verdict. Test on your audio.
