Whisper across 99 languages: where it shines, where it doesn't
Whisper supports 99 languages on paper. In practice, the accuracy spread between Tier 1 and Tier 4 languages is large enough to change which workflows are feasible and which are not. Here is a field guide.
The selling point of Whisper, the open-weights transcription model from OpenAI, is that it works in 99 languages. The number is real. Whisper Large-v3 was trained on roughly 5 million hours of audio across that language inventory and can produce a transcript for almost any spoken human language you point it at.
The number also hides a structural reality: the accuracy spread across those 99 languages is enormous. The same model that delivers near-perfect English transcription is meaningfully worse on French, noticeably worse on Mandarin, and visibly broken on languages further down the list. If you are picking a transcription approach for a multilingual workload, the question is not "does Whisper support my language" (almost always yes). The question is "what accuracy tier is my language in, and is that tier good enough for my use case?"
This is the field guide.

How accuracy stratifies across the language list
The honest framing: Whisper's languages fall into roughly four tiers based on training data abundance and demonstrated WER on standardized benchmarks. The tier names are mine; the data is from public benchmarks (Common Voice 15, FLEURS, low-resource language papers).
Tier 1, production-ready (WER ~5-10% on real audio). English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Mandarin Chinese, Japanese, Korean. The model has the most training data here. Output is good enough for most production uses with a light human review.
Tier 2, usable with caveats (WER ~10-18% on real audio). Catalan, Czech, Danish, Finnish, Greek, Hungarian, Indonesian, Norwegian, Romanian, Slovak, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Hebrew, Arabic. Output is readable but needs careful proofreading. Error rates on proper nouns and jargon climb. Domain-specific use (medical, legal) requires a human pass.
Tier 3, best-effort (WER ~18-30% on real audio). Bulgarian, Croatian, Lithuanian, Tamil, Hindi, Bengali, Persian, Latvian, Serbian, Slovenian, Estonian, Marathi, Telugu, Welsh. Use cases are limited to "rough draft for a human translator" or "search index over messy data." Not suitable for publishing without heavy editing.
Tier 4, demonstration-grade (WER 30%+). Pashto, Punjabi, Urdu, Sinhala, Khmer, Burmese, Lao, Yoruba, Shona, Hausa, Sundanese, Javanese, and most of the long tail. The model produces something, often something worse than nothing. Empirical research on Pashto, Punjabi, and Urdu specifically has shown that vanilla Whisper performance for these languages is well below the threshold most users assume. Few-shot fine-tuning helps but still rarely brings these languages into Tier 1 or 2.
A reminder: these are rough bands, not exact rankings. Performance varies by audio quality, accent, dialect, and topic. A clean studio recording in Tier 3 might outperform a noisy field recording in Tier 1.
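If you are encoding the bands into a pipeline, they reduce to a lookup table. A minimal sketch in Python; the tier assignments mirror the lists above, and the dict and helper are my own naming, not part of any Whisper API:

```python
# Rough tier lookup mirroring the bands above. Hypothetical helper, not
# part of any Whisper API; extend with the languages you actually handle.
WHISPER_TIERS = {
    1: {"en", "es", "fr", "de", "it", "pt", "nl", "ru", "pl", "zh", "ja", "ko"},
    2: {"ca", "cs", "da", "fi", "el", "hu", "id", "no", "ro", "sk", "sv",
        "th", "tr", "uk", "vi", "he", "ar"},
    3: {"bg", "hr", "lt", "ta", "hi", "bn", "fa", "lv", "sr", "sl", "et",
        "mr", "te", "cy"},
}

def tier_of(lang_code: str) -> int:
    """Return the rough accuracy tier for an ISO 639-1 code."""
    for tier, langs in WHISPER_TIERS.items():
        if lang_code in langs:
            return tier
    return 4  # long tail: treat unlisted languages as Tier 4
```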
Why the tiers exist
Three forces shape where each language lands.
Training data abundance. Whisper learned from publicly available web audio. Languages with massive web presence (English, Mandarin, Spanish) have orders of magnitude more training material than minority languages. The model has seen more of how those languages sound across accents, contexts, and recording conditions.
Phonetic and orthographic complexity. Some languages are harder for any model to handle. Tone-based distinctions (Mandarin tones, Vietnamese tones), agglutinative grammar (Finnish, Turkish), or non-Latin scripts with optional diacritics (Arabic, Hebrew) raise the difficulty floor regardless of training data.
Dialect spread. Languages with many regional variants (Arabic, Mandarin, Hindi-Urdu) face a harder problem because the "language" is actually a family of dialects. The model has to perform across all of them, which spreads its training signal thinner.
These forces interact. Mandarin, despite being a tonal language with regional variants, sits in Tier 1 because training data abundance overcomes the difficulty. Welsh, despite being a phonetically straightforward Indo-European language, sits in Tier 3 because its training data is comparatively scarce.
What this means for picking a workflow
The tier of your audio's language should drive how you use the transcription output, not just whether you use it.
Tier 1 audio. Generate the transcript, do a light proofread, ship it. The output is good enough that a human reviewer is fixing 1-2 errors per minute, not rewriting the transcript. Use cases: production captions, podcast transcripts, meeting notes for distribution.
Tier 2 audio. Generate the transcript with a domain prompt that lists key terms in the target language (see the code sketch below). Have a fluent reviewer go through carefully. Plan for review time roughly equal to the audio duration. Use cases: customer support transcripts, internal recordings, subtitle drafts that go to a human translator.
Tier 3 audio. Treat the transcript as a starting scaffold. A reviewer will rewrite roughly 30 percent of it. Useful for search indexing, quick understanding of long-form content, and providing structure for a human transcription pass. Not suitable for publication without heavy editing.
Tier 4 audio. Honest answer: do not rely on auto-transcription as the primary workflow. Either commission human transcription, fine-tune a custom model on representative audio (which requires real ML investment), or wait for the next generation of multilingual models. Auto-transcription in this tier saves little time in practice: correcting the output often takes as long as having a human transcribe from scratch.
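Here is roughly what the Tier 2 prompt advice above looks like with the open-weights whisper package. The language and initial_prompt parameters are real parts of whisper's transcribe call; the model size, file name, and term list are placeholders:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")

# Tier 2 example: a Turkish support call. Seeding the decoder with domain
# terms nudges it toward consistent spellings for names and jargon.
result = model.transcribe(
    "support_call.mp3",  # placeholder file name
    language="tr",       # set explicitly rather than relying on detection
    initial_prompt="Müşteri destek kaydı. Terimler: API, webhook, fatura.",
)
print(result["text"])
```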
The translation option, briefly
The translation feature on Whisper-style APIs translates source-language audio into English transcripts. It is one-directional: it will turn Mandarin audio into an English transcript, but it will not turn English audio into Mandarin.
The accuracy of the translation depends on both the source language tier and the inherent translation difficulty. For Tier 1 source languages, the translation is roughly equivalent to "good machine translation": fluent, mostly accurate, sometimes loses idiom or nuance. For Tier 3-4 languages, the translation inherits all the source-language transcription errors and adds translation errors on top, often producing output that is grammatically English but semantically scrambled.
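With the open-weights package, translation is just a task parameter rather than a separate endpoint. A minimal sketch; the file name is a placeholder:

```python
import whisper

model = whisper.load_model("large-v3")

# task="translate" produces an English transcript from non-English audio.
# There is no reverse direction: the feature only targets English.
result = model.transcribe("mandarin_interview.mp3", task="translate")
print(result["text"])  # English output, bounded by the source-language tier
```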
When to use translation:
- You need a quick understanding of foreign-language content for an English-speaking audience.
- The source is in Tier 1 or 2.
- The downstream use is "get the gist" rather than "publish as the final translation."
When not to use it:
- Source language is Tier 3 or 4: errors compound.
- Translation needs to be publication-grade: hire a human translator.
- The source has technical or legal content: precision matters more than speed.
Code-switching and bilingual content
A specific failure mode worth naming: audio that switches between languages mid-conversation (a Mandarin podcast with English technical terms, a Spanish interview with English brand names, a Japanese meeting with English code-named projects).
Whisper handles code-switching unevenly. The model picks one primary language at the start of the audio and tries to apply it throughout, which means the secondary-language phrases often come out garbled or transliterated rather than transcribed in their actual language.
Workarounds:
- Set the primary language explicitly to whichever you want to dominate the output.
- Add the secondary-language terms to the prompt so the model at least spells them consistently.
- For heavily bilingual audio, do two passes: once with each language as primary, then merge the results manually.
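The two-pass option looks roughly like this with the open-weights package. The file name is a placeholder, and the zip pairing is only a crude alignment for review, since segment boundaries will differ between the passes:

```python
import whisper

model = whisper.load_model("large-v3")

# Pass 1: Mandarin as primary; English phrases may come out garbled.
zh = model.transcribe("bilingual_podcast.mp3", language="zh")
# Pass 2: English as primary; Mandarin phrases may come out garbled.
en = model.transcribe("bilingual_podcast.mp3", language="en")

# Whisper returns per-segment timestamps, so the two passes can be laid
# side by side for a human to pick the better rendering of each span.
# Note: segments will not align one-to-one; zip is only a rough pairing.
for s_zh, s_en in zip(zh["segments"], en["segments"]):
    print(f"[{s_zh['start']:6.1f}s] zh-pass: {s_zh['text']}")
    print(f"[{s_en['start']:6.1f}s] en-pass: {s_en['text']}")
```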
There is no clean "auto-handle code-switching" mode in current models. This is a known limitation.
A practical recommendation
For most multilingual workflows in 2026:
- Identify which tier your source language is in.
- Decide what your accuracy bar is for the use case.
- If the tier and the bar match, run the transcription and do the appropriate review.
- If they do not match, change the workflow: human transcription for Tier 4, professional translation for cross-language work, or a different provider that has invested specifically in your target language.
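Collapsed into code, this checklist is a few lines of routing logic. A sketch building on the hypothetical tier_of helper from earlier; the workflow labels are mine:

```python
def pick_workflow(lang_code: str, publication_grade: bool) -> str:
    """Route audio to a workflow from language tier and accuracy bar."""
    tier = tier_of(lang_code)  # hypothetical helper defined earlier
    if tier == 4:
        return "human transcription"
    if tier == 3:
        return ("auto draft + heavy human rewrite" if publication_grade
                else "auto transcript for search/indexing only")
    if tier == 2:
        return "auto transcript with domain prompt + fluent-reviewer pass"
    return "auto transcript + light proofread"
```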
Whisper's 99-language list is a real achievement. It is also a starting point, not a substitute for matching tool to task. The tier framework is an honest middle ground between "Whisper supports your language" and "use a different tool."
For the audio transcription tool on this site, which uses a Whisper-style API surface, the same tier intuition applies. The pricing per minute is the same regardless of the source language; the output quality is not. Plan accordingly.