Mandarin text-to-speech in 2026: dialect routing across MiniMax 2.8 and Qwen3-TTS
Mandarin text-to-speech in 2026 is a two-model toolbox. MiniMax 2.8 leads on voice library depth and broadcast polish; Qwen3-TTS leads on dialect coverage and Chinese WER. Here is the routing decision for nine varieties of Chinese.
Chinese text-to-speech in 2026 is in a different place than it was in 2024. The biggest shift is not naturalness, where most leading models cleared the bar a long time ago. The biggest shift is dialect coverage and tone accuracy: which dialects of Chinese are supported, how reliably the model handles polyphones in standard Mandarin, and how cleanly it switches between Chinese and English mid-sentence.
Two models inside the AI text-to-speech tool together cover most of the modern Chinese landscape: MiniMax Speech 2.8 from MiniMax, and Qwen3-TTS from Alibaba's Qwen team. They overlap in some places and diverge sharply in others. Picking between them is less a question of which one sounds better and more a question of which kind of Chinese you are producing.
Why Chinese is harder than Spanish or French
The structural reasons Chinese TTS is harder than most Romance languages are worth restating, because they explain why the two models differ where they do.
First, tone. Standard Mandarin uses four lexical tones plus a neutral tone. The same syllable in different tones is a different word: mā (mother), má (hemp), mǎ (horse), mà (scold). A model that produces the wrong tone produces the wrong word, not just an awkward delivery. Tone errors are rare in modern neural TTS, but they happen, and they are catastrophic when they do.
Second, polyphones. Chinese characters often have multiple pronunciations determined by the surrounding context. The character 行 is pronounced háng when it means "row" or "profession" and xíng when it means "to walk" or "okay". A model trained primarily on data-driven patterns will sometimes pick the wrong pronunciation in unusual contexts. Recent academic work probing TTS models reports that polyphone handling remains a weak spot industry-wide.
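To make the ambiguity concrete, here is a toy sketch of dictionary-plus-context polyphone lookup. The readings and cue words for 行 are real; the disambiguation strategy is an illustration of the problem, not how any production TTS front-end actually works.

```python
# Toy context-driven polyphone lookup. Cue-word matching is the sketch's
# assumption; real front-ends use learned grapheme-to-phoneme models.
POLYPHONE_READINGS = {
    "行": {
        "hang2": ["银行", "行业", "一行"],   # "bank", "industry", "a row"
        "xing2": ["行走", "不行", "行人"],   # "to walk", "not okay", "pedestrian"
    },
}

def guess_reading(char: str, sentence: str) -> str:
    """Pick a reading by matching known multi-character words in the sentence."""
    readings = POLYPHONE_READINGS.get(char, {})
    for reading, cue_words in readings.items():
        if any(word in sentence for word in cue_words):
            return reading
    # No cue word matched: fall back to the first listed reading --
    # exactly the situation where real models also guess wrong.
    return next(iter(readings), "")

print(guess_reading("行", "我去银行取钱"))   # hang2 ("bank" context)
print(guess_reading("行", "这样做行不行"))   # xing2 ("okay" context)
```

The fallback branch is the interesting part: when no known word surrounds the character, any system is guessing, which is why rare polyphones in unusual contexts remain a weak spot.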
Third, tone-3 sandhi. When two third-tone syllables appear in sequence, the first shifts to a near-second-tone in spoken Mandarin. The model has to recognize the pattern and apply the shift; native speakers do it without thinking; learners and TTS models both miss cases. Modern models handle most sandhi rules correctly but produce inconsistent results in chains of three or more third-tone syllables.
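The textbook form of the rule fits in a few lines. This sketch applies a single left-to-right pass over numbered-pinyin syllables; real spoken Mandarin in chains of three or more third tones also depends on prosodic grouping, which is exactly why long chains trip up both learners and models.

```python
# Simplified third-tone sandhi: a 3rd tone followed by another 3rd tone
# becomes a 2nd tone (ni3 hao3 -> ni2 hao3). One left-to-right pass;
# prosodic grouping in longer chains is deliberately ignored.
def apply_tone3_sandhi(syllables: list[str]) -> list[str]:
    out = list(syllables)
    for i in range(len(out) - 1):
        if out[i].endswith("3") and out[i + 1].endswith("3"):
            out[i] = out[i][:-1] + "2"
    return out

print(apply_tone3_sandhi(["ni3", "hao3"]))          # ['ni2', 'hao3']
print(apply_tone3_sandhi(["wo3", "hen3", "hao3"]))  # chains depend on grouping
```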
Fourth, code-switching. Chinese business content, technical writing, and social media routinely mix English words and acronyms inline. "我用 Python 写了一个 script" is a normal sentence in many contexts. A TTS model has to detect the switch, change phonological models, deliver the English word in English phonology, and switch back. Models that train primarily on monolingual data struggle here.
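The first step of handling that sentence is run segmentation: splitting the input into language-tagged spans before any phonological model is chosen. A minimal sketch, assuming the CJK Unified Ideographs block is a good enough proxy for Chinese text:

```python
# Split mixed Chinese-English text into (language, run) pairs. The CJK
# check covers only the main Unified Ideographs block (U+4E00-U+9FFF),
# which is an assumption, not full script detection.
import re

def segment_runs(text: str) -> list[tuple[str, str]]:
    runs = []
    for match in re.finditer(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z0-9_]*", text):
        chunk = match.group()
        lang = "zh" if re.match(r"[\u4e00-\u9fff]", chunk) else "en"
        runs.append((lang, chunk))
    return runs

print(segment_runs("我用 Python 写了一个 script"))
# [('zh', '我用'), ('en', 'Python'), ('zh', '写了一个'), ('en', 'script')]
```

Segmentation is the easy half; delivering the English run in English phonology without breaking the Mandarin prosody around it is the part the models differ on.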
Each of those four constraints maps onto one or both of the models in this tool. Knowing which model handles which constraint best is the routing decision.
MiniMax Speech 2.8 for broadcast Chinese
MiniMax Speech 2.8 is the model to reach for when your project needs a deep voice library, broadcast-grade audio, and standard Mandarin or Cantonese coverage with character voice variety. The strengths:
The voice library is the largest in the tool by a wide margin. MiniMax publishes more than 300 Chinese voices in the underlying catalog: news anchors, sweet-voiced narrators, mature women, gentle men, optimistic youth, and a sub-library of regional accents on top. The AI text-to-speech tool surfaces a curated subset of those voices, with both standard Mandarin and Cantonese available. For projects where you want to cast a voice from a deep bench (audiobooks, ad reads, branded podcasts, narrators that vary by chapter), no other model in the catalog matches the depth.
The tonal handling was rebuilt from the ground up in version 2.8. The MiniMax team specifically called out that tonal languages benefit from the rework, and the practical effect is that polyphones and complex sandhi cases land more reliably in 2.8 than in the 2.6 predecessor. Voice cloning is supported from 3 to 10 seconds of reference audio, with cloning preserving the speaker's timbre across language switches.
The HD variant of the model is broadcast-grade. The Turbo variant trades quality for speed and price, which matters for high-volume work but is generally not the right pick for any project where the audio is the deliverable. For most Chinese broadcast use cases, HD is the answer; Turbo is the fallback when latency or budget forces it.
Where MiniMax falls short on Chinese: dialect coverage outside Mandarin and Cantonese. The model handles those two well; it does not produce reliable Hokkien, Wu, or other regional dialects. If your script is in a non-standard Chinese variety, MiniMax is not the right pick.
Qwen3-TTS for dialect coverage and accuracy
Qwen3-TTS, released by Alibaba's Qwen team in January 2026, is the model for dialect work and the model that posts the lowest published Chinese word-error rate. The strengths:
Dialect coverage is the standout feature. Qwen3-TTS officially supports Mandarin, Cantonese (Hong Kong and Guangdong variants), Hokkien (the Southern Min dialect spoken in Fujian and Taiwan), Wu (the Shanghai and Suzhou dialect cluster), Sichuanese, Beijing dialect, Nanjing dialect, Tianjin dialect, and Shaanxi dialect. Two of the named voices in the catalog are explicitly dialect speakers: Dylan delivers in Beijing dialect, and Eric delivers in Sichuanese. For projects that need authentic regional Chinese, no other model in the tool gets close.
Accuracy on standard Mandarin is also industry-leading. The published Qwen3-TTS technical report measures word error rate at around 0.77 on the Seed-TTS Chinese test set, which is lower than the comparable numbers from MiniMax-Speech and ElevenLabs Multilingual v2 on the same test. Speaker similarity on Chinese is measured at around 0.799 to 0.811 depending on model size, again the highest in the comparison set. For projects where the script depends on getting tones right (educational content, formal narration, content where mispronunciation embarrasses the brand), Qwen3 has the strongest published numbers.
Code-switching is the third Qwen3 strength. The Qwen team's published cross-lingual stability tests put Qwen3 ahead of MiniMax-Speech and GPT-4o Audio on Chinese-English mixed-language inputs. For business content, technical scripts, and any modern Chinese writing that blends English terms inline, Qwen3 produces cleaner transitions.
The Base variant of Qwen3-TTS adds 3-second voice cloning, which the CustomVoice variant does not have. If your project needs a clone from a short Chinese-speaker reference, Qwen3 Base is the lowest-friction path.
Where Qwen3 falls short: voice library size. The CustomVoice variant ships with nine preset voices in the AI text-to-speech tool. The underlying model has 49 voice timbres in the broader catalog, but the curated subset on this tool is smaller than MiniMax's. If your project needs a voice library deep enough to cast every chapter of an audiobook with a different narrator, Qwen3 alone does not give you that.
A dialect-routing decision matrix
The two models cover most of what a Chinese-language project needs, but they cover different things well. A working decision matrix:
Standard Mandarin, broadcast quality, deep voice library. MiniMax Speech 2.8 HD. The 300-plus voice library, the broadcast-tuned audio, and the rebuilt tonal handling produce the most polished result for general Mandarin work.
Standard Mandarin, lowest possible error rate. Qwen3-TTS CustomVoice. The lowest published WER on Chinese, the strongest published speaker similarity, and the cleanest polyphone handling on technical and educational content.
Cantonese broadcast. Either works. MiniMax has more voice variety; Qwen3 has dialect-aware generation that is closer to the regional pronunciation patterns. Generate a representative passage in both and pick the one that sounds correct to a native speaker.
Hokkien, Wu, Shanghainese, or any non-Mandarin non-Cantonese Chinese variety. Qwen3-TTS only. MiniMax does not produce reliable output in these varieties.
Sichuanese, Beijing dialect, Tianjin dialect, Shaanxi dialect, Nanjing dialect. Qwen3-TTS, with the named dialect voices (Dylan for Beijing, Eric for Sichuanese) being the most direct path. The model accepts the dialect tag and produces output that native speakers recognize as authentic.
Chinese-English code-switched scripts. Qwen3-TTS. Published benchmarks favor it on cross-lingual stability, and the difference is audible on mid-sentence English insertions in technical content.
Voice cloning of a Chinese speaker. Qwen3-TTS Base for short references (3 seconds). MiniMax Speech 2.8 for longer references where you want broadcast-grade output and have 10 seconds of clean audio.
Multi-character drama or scene work in Mandarin. Either model, depending on cast size. MiniMax's voice library makes character variety easier; Qwen3's accuracy makes formal dialogue land cleaner.
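The matrix above collapses into a short routing function. The model identifiers and dialect tags here are illustrative placeholders, not real API values; the point is the precedence order, which mirrors the matrix: dialect first, then code-switching and cloning, then library depth.

```python
# Routing sketch for the decision matrix. Identifiers are placeholders.
NON_STANDARD_DIALECTS = {"hokkien", "wu", "shanghainese", "sichuanese",
                         "beijing", "nanjing", "tianjin", "shaanxi"}

def route(dialect: str, code_switched: bool = False,
          short_clone_ref: bool = False) -> str:
    dialect = dialect.lower()
    if dialect in NON_STANDARD_DIALECTS:
        return "qwen3-tts"        # only model with non-Mandarin dialect coverage
    if code_switched or short_clone_ref:
        return "qwen3-tts"        # cross-lingual stability, 3-second cloning
    # Standard Mandarin or Cantonese broadcast work defaults to the
    # deep voice library and broadcast-tuned audio.
    return "minimax-2.8-hd"

print(route("sichuanese"))                      # qwen3-tts
print(route("mandarin", code_switched=True))    # qwen3-tts
print(route("mandarin"))                        # minimax-2.8-hd
```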
Tone sandhi and polyphones in practice
The published WER numbers are reassuring at the population level, but they hide the cases that matter most in production. A handful of practical observations from working with both models on Chinese scripts:
Tone-3 sandhi works most of the time, in both models. Chains of three third-tone syllables in a row (the canonical hard case) produce the right shift in roughly 90 percent of attempts on either model. The 10 percent that miss are sometimes the highest-stakes lines (proper nouns, brand names, salutations). Generate the script, listen to chains explicitly, and re-generate the affected lines if the sandhi sounds off. There is no automatic fix in either model interface for sandhi correction.
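That listen-through step can be made systematic. A pre-flight sketch, assuming you already have a numbered-pinyin transcription of the script: flag every run of three or more consecutive third-tone syllables so those lines get an explicit listen before delivery.

```python
# Flag runs of 3+ consecutive third-tone syllables in a numbered-pinyin
# transcript -- the chains most likely to need a re-generate.
def flag_tone3_chains(syllables: list[str], min_run: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) index spans of third-tone runs worth re-listening to."""
    spans, start = [], None
    for i, syl in enumerate(syllables + [""]):   # sentinel closes the final run
        if syl.endswith("3"):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                spans.append((start, i - 1))
            start = None
    return spans

print(flag_tone3_chains(["wo3", "hen3", "xiang3", "ni3", "men2"]))  # [(0, 3)]
```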
Polyphone errors are rarer than they used to be but not zero. Common multi-pronunciation characters (行, 长, 重, 还, 当) hit correctly in most contexts. Less common polyphones in unusual surrounding text occasionally produce the wrong reading. Both models handle the easy cases. Neither handles the rare cases reliably enough to skip a final listen-through on important content.
For tonal accuracy, Qwen3 has the edge on quantitative tests. For voice library and broadcast polish, MiniMax has the edge on output quality. For unusual content (technical writing with English terms, formal speech with archaic vocabulary, regional dialect work), Qwen3 is the safer pick. For mainstream Mandarin content (audiobooks, ad reads, branded podcasts), MiniMax is the safer pick.
What I do for Chinese projects
The workflow that has settled in for me, after running both models on a few hundred Chinese-language projects:
Default to MiniMax 2.8 HD for any Mandarin or Cantonese project where the voice library and broadcast polish carry the work. That covers the bulk of mainstream Chinese audio.
Switch to Qwen3-TTS for any project that involves a non-standard dialect, a code-switched script, voice cloning from a short reference, or a script where the tonal accuracy needs to be measurably correct (educational content, formal narration, brand work where a mispronunciation embarrasses the brand).
For projects where I am not sure, I generate the same 200-character passage in both models and listen on the device the audience will use. The right model is usually obvious within two or three minutes of comparison. The wrong model is the one I would have picked from reading the marketing pages alone.
The good news is that 2026 Chinese TTS is a real two-model toolbox. The bad news is that you have to know which tool reaches for which job. Once you do, the output quality on either side is genuinely strong.