Mandarin text-to-speech without breaking the tones

The sentence "你好" sounds right in any Mandarin TTS system worth its salt. The phrase "我觉得很好" should sound right but often does not, because the third tone on "很" needs to shift to a rising tone before the third tone on "好", and not every model implements that rule cleanly. The phrase "他重新打开重要的文件" is a minefield, because "重" is one of the many polyphonic characters whose pronunciation depends on what it means in the sentence (here "chóng" in "重新" and "zhòng" in "重要"), and the model has to figure that out from context.

Mandarin TTS is not in the same category as English TTS. The model has to handle four lexical tones, a fifth neutral tone, several context-dependent tone shifts that apply when tones sit next to each other, and hundreds of polyphonic characters. Modern neural models do most of this, most of the time, but the failure modes are specific and the mitigation strategies are different from English.

This is the practical guide.

What goes right

Modern neural Mandarin TTS handles the basic case well. Standard Putonghua-flavored prose with vocabulary at or below the news-broadcast register comes out cleanly. The models have plenty of training data on this kind of input and produce natural intonation, correct lexical tones in isolation, and reasonable prosody at the sentence level. For most narration scripts that read like written Mandarin (full sentences, clear punctuation, standard vocabulary), the audio is acceptable on the first generation.

A modern Mandarin TTS catalog typically offers six to twenty voices, covering both genders with options across the warmer-narrator and crisper-announcer profiles. Picking a voice that fits the script, the same way you would in English, is the first decision.

This well-handled band covers most short-form content: product copy, news reads, app announcements, course narration, two-minute marketing scripts. If your script falls in this band, the failure modes below mostly do not apply, and you can ship without much intervention.

Where it goes wrong

Three categories of error account for almost all the audible mistakes in Mandarin TTS output.

Tone sandhi. Mandarin has explicit phonological rules about how tones change when they sit next to each other. The most common: a third tone (low-falling-rising) followed by another third tone shifts to a second tone (rising). So "你好", both syllables phonemically third tone, is actually pronounced "ní hǎo" in real speech, not "nǐ hǎo." Modern neural models pick up the most common sandhi rules from training data, but sandhi is not always cleanly applied. Long sequences of third tones (three or four in a row) are particularly tricky, because the rule depends on syntactic grouping, not just adjacency. A model that handles "你好" correctly may mishandle "我也很好," where the rule applies twice with different grouping options.
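To make the grouping point concrete, here is a minimal sketch of the third-tone rule, assuming tones are written as numbers (1 to 4, with 5 for neutral) and the prosodic grouping is supplied by hand. The function name and the flat "every third tone before another third tone rises" rule inside a group are illustrative simplifications, not a description of how any particular TTS model works.

```python
def apply_third_tone_sandhi(groups):
    """Return surface tones for a list of prosodic groups.

    `groups` is a list of lists of underlying tone numbers. Inside each
    group, a third tone followed by another (underlying) third tone is
    realised as a second tone. Sandhi across group boundaries is ignored,
    which is a simplification.
    """
    surface = []
    for group in groups:
        out = list(group)
        for i in range(len(group) - 1):
            if group[i] == 3 and group[i + 1] == 3:
                out[i] = 2
        surface.extend(out)
    return surface


# "你好" as one group: 3 3 -> 2 3
print(apply_third_tone_sandhi([[3, 3]]))
# "我也很好" grouped as [我也][很好] versus one flat group
print(apply_third_tone_sandhi([[3, 3], [3, 3]]))
print(apply_third_tone_sandhi([[3, 3, 3, 3]]))
```

The same four syllables come out as 2-3-2-3 or 2-2-2-3 depending on the grouping, which is the ambiguity a model has to resolve from syntax.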

There is a separate sandhi rule for "一" (yī) and "不" (bù), where the tone of these characters shifts depending on the tone that follows: "一个" (yí ge), "一天" (yì tiān), "不是" (bú shì), "不要" (bú yào). The model usually gets these right because they are extremely common in training data, but in less common collocations it can fall back to the citation form and produce slightly off audio.
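The "一" and "不" rule is small enough to write down. The sketch below is a hedged illustration, assuming tone numbers 1 to 4 and a hypothetical helper name; it deliberately ignores cases where "一" keeps its citation tone, such as ordinals and digit-by-digit numbers.

```python
def yi_bu_surface_tone(char, next_tone):
    """Surface tone of "一" or "不" given the underlying tone that follows.

    Pass the underlying tone of the next syllable (for "一个" that is
    tone 4 for "个", even though it is often read with a neutral tone).
    Pass None when nothing follows. Ordinals and counting are not handled.
    """
    if char == "不":
        # bù by default, bú before a fourth tone
        return 2 if next_tone == 4 else 4
    if char == "一":
        if next_tone is None:
            return 1  # citation form yī
        # yí before a fourth tone, yì before tones 1, 2, and 3
        return 2 if next_tone == 4 else 4
    raise ValueError("only 一 and 不 are handled")


print(yi_bu_surface_tone("一", 4))  # 一个
print(yi_bu_surface_tone("一", 1))  # 一天
print(yi_bu_surface_tone("不", 4))  # 不是
```

When the audio for an unusual collocation sounds off, comparing it against this rule is a quick way to tell whether the model fell back to the citation form.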

Polyphonic characters. A meaningful fraction of Mandarin characters have more than one pronunciation, and which pronunciation is correct depends on what the character means in the sentence. "重" can be "zhòng" (heavy, important) or "chóng" (again, repeat). "行" can be "xíng" (to walk, OK) or "háng" (a row, a profession). "长" can be "cháng" (long) or "zhǎng" (to grow, to lead). "了" can be "le" (the aspect particle) or "liǎo" (to understand, to finish). Modern models use context to disambiguate, and they get the common cases right, but the edge cases stay genuinely hard. A sentence where the polyphonic character occurs in an unusual context can come out wrong, and the listener notices immediately because the character is reading as the wrong word.
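One way to keep these characters in view during script review is to flag them mechanically. The sketch below assumes a small hand-made table containing only the four characters named above; the table and the function name are illustrative, and a real review list would be much longer.

```python
# Hand-made table for illustration only; not a complete polyphone list.
POLYPHONES = {
    "重": ("zhòng", "chóng"),
    "行": ("xíng", "háng"),
    "长": ("cháng", "zhǎng"),
    "了": ("le", "liǎo"),
}


def flag_polyphones(text, table=POLYPHONES):
    """Return (index, character, possible readings) for each flagged character."""
    return [(i, ch, table[ch]) for i, ch in enumerate(text) if ch in table]


for index, char, readings in flag_polyphones("他重新打开重要的文件"):
    print(index, char, " / ".join(readings))
```

The output is a checklist, not a verdict: each flagged position is a place to listen closely in the test generation.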

Code-switched English. Mandarin scripts that include English words or phrases (technology terms, brand names, abbreviations) put the model in a hard situation. The dominant approach in current models is to read the Mandarin in Mandarin and then attempt to read the English in something approximating an English accent. The result is sometimes acceptable and sometimes comes out as a transliterated, Pinyin-style reading of the English letters, which is jarring. Acronyms (CPU, USB, AI) tend to come out in spelled-letter form, which is correct, but the timing and accent of the spelled letters vary by voice and are not always natural.

[Figure: a three-column reference card showing the three failure categories (sandhi, polyphony, code-switching), each with a one-sentence description, a short example phrase, and a note on whether the issue is "usually fine," "sometimes fails," or "needs intervention." Color-coded for severity.]

How to script around the failure modes

The fix is in the script. Modern Mandarin TTS does what is asked of it; the audible mistakes mostly come from asking ambiguously. There are a small number of script-level practices that catch most of the failure modes before they reach the audio.

Read the script aloud yourself first, even if your Mandarin is rusty. A native or near-native speaker reading the script on their own will instinctively apply sandhi and pick the right polyphonic readings. Hearing yourself say it gives you a reference for what the audio should sound like. When the TTS output diverges from your reading, you have caught a problem before it ships.

Replace ambiguous polyphonic characters with unambiguous alternatives where possible. Chinese has a rich vocabulary, and many words built on polyphonic characters have synonyms or near-synonyms that avoid them. "重新" can become "再次" if the context allows. Fixed compounds are safer than free-standing uses: "了解" is always read "liǎo jiě", so it can stay, whereas a bare "了" in some other usage leaves the model to choose the reading. Substitution is not always possible without changing meaning, but where it is possible, it eliminates a class of failure.

Watch for long third-tone sequences. Three or four third tones in a row are a sandhi minefield. A sequence like "我也很满意" is underlyingly "wǒ yě hěn mǎn yì": four third tones in a row before the final fourth tone, with more than one plausible grouping, and even good models can produce slightly wrong tonal groupings. If you can rephrase to break up the sequence, do.
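Spotting these runs can be automated once the tones are annotated. The sketch below assumes the script has already been turned into (character, tone number) pairs, by hand or with a pinyin tool of your choice; the function name and the default threshold of three are illustrative.

```python
def third_tone_runs(syllables, min_len=3):
    """Return the text of every run of at least `min_len` consecutive third tones.

    `syllables` is a list of (character, underlying tone number) pairs.
    """
    runs = []
    start = None
    # A sentinel with tone 0 closes a run that reaches the end of the input.
    for i, (_, tone) in enumerate(syllables + [("", 0)]):
        if tone == 3:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append("".join(ch for ch, _ in syllables[start:i]))
            start = None
    return runs


sentence = [("我", 3), ("也", 3), ("很", 3), ("满", 3), ("意", 4)]
print(third_tone_runs(sentence))
```

Any run this reports is a candidate for rephrasing, or at least for inclusion in the test passage.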

For English in Mandarin scripts, decide ahead. There are three options: commit to reading the English term in clean English ("AI" spelled as the letters, "USB" the same), accept that the model's English reading may sound transliterated, or replace the English term with its Chinese equivalent ("人工智能" instead of "AI"). Mid-sentence English in Mandarin is the situation that produces the most listener complaints; pick the path that fits the script and stick with it.
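Whichever policy you pick, it helps to list every English term in the script first so that none is handled by accident. The sketch below is a rough illustration: it treats any run of ASCII letters and digits that starts with a letter as an English term, and the glossary is a hypothetical example you would supply yourself.

```python
import re

# Runs of ASCII letters/digits starting with a letter, e.g. "AI", "USB", "iPhone15".
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def find_english_terms(text):
    """Return (index, term) for each Latin-script run in the script."""
    return [(m.start(), m.group()) for m in LATIN_RUN.finditer(text)]


def apply_glossary(text, glossary):
    """Replace terms that have a Chinese equivalent in `glossary`; keep the rest."""
    return LATIN_RUN.sub(lambda m: glossary.get(m.group(), m.group()), text)


script = "这款AI助手支持USB接口"
print(find_english_terms(script))
print(apply_glossary(script, {"AI": "人工智能"}))
```

Terms that remain after the glossary pass are the ones to listen for in the test generation.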

Punctuate explicitly. Mandarin punctuation conventions differ slightly from English: the comma "，" is distinct from the enumeration comma "、", and the full stop "。", question mark "？", and exclamation mark "！" are full-width. Modern TTS uses punctuation as a strong prosodic signal, so a script that uses correct Mandarin punctuation produces noticeably better intonation than a script that uses Western-style punctuation throughout.
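A first cleanup pass for Western-style punctuation can be scripted. The sketch below is a hedged example that converts ASCII punctuation only when it directly follows a Chinese character, so decimals and English text are left alone. The mapping and function name are assumptions, and it cannot decide when a comma should really be the enumeration comma "、"; that still needs a human read.

```python
import re

ASCII_TO_CJK = {",": "，", ".": "。", "?": "？", "!": "！", ":": "：", ";": "；"}

# ASCII punctuation immediately after a CJK character, plus one optional space.
_AFTER_CJK = re.compile(r"(?<=[\u4e00-\u9fff])([,.?!:;])\s?")


def normalize_punctuation(text):
    """Swap ASCII punctuation for full-width forms when it follows a Chinese character."""
    return _AFTER_CJK.sub(lambda m: ASCII_TO_CJK[m.group(1)], text)


print(normalize_punctuation("你好, 世界. 版本3.5可以吗?"))
```

The "3.5" in the example is untouched because the full stop follows a digit, not a Chinese character.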

When to break the script into chunks

Long Mandarin scripts have the same general weakness that long English scripts have: the model can drift in pacing, energy, or even subtle pronunciation as the input gets longer. The thresholds are different, but the strategy is the same: break long scripts into chunks of 500 to 1,500 characters, generate each chunk with the same voice and settings, and concatenate the audio in your editor.

In Mandarin specifically, the chunks should break at natural prosodic boundaries: paragraph breaks, scene breaks, topic shifts, questions or exclamations that act as section closers. The model produces a slight reset of pacing at each chunk boundary, and you want that reset to land at a place where the listener's brain is already expecting a beat.

Avoid breaking a chunk in the middle of a long compound sentence, in the middle of a list, or right before a question. The audio will feel slightly choppy at the seam.
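The chunking advice above can be sketched as a greedy packer that prefers paragraph breaks and only falls back to sentence ends when a single paragraph is too long. This is an assumption-laden illustration: the function name is hypothetical, paragraphs are taken to be newline-separated, and a single sentence longer than the limit is kept whole rather than cut.

```python
import re

# Zero-width split point right after a sentence-final mark.
SENTENCE_END = re.compile(r"(?<=[。！？])")


def chunk_script(text, max_chars=1500):
    """Split a script into chunks of at most `max_chars`, breaking at paragraphs first."""
    chunks, current = [], ""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    for para in paragraphs:
        joined = f"{current}\n{para}" if current else para
        if len(joined) <= max_chars:
            current = joined
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(para) <= max_chars:
            current = para
            continue
        # Paragraph is too long on its own: fall back to sentence boundaries.
        for sentence in filter(None, SENTENCE_END.split(para)):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current)
                current = ""
            current += sentence
    if current:
        chunks.append(current)
    return chunks


demo = "甲乙丙。\n丁戊己。庚辛壬。"
print(chunk_script(demo, max_chars=8))
print(chunk_script(demo, max_chars=4))
```

It does not know about lists or topic shifts, so treat the output as a draft and move any boundary that lands mid-list or right before a question.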

Voice variety across Mandarin catalogs

Most mainstream providers split their Mandarin voices into two groups by gender, with both groups offering multiple options across the warmer-narrator and crisper-announcer profiles. Female voices skew toward the warmer end on average; male voices skew toward more neutral and crisper, but each group has variety inside it.

For a long-form narration project (audiobook draft, course narration, podcast), pick from the warmer voices and stick with them. For news reads, product announcements, or any script where neutral authority matters more than warmth, the announcer profile fits better. For child-facing or brand-soft content, the warm female voices are usually the right call.

Character and stylized Mandarin voices are still rare in most catalogs, even though the English catalogs have started to ship them. ElevenLabs and a handful of regionally-focused providers (Microsoft Azure with its Cognitive Services voices, Tencent Cloud, Alibaba Cloud) ship more variety than the OpenAI-style voice families. If your project specifically needs a character voice (a comedic role, a deeply theatrical reading voice, a voice with strong regional flavor), the broad-spectrum providers are the better starting point. The smaller catalogs are calibrated for clean, professional-broadcast Putonghua reading.

A practical end-to-end recipe

For a Mandarin script of around three thousand characters (roughly ten to twelve minutes of audio at a typical narration pace), the recipe that works:

Write the script in clean Mandarin with correct punctuation.
Read it aloud yourself or have a native speaker read it; listen for places where you stumble.
Replace any polyphonic characters where a synonym is available and natural.
Decide the policy for any English terms in the script (read as English, replace with Chinese equivalent, or accept the transliterated reading).
Generate a thirty-second test passage including the trickiest sentences. Listen carefully.
If the test passes, generate the script in chunks of around 1500 characters at natural prosodic breaks. Use the same voice and settings on every chunk.
Concatenate in an editor. Master the audio to the destination's loudness target. Confirm the seams between chunks are not audible.
Listen to the full result one more time, ideally on a phone speaker, before publishing. Mandarin TTS errors that are subtle on studio monitors can be obvious on cheap speakers.

The total intervention time is small relative to the cost of re-generating the whole project after discovering a tone problem in chapter three. Mandarin TTS is fast and cheap; it rewards the time spent on script review and test generation more than English does, because the failure modes are denser and harder to fix in post.

What I think the model is good at, in plain language

Modern Mandarin TTS is now good enough for most professional uses. The voices sound natural, the intonation is broadly correct, the four lexical tones are accurate in isolation, and the most common sandhi rules are applied correctly more than ninety percent of the time. The remaining gaps are real but specific: long third-tone sequences, edge-case polyphonic characters, code-switched English, and the relative absence of character voices on most catalogs. These are mitigable with script-level care or by picking a provider that has invested specifically in voice variety.

The mistake to avoid is treating Mandarin TTS as if it works exactly like English TTS. The tonal layer adds a class of failure modes that does not exist in English, and the script-level practices that reliably produce good output in Mandarin are different from the practices that work in English. Plan accordingly. The audio at the end is worth the script-level care.
