How to pick from eleven AI text-to-speech models for one script
Eleven AI text-to-speech models in one tool is paralyzing. Three filters, applied in order, narrow the catalog to one or two right answers in under a minute.
Eleven AI text-to-speech models on one page is paralyzing. The first time I opened the model picker on the new AI text-to-speech tool, I read the names, scrolled the catalog, played a short preview on whatever sat at the top, and shipped. That is the move most people make. It is fine for a quick test. It is not fine when the model you picked turns out to cost twenty times more than the one you should have picked, or to take three times longer to start speaking, or to silently truncate your script to 2,000 characters when the next chapter is 8,000.

The eleven models in the catalog are not interchangeable. They differ on price by roughly forty times between the cheapest and the most expensive. They differ on first-audio latency by an order of magnitude. They differ on language coverage from "English only" to "70-plus". A few do things no other model does. Multi-speaker dialogue with real laughter. Voice cloning from three seconds of audio. Dialect-aware Chinese narration. Knowing which one fits the script in front of you is a one-minute decision once you have the right framing.

::blog-tool{slug="ai-tts"}::

This is the framing that has worked for me.

### Stop comparing models. Start describing the script

The mistake most people make on a multi-model page is comparing the models. The right move is to describe the script first. The script tells you which filters to apply, and the filters narrow eleven options to one or two before you listen to a single preview.

Three filters do almost all of the work:

1. Does the audio need to play back in real time, or can it sit on a server for a few seconds while it generates?
2. What language is the script in, and does it have any tricky dialect or pronunciation needs?
3. What kind of script is this? Long-form narration, conversational chatbot reply, multi-speaker dialogue, voice cloning for a specific person, or short copy with stylistic emphasis.

Apply them in that order. The first filter eliminates most of the catalog if your answer is "real time". The second eliminates most of what is left if your script is in a language with thin coverage. The third tells you which of the remaining two or three models is the right one for the actual job.
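If it helps to see the order of operations as code, here is a minimal sketch of the three-filter cut in TypeScript. Everything in it is an illustrative assumption rather than the tool's actual API: the `Model` and `Script` shapes, the property names, and the feature strings are placeholders for whatever your own candidate list records.

```ts
// A minimal sketch of the three-filter cut. These shapes and properties are
// illustrative assumptions, not the tool's actual catalog or API.
type Model = {
  name: string;
  realtime: boolean;      // can stream first audio in roughly 200 ms or less
  languages: string[];    // languages and dialects the model covers well
  maxChars: number;       // per-request character cap
  features: string[];     // e.g. "multi-speaker", "cloning", "inline-tags"
};

type Script = {
  realtime: boolean;      // is a user waiting for the next word right now?
  language: string;
  neededFeature?: string; // e.g. "multi-speaker" for a dialogue scene
  chunkLength: number;    // the longest chunk you are willing to send
};

function shortlist(catalog: Model[], script: Script): Model[] {
  return catalog
    // Filter 1: real time or offline. Offline scripts skip this cut.
    .filter((m) => !script.realtime || m.realtime)
    // Filter 2: language and dialect.
    .filter((m) => m.languages.includes(script.language))
    // Filter 3: script type and character cap.
    .filter((m) => !script.neededFeature || m.features.includes(script.neededFeature))
    .filter((m) => m.maxChars >= script.chunkLength);
}

// A two-speaker English podcast scene, generated offline:
// shortlist(catalog, { realtime: false, language: "en",
//                      neededFeature: "multi-speaker", chunkLength: 3000 });
```

The shape of the function is the point: each filter is a cheap mechanical cut, and the expensive part (listening) only happens to whatever survives all three.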
### Filter 1: real time or offline

If the audio needs to start playing within 200 milliseconds of the request, like voice agents, chatbots, phone bots, or live translation, only two models in the catalog reach that target reliably. Eleven Flash v2.5 reports roughly 75 milliseconds of model inference time on short inputs. Inworld TTS 1.5 Mini reports under 130 milliseconds at the 90th percentile. Both are designed for streaming over a persistent connection.

Everything else in the catalog is offline-tier. Studio narration, podcasts, audiobooks, scripted videos, e-learning, voiceover for ad reads. All of those are jobs where the audio can take three to ten seconds to render and nobody notices. For those scripts, latency is not a filter.

::doc-callout{type="warning" title="Network and queueing eat the model number"}
The latency figures above are model-inference time only. Real-world end-to-end latency includes the network round trip from your client, queueing on shared GPU infrastructure, audio encoding, and your own application overhead. Most production voice-agent latency is not the model. It is the rest of the pipeline. A 75-millisecond model behind a slow gateway can still be a 600-millisecond user experience.
::

If your script is borderline (a customer-success greeting the user does not have to wait for, a podcast intro that gets generated overnight), treat it as offline. Real-time is reserved for "the user is listening for the next word right now."
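The number that settles this filter is time to first audio as measured from your own client, not a vendor's quoted inference figure. Below is a rough TypeScript sketch of that measurement against a hypothetical streaming endpoint; the URL, headers, and request body are placeholders for whichever provider you are testing.

```ts
// Measure time to first audio byte from your own client. The endpoint,
// headers, and payload are placeholders for whichever provider you test.
async function timeToFirstAudio(text: string): Promise<number> {
  const start = performance.now();
  const res = await fetch("https://api.example.com/v1/tts/stream", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer <your-key>",
    },
    body: JSON.stringify({ text, voice: "default" }),
  });
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);

  // Stop the clock at the first streamed chunk, not at response completion.
  const reader = res.body.getReader();
  await reader.read();
  const elapsedMs = performance.now() - start;

  await reader.cancel(); // only the first chunk was needed for this test
  return elapsedMs;
}
```

Run it a handful of times and judge by the slow runs. Queueing variance is exactly what the callout above is warning about.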
### Filter 2: language and dialect

This is where the catalog spreads out the most. The dozen most-spoken global languages are well covered by most models. Long-tail languages, regional dialects, and code-switched scripts are where the choice narrows fast.

For Mandarin Chinese with broadcast quality, MiniMax Speech 2.8 has more than 300 Chinese voices and a high-fidelity mode aimed at broadcast use. For Mandarin with the lowest error rate, Qwen3-TTS leads. Published numbers put it at around a 1.835 percent average word error rate across ten languages, and the lowest WER on Chinese specifically. Qwen3-TTS also covers Cantonese, Hokkien, Wu, Sichuanese, and several northern dialects through a single model. If the script switches between Chinese and English mid-sentence, Qwen3 handles that transition more cleanly than the alternatives.

For Hindi, Inworld added support in the 1.5 release. For 70-plus languages including a long tail of European and Southeast Asian options, Eleven v3 has the widest coverage, though its character cap per request is the strictest in the catalog. For English with non-verbal expressiveness like laughter, sighs, and throat-clears, Dia 1.6B is the only model that genuinely renders those as audio rather than substituting text-like sounds.

Run the language filter ruthlessly. Generic "supports 30 languages" claims often mean "the top dozen are good and the rest are passable". Verify on a representative passage before committing.

### Filter 3: what kind of script

Once latency and language have narrowed the catalog, the script type tells you the right answer.

**Long-form narration in a single voice.** Eleven Multilingual v2 is the workhorse. 10,000 characters per request, predictable neutral prosody, low surprises. It is what a three-hour audiobook chapter wants. The newer Eleven v3 produces more emotional range but caps at 3,000 characters per request on this tool, which means more chunking on long scripts.

**Audiobook with emotional range, voice acting, multi-speaker scenes.** Eleven v3 was built for this. The character caps (5,000 on the direct provider API, 3,000 on this tool) are a real constraint, but the audio tags and multi-speaker handling are where it earns its quality.

**Multi-speaker dialogue with non-verbal cues.** Dia 1.6B is the only model in the catalog that renders (laughs) and (coughs) and (sighs) as actual audio events rather than text approximations. It is English only, and it caps at 3,000 characters, but for a podcast scene with two speakers and real laughter, nothing else in this catalog does it.

**Voice cloning from a short reference clip.** Qwen3-TTS Base clones from a 3-second sample. MiniMax Speech 2.8 supports cloning from longer references. Both are inside this tool. Voice cloning is also the area where consent and disclosure are non-optional. See the next section.

**Short scripts with inline emphasis tags.** xAI Text-to-Speech ships with five named voices, full inline-tag support (pauses, whispers, slow segments, emphasis), and prices around $0.0042 per 1,000 characters. It is the cheapest model in the catalog by a wide margin and the most expressive per dollar for short copy. The trade-off is the 8,000-character cap and a smaller voice roster than the heavyweights.

**A 50,000-character chapter you do not want to chunk.** Few models accept inputs that large. MiniMax Speech 2.8 and Eleven Flash v2.5 take it in one request; Eleven Multilingual v2 tops out at 10,000 characters, so even the workhorse needs five chunks for a chapter that size. Everything else needs more splitting still; one way to do it is sketched below.
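When a script does need splitting, split after sentence boundaries rather than at a hard character offset, so no request ends mid-sentence. A minimal sketch, assuming nothing beyond standard TypeScript; the 3,000-character default simply mirrors the Eleven v3 cap mentioned above:

```ts
// Split a long script into chunks that fit a per-request character cap,
// breaking only after sentence-ending punctuation so no request ends
// mid-sentence. The 3,000 default mirrors the Eleven v3 cap on this tool.
function chunkScript(script: string, maxChars = 3000): string[] {
  // Naive sentence split: good enough for narration, not for text that is
  // heavy on abbreviations or decimal numbers.
  const sentences = script.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Close the current chunk if adding this sentence would overflow it.
    // Note: a single sentence longer than maxChars still becomes its own
    // oversized chunk; split those by hand before sending.
    if (current && current.length + sentence.length + 1 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + " " + sentence : sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}

// An 8,000-character chapter against the 3,000-character cap:
// chunkScript(chapter).length  // roughly three requests
```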
### A 90-second test workflow

The mistake most people make at the testing stage is generating the first three sentences of the script in three or four candidate models and picking the one that sounds nicest. The first three sentences of any script sound fine in almost any modern TTS model. The mistakes show up later, in long compound sentences, in lists with numbers, in proper nouns, in transitions where the emotional tone shifts.

A workflow that actually catches problems:

1. Pull a 200-word passage from the actual script. Not the opening. Pick a paragraph in the middle that has a number, a proper noun, and at least one shift in tone.
2. Apply the three filters in order. You should end with two candidate models, not five.
3. Generate the full passage in both. Listen end-to-end. Listen at 1.0 speed, not 1.5.
4. Hand the audio to one person who has not seen the script. If they can repeat the passage back to you accurately, the model passes the comprehension test.

The whole thing takes about a minute and a half once the script is in the editor. That is the cheapest way I have found to avoid re-generating a forty-minute project in a different model three days later.

### Where the leaderboards help, and where they do not

Public TTS leaderboards (the Artificial Analysis Speech Arena, for one) are useful for setting expectations. As of late 2026 the top tier of the arena is led by Inworld TTS 1.5 Max at an Elo of around 1,210, with Gemini 3.1 Flash TTS close behind at 1,206 and Eleven v3 at 1,178. These rankings move month to month and reflect blind-test naturalness, not any particular use case.

What the leaderboard does not tell you is which model handles your script's specific quirks: your accent, your code-switching, your numbers, your tone shifts, your character cap. That is what the test workflow above is for. Use the leaderboard to seed candidates, not to make the final call.

### What to remember

The eleven-model picker is overwhelming only if you treat the eleven options as roughly equivalent. They are not. Real-time work belongs to Eleven Flash and Inworld Mini. Long-form Mandarin belongs to MiniMax or Qwen3. Multi-speaker dialogue with real laughter belongs to Dia. Voice cloning belongs to Qwen3 Base or MiniMax. Generic studio narration belongs to Eleven Multilingual v2. Cheap inline-tag short copy belongs to xAI. The new emotional ceiling belongs to Eleven v3.

Three filters, sixty seconds of script picking, ninety seconds of side-by-side testing. Then ship.

::blog-tools{slugs="ai-tts voice-cloning ai-audio-to-audio" columns="3"}::