Inworld overtook Eleven v3 on naturalness: what the leaderboard does and does not prove

Inworld TTS 1.5 Max overtook Eleven v3 on the public TTS naturalness leaderboard. The result is real, but it is being read more confidently than it deserves. Here is what the leaderboard does and does not prove for production teams.


For a stretch of 2025, the answer to "which AI text-to-speech model sounds the most human" was the same name everyone reached for first: ElevenLabs. The Eleven v3 alpha showed up in June 2025, became generally available in February 2026, and seemed to settle the naturalness question for a while. Then Inworld released TTS 1.5 Max in late January 2026, climbed the public arena, and stayed there. As of late 2026, the top of the Artificial Analysis Speech Arena looks like this: Inworld TTS 1.5 Max at an ELO of around 1,210, Gemini 3.1 Flash TTS just behind at 1,206, Eleven v3 third at 1,178, and the remainder of the top five in a tight cluster.

That is real movement. It is also being read more confidently than it deserves. A leaderboard like this measures a specific thing under specific conditions, and "the most natural-sounding model in production for your script" is not exactly that thing. This piece is the careful read.

What the arena measures

The Artificial Analysis Speech Arena collects blind pairwise preferences. A visitor lands on the arena, listens to two short speech samples generated from the same input text by two anonymized models, and votes for whichever sounds more natural. The ELO score that emerges is a chess-style ranking, with each model's number drifting up or down based on which competitor it gets paired against and how its samples are received.
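To make that mechanic concrete, here is a minimal sketch of an Elo-style update after a single blind vote. The K-factor and the example ratings are illustrative assumptions; the arena does not publish its exact update parameters.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A wins one blind vote, under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 16.0):
    """Return both ratings after one vote. k is an illustrative K-factor,
    not the arena's actual parameter."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b + k * ((1 - s_a) - (1 - e_a))

# Example: a 1,178-rated model wins a single pairing against a 1,210-rated one.
print(update(1178.0, 1210.0, a_won=True))  # the lower-rated winner gains slightly more
```

The point of the sketch is the asymmetry: a win against a higher-rated model moves the number more than a win against a lower-rated one, which is why pairing luck and voting volume both matter to where a model settles.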

That is a useful thing to measure. It is also a narrow thing to measure.

The samples are short. The text is whatever the arena chooses to pipe through, not your script. The voter is whoever happened to be browsing that day, not your audience. The "naturalness" axis collapses prosody, emotion, accent comfort, audio cleanness, breath rhythm, and a dozen other components into a single yes-or-no preference. None of that disqualifies the result. It does mean the leaderboard cannot tell you which model best handles your script's specific quirks.

[Figure: horizontal bar chart of the top five models on the public TTS arena as of late 2026, with ELO scores labeled — Inworld TTS 1.5 Max ~1,210, Gemini 3.1 Flash TTS ~1,206, Eleven v3 ~1,178, Inworld TTS 1 Max ~1,165, MiniMax Speech 2.8 HD ~1,164. Scores drift over time.]

How Inworld got there

Inworld launched TTS 1.5 (both Max and Mini) on January 21, 2026, with a few claims that show up in nearly every third-party writeup since: a roughly 4x latency improvement over the prior generation, 30 percent greater expressiveness on internal evaluations, and a 40 percent reduction in word error rate. The company priced it aggressively at $15 per million characters for Mini and $25 per million for Max, which significantly undercut ElevenLabs' multilingual line on a per-character basis.

The Max model started showing up at the top of the public arena within weeks of launch. By March, third-party trackers were quoting an ELO around 1,236 for Max. The score has drifted since (it has been as low as 1,210 in recent reads), which is normal: the arena re-evaluates as new models enter the pool and as voting volume accumulates.

Two things are worth noting about the climb. First, Inworld 1.5 was specifically tuned for naturalness in short conversational samples, which is exactly what the arena measures. The model was built for the test it was about to ace. Second, the company cites its arena win in its own marketing, framing the leaderboard as "blind comparisons by thousands of real users evaluating which outputs sound more natural and human," though it does not publish exact vote counts. That is honest framing. It is also the limit of what the result proves.

What a competing benchmark says

If a single leaderboard were the whole story, it would be a strange world. Fish Audio, another TTS provider, published the results of its own 10-day blind A/B test in early 2026: production traffic, more than 5,000 preference pairs from real users on actual scripts. In that test, Fish Audio's S2 Pro ranked first with a Bradley-Terry score of 3.07, nearly 1.7x the next best model in the comparison set. That comparison set included ElevenLabs, Inworld, and MiniMax.
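For contrast with the arena's Elo, a Bradley-Terry model fits one strength parameter per model from the full table of preference pairs at once. Here is a minimal sketch of that fit using the standard MM iteration, on made-up vote counts rather than Fish Audio's actual data.

```python
import numpy as np

# wins[i, j] = number of blind votes where model i beat model j
# (made-up counts, not Fish Audio's data)
models = ["Model A", "Model B", "Model C"]
wins = np.array([[ 0, 12, 15],
                 [ 8,  0, 11],
                 [ 5,  9,  0]], dtype=float)

n = wins + wins.T              # total comparisons per pair
p = np.ones(len(models))       # strength parameters, initialized flat

# Hunter's MM iteration: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
for _ in range(200):
    total_wins = wins.sum(axis=1)
    denom = np.array([
        sum(n[i, j] / (p[i] + p[j]) for j in range(len(p)) if j != i)
        for i in range(len(p))
    ])
    p = total_wins / denom
    p /= p.mean()              # fix the scale: only ratios are identified

print(dict(zip(models, p.round(2))))
```

Bradley-Terry strengths are only identified up to scale, which is why ratios like "nearly 1.7x the next best model" are the natural way to report them.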

The Fish Audio test is also marketing for Fish Audio. So is Inworld's leaderboard claim. Both numbers are real, both are partial, and they happen to disagree. Which is the right read?

The quiet, useful conclusion: at the top of the modern TTS market in 2026, the gap between the leading three or four models on naturalness is narrow enough that different blind methodologies, different sample sets, and different listener pools produce different winners. None of these tests is wrong. They are measuring slightly different things, and the differences are inside the margin where listener identity and script type start to matter more than model identity.

[Figure: two-panel comparison of methodologies — the public TTS arena (short samples, anonymized pairs, public listeners, ELO ranking) versus the Fish Audio blind A/B test (production traffic, real scripts, 5,000+ pairs, Bradley-Terry ranking) — illustrating that two reasonable methodologies can produce different rankings.]

The practical read for production teams

What the leaderboard is good for: shortlisting. If you have not used any of the current top-five models and you need to pick two finalists for your own evaluation, the public ranking is a defensible starting set. Top three, top five, that is the right altitude.

What the leaderboard is not good for: making the final call. Three things tend to swing the pick once you sit down with real audio:

  • The script's prosody. Long sentences, lists, numbers, and proper nouns do not appear in arena samples but appear constantly in production scripts. A model that wins blind tests on conversational snippets can stumble on a dense paragraph.
  • The script's language. The arena scores naturalness mainly on English. Mandarin, Hindi, Spanish, French, and the long tail of regional languages and dialects are sampled far less heavily, and their votes are pooled into the same overall score. A model that ranks third overall might be first or second on your specific language.
  • The voice you actually want. The arena does not care which voice each model uses to win pairs. You do. A second-place model with a voice that fits your brand beats a first-place model with a voice that doesn't.

A workflow that handles all three: render a 200-word passage from your real script through the leaderboard's top three models. Listen end-to-end at normal speed. Listen on the device your audience will use, not on studio monitors. Pick the one that handles your script's quirks best, not the one with the highest ELO. Ten minutes of testing on top of one minute of leaderboard reading, and the model you ship is the right one.
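As a scaffold for those ten minutes, here is a sketch that renders the same passage through each shortlisted model and writes the audio under anonymized names so you listen blind. The synthesize stub and the model identifiers are placeholders; the actual endpoints, voice IDs, and auth come from each vendor's own docs.

```python
import pathlib
import random

# ~200 words pulled from the script you actually ship, not arena-style snippets
PASSAGE = pathlib.Path("real_script_excerpt.txt").read_text()

# Placeholder identifiers for the current arena top three; substitute your shortlist.
SHORTLIST = ["inworld-tts-1.5-max", "eleven-v3", "gemini-3.1-flash-tts"]

def synthesize(model: str, text: str) -> bytes:
    """Placeholder: call each vendor's TTS API here and return raw audio bytes.
    Endpoints, voice IDs, and auth are vendor-specific and intentionally not shown."""
    raise NotImplementedError(f"wire {model!r} to its provider's endpoint")

out_dir = pathlib.Path("tts_bench")
out_dir.mkdir(exist_ok=True)

# Shuffle and write under anonymous names so you listen blind, like the arena does.
shuffled = random.sample(SHORTLIST, k=len(SHORTLIST))
key = {}
for i, model in enumerate(shuffled):
    name = f"sample_{chr(65 + i)}.mp3"   # sample_A.mp3, sample_B.mp3, ...
    (out_dir / name).write_bytes(synthesize(model, PASSAGE))
    key[name] = model

# Reveal the mapping only after you have picked a winner by ear.
(out_dir / "key.txt").write_text("\n".join(f"{k} -> {v}" for k, v in key.items()))
```

The blind shuffle matters more than it looks: knowing which file came from which vendor is enough to bias a ten-minute listen.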

What I take from the upset

The Inworld result is genuine progress. A TTS landscape that two years ago revolved around one company has become a four- or five-company landscape in which the rankings move month to month and the gap between the top entries is small. Eleven v3 is still excellent. Inworld 1.5 Max is excellent. Gemini 3.1 Flash TTS is excellent. MiniMax 2.8 HD is excellent for Chinese broadcast work. The ranking changes; the ceiling rises.

For production, my own habit is to keep two or three of the top arena entries on the test bench at any given time, run the same passage through all of them on a regular cadence (every quarter is plenty), and only swap the production model when the alternative meaningfully wins on my specific script. A leaderboard upset is a reason to test, not a reason to migrate.

The arena did not pick your model. Your script picks your model. The arena just told you which two or three to test.
