Picking a real-time voice agent TTS: Eleven Flash vs Inworld Mini

Eleven Flash v2.5 and Inworld TTS 1.5 Mini both meet the sub-250ms target for natural voice-agent turn-taking. The right pick is less about latency on paper than about voice variety, language fit, and cost at volume.


If you are wiring a voice agent that needs to start speaking within a quarter of a second of the user finishing their turn, exactly two models in this AI text-to-speech catalog make the cut: Eleven Flash v2.5 and Inworld TTS 1.5 Mini. Everything else in the catalog generates audio too slowly to feel like conversation. Picking between these two is the question this piece is about.

The conventional framing is "which one is faster". The published numbers say Inworld Mini reports under 130 milliseconds at the 90th percentile time-to-first-audio, while ElevenLabs reports around 75 milliseconds of model inference time on Flash. Read at face value, Flash wins. Read carefully, the numbers are measuring two different things, and the right pick is closer to a tie than the headline suggests. Once you separate "model inference time" from "what your user actually experiences", the decision turns less on latency and more on three other axes.

What the published numbers actually mean

ElevenLabs publishes the 75-millisecond figure for Flash with explicit caveats in its own latency documentation. It is a model-only inference number, measured internally on representative short inputs. Their docs spell out the five-stage end-to-end latency pipeline a real voice agent runs through: network round-trip (20 to 200 milliseconds depending on geography), server processing (single-digit milliseconds), model inference (about 75 milliseconds), audio player buffering (typically 500 milliseconds), and application overhead from the upstream LLM and speech-recognition stages. The 75-millisecond number is one stage of five.
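The five stages can be added up as a back-of-envelope budget. The stage ranges below are the ones quoted above from Eleven's documentation; the application-overhead placeholder and the breakdown helper are my own sketch, not a vendor figure:

```python
# Back-of-envelope end-to-end latency budget for one voice-agent turn.
# Stage ranges follow the quoted documentation; "app_overhead" (upstream
# LLM + speech recognition) is a placeholder you must measure yourself.
STAGES_MS = {
    "network_round_trip": (20, 200),
    "server_processing": (5, 10),
    "model_inference": (75, 75),    # Flash's published model-only figure
    "player_buffering": (0, 500),   # tunable: shrinking the buffer cuts latency
    "app_overhead": (0, 0),         # assumption: measured separately, set to 0 here
}

lo = sum(v[0] for v in STAGES_MS.values())
hi = sum(v[1] for v in STAGES_MS.values())
print(f"end-to-end TTFA budget: {lo}-{hi} ms")  # prints "end-to-end TTFA budget: 100-785 ms"
```

The spread makes the point: the buffering and network stages, which you control, can dwarf the model-inference stage the vendors advertise.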

Inworld's 130-millisecond figure for Mini is a P90 time-to-first-audio measurement, which is closer to what an end user would perceive on a typical request. But Inworld does not publish detailed methodology for the measurement: which region, which network, which voice, which input length. Both numbers are useful. Both numbers are partial.

The honest comparison is closer to this: both models hit "natural conversational pacing" reliably under typical conditions, both give you sub-250-millisecond end-to-end latency in production when configured correctly, and the difference between them is inside the noise floor of network variance and audio-buffering choices. Configure either one badly and end-to-end latency balloons past 600 milliseconds. Configure either one well and the user does not notice the difference.

[Diagram: a horizontal pipeline of the five contributors to end-to-end voice-agent latency — network round-trip (20 to 200 ms), server processing (a few ms), model inference (75 to 130 ms), audio player buffering (around 500 ms), and upstream application overhead — with a callout noting that model inference is the only number both providers publish loudly.]

Where Flash v2.5 actually wins

Eleven Flash is the most mature of the real-time-tier TTS models. Some of that maturity translates directly into agent quality once you get past the latency-on-paper question:

  • Voice variety. ElevenLabs' voice library is the largest and best-curated in the industry, and Flash inherits the full library. If your agent needs a specific brand voice, an existing licensed voice, or quick switching between presenter, customer-success, and announcer registers, the depth of the Eleven catalog is hard to beat.
  • Language coverage. Flash supports 32 languages. Inworld Mini supports 15 (it added Hindi in the 1.5 release). If your agent serves users in Polish, Vietnamese, Tamil, or anything outside the most-spoken dozen, Flash is the more reliable bet.
  • Streaming maturity. ElevenLabs has shipped voice agents in production for two years. Their streaming docs, WebSocket guidance, chunk-size recommendations, and the surrounding ecosystem of integrations (LiveKit, Pipecat, Twilio adapters) are dense and well-tested. Inworld's ecosystem is growing, but Eleven's is the path of least surprise.
  • Reliability under variable input. Flash handles longer inputs (up to 40,000 characters), which matters less for a chatbot turn but matters a lot if the same model is also used for the post-call summary or the transcript playback.

The trade-off: Eleven's pricing for Flash on the underlying API runs around $0.05 per 1,000 characters ($50 per million), meaningfully more expensive than Inworld's $15-per-million-character pricing on Mini. For a high-volume agent serving millions of turns, the price difference adds up.
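To see how fast it adds up, here is the arithmetic at the quoted API prices. The turn length and monthly volume are my assumptions for illustration, not vendor figures:

```python
# Rough monthly cost comparison at the quoted per-character API prices.
# Assumptions (mine): an average agent turn of ~200 characters and
# 1 million turns per month.
FLASH_PER_CHAR = 0.05 / 1_000    # ~$0.05 per 1,000 characters
MINI_PER_CHAR = 15 / 1_000_000   # $15 per million characters

chars_per_turn = 200
turns_per_month = 1_000_000

def monthly_cost(per_char: float) -> float:
    return per_char * chars_per_turn * turns_per_month

flash = monthly_cost(FLASH_PER_CHAR)  # $10,000/month
mini = monthly_cost(MINI_PER_CHAR)    # $3,000/month
print(f"Flash: ${flash:,.0f}/mo  Mini: ${mini:,.0f}/mo  ratio: {flash / mini:.1f}x")
```

At that volume the gap is roughly $7,000 a month, which is why the per-character price matters more than it looks on the pricing page.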

Where Inworld Mini actually wins

Inworld's case is not subtle. The model is newer, faster on its published P90 number, and priced specifically to undercut the incumbents:

  • Lower P90 end-to-end latency. The under-130-millisecond figure on Mini is the most aggressive published number in this tier. If your agent is competing on perceived snappiness (for example, a customer-service bot where the user is stressed and watching a clock), Mini gives you headroom that Flash does not.
  • Better naturalness on the leaderboard. Inworld's parent model (Max) sits atop the public TTS arena as of late 2026. Mini does not match Max for naturalness, but it inherits the Inworld voice-quality direction and is competitive with Flash in most blind listening tests.
  • Lower price per character. $15 per million on the underlying API translates to $0.015 per 1,000 characters. For high-throughput voice agents, the unit economics make Mini three to four times cheaper than Flash on raw character costs.
  • Hindi support out of the box. Notable for any product targeting the Indian market or English-Hindi code-switched scripts.
  • Native LiveKit, Pipecat, and NLX integrations. The ecosystem is smaller than Eleven's, but the integrations that exist are recent and tuned for voice-agent workloads specifically, with examples and reference implementations.

The trade-off: fewer voices, fewer languages, less production track record, and a smaller ecosystem of pre-built integrations.

A real comparison test that produces a clear answer

The way to decide is not to read more pricing pages. It is to put both in front of a representative call and measure.

A reproducible workflow:

  1. Pull a 30-second representative agent turn from a real conversation log: a greeting, a confirmation question, and a handoff sentence with a number in it.
  2. Generate the same passage in Flash v2.5 and Inworld Mini through your real client, with the same WebSocket configuration and the same chunk size. Run it 20 times and capture end-to-end TTFA from your client logs, not from the API response.
  3. Run a blind listening test on all 40 outputs with one teammate. Have them rank the outputs on naturalness without knowing which model produced which.
  4. Compare three numbers: median TTFA, p90 TTFA, and naturalness rank.
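Step 2's measurement can be sketched as a small harness. `stream_first_chunk` below is a placeholder for your real client call (WebSocket or SDK); the point is to time "request sent" to "first audio bytes received" in your own client, not to trust the server's reported inference time:

```python
# Measure client-side time-to-first-audio and summarize median and p90.
# `stream_first_chunk` is a hypothetical callable standing in for your
# real streaming client; it must block until the first audio chunk arrives.
import math
import statistics
import time

def summarize(samples_ms: list) -> dict:
    """Median and nearest-rank p90 of TTFA samples in milliseconds."""
    s = sorted(samples_ms)
    rank = max(1, math.ceil(0.9 * len(s)))  # e.g. the 18th of 20 samples
    return {"median_ms": statistics.median(s), "p90_ms": s[rank - 1]}

def measure_ttfa(stream_first_chunk, text: str, runs: int = 20) -> dict:
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        stream_first_chunk(text)  # blocks until first audio bytes arrive
        samples.append((time.perf_counter() - t0) * 1000)
    return summarize(samples)
```

Run it once per model with identical WebSocket configuration and chunk size, and compare the two summaries side by side.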

That workflow usually produces one of three answers:

  • Flash wins on naturalness for your script and Mini wins on cost, and the call is which axis matters more for your agent.
  • Mini wins on both naturalness and latency, and Mini is the answer.
  • Both produce an indistinguishable user experience, and you pick on price and language coverage.

[Comparison card: "When each one wins" — pick Eleven Flash v2.5 for voice variety, 32 languages, streaming maturity, and longer-input reliability; pick Inworld TTS 1.5 Mini for lower P90 latency, lower per-character price, Hindi support, and modern voice-agent integrations. Each side carries a one-line note on its trade-off.]

When neither is the right answer

There are two cases where the right answer for a real-time agent is not in this catalog at all.

If your agent needs sub-100-millisecond TTFA in production, Cartesia Sonic 3 reports a 40-millisecond TTFA on its own benchmark and is the model people reach for at the absolute frontier of voice-agent latency. It is not in this catalog, but it is worth knowing about for the cases where Mini's 130 milliseconds is genuinely too slow. Most agents do not need that floor.

If your agent operates entirely in a single language pair where the open-source ecosystem has good coverage (English-only, for example), self-hosting an open model on your own GPUs gives you control over the buffering pipeline that managed APIs cannot. Open-source TTS in late 2026 is good enough for voice-agent quality, and the latency profile depends entirely on your hardware. Most teams do not have the operational appetite for it. The teams that do, win on cost.

For everyone else, Flash and Mini are the two real choices in this AI text-to-speech catalog, and the right pick is the one that wins a 30-minute test against your actual script.

What I do

When I am building a voice agent prototype, I default to Flash because the ecosystem is more mature and I burn fewer hours on plumbing problems. When I am moving to production at meaningful volume, I re-evaluate against Mini because the cost difference compounds. The latency question rarely tips the decision; the language fit, the voice catalog fit, and the unit economics do.

Run the test. Pick the one that wins on your script. Re-test in a quarter. Both vendors are shipping new versions on a quarterly cadence, and the right answer in March is sometimes not the right answer in September.
