The 50,000-character TTS chapter: which models even accept it

An audiobook chapter is 25 to 50 thousand characters. Most TTS models cap at 3,000. Three models in the AI text-to-speech tool accept the long stuff: MiniMax 2.8 (50k), Eleven Flash v2.5 (40k), and Eleven Multilingual v2 (10k). Here is which to pick when.


A standard audiobook chapter runs 15 to 30 minutes of audio, which works out to roughly 25,000 to 50,000 characters of text. Most of the eleven models in this AI text-to-speech tool will not take that as a single request. The typical cap is 2,000 to 3,000 characters, which means a single chapter splits into 8 to 25 pieces, each generated separately, with the discontinuities that come at every chunk boundary.

Three models in the catalog accept long requests without that problem. Picking the right one is most of the work for any long-form project that wants to ship without chunking artifacts. This is the short version of the choice.

The three models that actually accept long input

The character caps that matter for chapter-length work:

MiniMax Speech 2.8 caps requests at 50,000 characters. That is roughly 30 to 40 minutes of audio depending on speech rate, more than enough for a single audiobook chapter and most podcast episodes. It is the highest cap in the catalog, and more than an order of magnitude above the typical 3,000.

Eleven Flash v2.5 caps at 40,000 characters per request. About 25 to 30 minutes of audio, which is also enough for any reasonable chapter. The model was built for real-time streaming, but the high cap makes it practical for long-form too.

Eleven Multilingual v2 caps at 10,000 characters per request. The lowest of the three, but still high enough that a typical 4,000-word audiobook chapter fits in two requests instead of the eight to twenty-five most other models would need.

Everything else in the catalog caps at 8,000 characters or less, which means anywhere from three to twenty-five chunks per chapter, with chunk boundaries that the audio engineer has to mask in post.
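
To make the chunk arithmetic concrete, here is a quick back-of-envelope sketch in Python. The caps are the ones listed above, and "typical other model" is a stand-in for the rest of the catalog:

    import math

    # Character caps from the list above; the 25k-50k chapter range is from the intro.
    CAPS = {
        "MiniMax Speech 2.8": 50_000,
        "Eleven Flash v2.5": 40_000,
        "Eleven Multilingual v2": 10_000,
        "typical other model": 3_000,
    }

    for model, cap in CAPS.items():
        lo, hi = math.ceil(25_000 / cap), math.ceil(50_000 / cap)
        print(f"{model}: {lo}-{hi} request(s) per chapter")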

[Figure: horizontal bar chart, "Character cap by model" — MiniMax 2.8: 50,000; Eleven Flash v2.5: 40,000; Eleven Multilingual v2: 10,000; everything else: around 3,000. Footnote: an average audiobook chapter is about 25 to 50 thousand characters.]

When each one is the right pick

The three options split cleanly on three axes.

MiniMax 2.8 wins for Chinese long-form. The 50,000-character cap, the broadcast-grade HD audio, and the deep Chinese voice library together make it the right pick for Mandarin audiobooks, long-form Chinese podcasts, and any 30-plus-minute content where the voice is in MiniMax's strongest languages. For non-Chinese long-form, MiniMax is competent but the alternatives are usually a better fit.

Eleven Flash v2.5 wins for streaming long-form in 32 languages. The 40,000-character cap, the model's real-time streaming maturity, and the broad language coverage make it the right pick when long-form content needs to be delivered as a stream rather than batched. Live audiobook playback, real-time long-form translation, document-to-audio pipelines.

Eleven Multilingual v2 wins for predictable, neutral long-form narration. The 10,000-character cap is lower than the alternatives, but it is enough for one or two requests per chapter, and the prosody predictability is the highest of the three. For audiobook chapters that should sound consistent across hundreds of minutes (educational content, paid audiobooks where the listener will spend hours with the same voice), v2 is the right pick over the higher-cap alternatives.

The pattern: the right pick depends less on raw cap size and more on what the project actually needs from the model. A 50,000-character cap is impressive on paper. A 10,000-character cap with the right voice quality often produces better audio for the same content.
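
The split can be written down as a toy decision rule. This is a sketch of the three axes above, not any tool's real API; the function name and signature are invented for illustration:

    def pick_model(language: str, streaming: bool) -> str:
        # Toy rule mirroring the three axes above, not a real API.
        if language.startswith("zh"):
            return "MiniMax Speech 2.8"      # Chinese long-form: 50k cap, deep Chinese voice library
        if streaming:
            return "Eleven Flash v2.5"       # streamed delivery: 40k cap, streaming maturity
        return "Eleven Multilingual v2"      # batch narration: 10k cap, most predictable prosody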

A chapter-splitting recipe that holds up

Even with the highest caps, long-form work usually wants to be split deliberately, not simply dumped at the cap for the model to handle. A working recipe (a code sketch of the splitting step follows the list):

  1. Pick chunk boundaries that are content boundaries: scene breaks, paragraph breaks, or sentence boundaries. Never split mid-sentence. Never split mid-word.
  2. Aim for chunks of 60 to 80 percent of the model's cap, not 100 percent. The model has more headroom to handle prosody at the chunk boundary if it does not have to also manage the cap.
  3. Keep the same voice, the same settings, and the same seed across all chunks of the same chapter. Voice and prosody continuity depend on this.
  4. After generation, listen across the chunk boundaries with fresh ears. If the breath placement, pacing, or register shifts at the join, regenerate the affected chunks with slightly different chunk boundaries.
  5. For projects that ship the audio publicly, hand the chunked audio to a real audio engineer for the final pass. They are looking for things that the in-tool listening will miss.
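
A minimal sketch of steps 1 through 3, assuming plain-text chapters with blank-line paragraph breaks. The `tts` call, voice name, and seed in the usage comment are hypothetical stand-ins for whichever model API the project uses:

    import re

    def chunk_chapter(text: str, cap: int, target: float = 0.7) -> list[str]:
        """Split a chapter at paragraph boundaries into chunks of at most
        target * cap characters (the 60-80 percent headroom rule in step 2)."""
        budget = int(cap * target)
        chunks, current = [], ""
        for para in re.split(r"\n\s*\n", text.strip()):
            if len(para) > budget:
                raise ValueError("paragraph exceeds budget; split it by hand at a sentence break")
            if current and len(current) + len(para) + 2 > budget:
                chunks.append(current)          # close the chunk at a paragraph boundary
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
        return chunks

    # Step 3: same voice, same settings, same seed for every chunk.
    # for chunk in chunk_chapter(chapter_text, cap=10_000):
    #     audio = tts(chunk, voice="narrator-01", seed=42)   # hypothetical call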

For chapters that fit in one request on MiniMax 2.8 or Eleven Flash, none of this is needed and the audio comes out clean. For chapters that need two or three requests on Multilingual v2, the chunking discipline is what produces output that sounds like one continuous read instead of three different reads stitched together.

What I do for long-form work

The default has settled in for me as: MiniMax Speech 2.8 for Chinese chapter work, Eleven Multilingual v2 for English chapter work where predictability is the priority, and Eleven Flash for streaming long-form in any of its 32 languages. The 50,000-character cap rarely tips the decision; the voice quality and the language fit do.

The takeaway is small but useful: when a long-form script lands on your desk, the first question is "what languages and what voice quality" rather than "what is the highest character cap". The cap is a hygiene factor. The voice is the deliverable.
