Text-to-speech pricing math for long scripts and small teams

The pricing line on a TTS tool reads like a footnote: "credits per 1,000 characters." For a one-paragraph product description, the credit cost is small enough to ignore. For an audiobook, a course library, or a podcast that runs forty episodes a year, the cost adds up faster than the per-character rate suggests.

This is the math, the pricing levers small teams can actually pull, and how the synthetic-voice line item compares with the alternatives. No vendor pitch, just the numbers.

How character-based pricing converts to time

The catalog charges per 1,000 characters. To translate that into something your project planning can use, work backward from the audio duration.

Average English narration runs around 150 words per minute. A word averages about five characters, plus spaces and punctuation, so per-minute character count is roughly 800–950 characters in continuous prose. Round to 900 characters per minute as a working estimate.

That gives some useful thumb-rules:

A 60-second clip ≈ 900 characters.
A 5-minute video voice-over ≈ 4,500 characters.
A 30-minute podcast episode ≈ 27,000 characters.
A 60-minute audiobook chapter ≈ 54,000 characters.
A 10-hour audiobook ≈ 540,000 characters.

For Mandarin, the math runs differently. Each character in Mandarin is roughly one syllable, and Mandarin TTS produces audio at around 250 characters per minute. A 30-minute Mandarin podcast is closer to 7,500 characters than to 27,000.

For other languages, the per-minute rate mostly sits between the Mandarin and English cases, or close to the English one. Romance languages, with longer average words, run somewhat higher than English; agglutinative languages like Finnish or Turkish sometimes run slightly lower. For planning purposes, the English thumb-rule of 900 characters per minute works as a reasonable estimate for most alphabetic languages (add a margin for Romance languages), and the Mandarin thumb-rule of 250 characters per minute works as a lower bound for character-script languages.
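The thumb-rules above reduce to one multiplication. The sketch below uses the two working rates from this section; the function and dictionary names are illustrative, and the rates are planning estimates rather than measurements of any particular voice.

```python
# Working rates from the text above; both are planning estimates.
CHARS_PER_MINUTE = {
    "english": 900,   # ~150 words per minute x ~6 characters per word incl. spaces
    "mandarin": 250,  # roughly one syllable per character
}


def estimate_characters(minutes: float, language: str = "english") -> int:
    """Estimate script length in characters for a target audio duration."""
    return round(minutes * CHARS_PER_MINUTE[language])


def estimate_minutes(characters: int, language: str = "english") -> float:
    """Estimate audio duration in minutes for a script of a given length."""
    return characters / CHARS_PER_MINUTE[language]


if __name__ == "__main__":
    examples = [("60-second clip", 1), ("30-minute episode", 30), ("10-hour audiobook", 600)]
    for label, minutes in examples:
        print(f"{label}: ~{estimate_characters(minutes):,} characters")
    print(f"30-minute Mandarin episode: ~{estimate_characters(30, 'mandarin'):,} characters")
```

For a language without its own measured rate, adding an entry to the dictionary after timing a short sample is likely more reliable than borrowing either bound.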

Project-by-project cost shape

Most TTS work falls into one of a few project shapes. The pricing differs more by shape than by per-character rate.

One-shot small jobs (a 30-second app announcement, a one-paragraph product description, a short ad spot). Single-digit-thousand character counts. The cost is functionally a rounding error against any production budget the project has. Do not optimize.

Recurring small jobs (a podcast intro, a weekly news brief, a subscription email's audio version). Single-digit-thousand characters per instance, multiplied by the cadence. A 600-character intro generated once per week is 31,200 characters per year. Still small in absolute terms, but worth setting up the production pipeline to avoid re-generating the same intro from scratch every week: generate once, reuse the file, regenerate only when the script changes.

Long-form single jobs (an audiobook, a course module, a documentary voice-over). Tens to hundreds of thousands of characters in a single production cycle. Per-character cost adds up. A 540,000-character audiobook costs measurably more than a 27,000-character podcast episode. Worth planning the script with the per-character cost in mind: trim, do not re-generate the whole project to test variations, master the audio so you do not re-generate from scratch when small fixes are needed.

Recurring long-form jobs (a course library being expanded continuously, an audiobook publisher producing multiple titles per year, an educational platform with new modules monthly). Cumulative character counts in the millions per year. The per-character cost matters, and the production-pipeline efficiency matters more. Tools and templates that let you regenerate just the changed segments instead of the whole module are where the savings come from.

Voice-agent or live-product applications (an IVR system, a real-time voice agent, a streaming-audio product). Character volume is unpredictable but tends to be high under load. The per-character economics matter, and the latency economics matter more. A voice agent's cost is the per-character generation cost plus the operational cost of the latency budget; both are real.

Figure: a bar chart showing five project shapes (one-shot small, recurring small, long-form single, recurring long-form, voice-agent) with their typical annual character volumes plotted on a log scale. Each bar is annotated with a one-line note on which lever moves the cost most for that shape.

The four levers that actually change cost

For small teams, four levers move the per-project total more than picking a different vendor at a slightly different per-character rate.

Lever one: trim the script before generating. Spoken content benefits from being shorter than its written equivalent. A blog post you would publish at 1,800 words usually narrates better at 1,200: the narration cuts the throat-clearing, the rhetorical asides, and the parenthetical detours that work in written prose but feel padded in audio. Trimming a 27,000-character script to 18,000 characters cuts a third of the cost without compromising the listener experience. Do the editorial pass before you generate.

Lever two: regenerate only the changed segments. When a script changes (a name correction, a sentence revision, a new sponsor read), do not regenerate the whole project. Regenerate just the affected segment and splice it into the existing audio in your editor. The infrastructure cost of editing audio in place is small; the savings on regeneration are real. For an audiobook chapter where one paragraph needs a fix, regenerating one paragraph instead of an hour of audio is the difference between a fraction of a cent and anywhere from tens of cents to a few dollars per fix, depending on the rate, multiplied across the maintenance lifetime of the content.
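One way to find the segments that need regeneration is to fingerprint each paragraph and compare against the last rendered version of the script. This is a minimal sketch, assuming paragraphs separated by blank lines are the unit you splice in your editor; the function names are hypothetical and no TTS API is called.

```python
import hashlib


def split_segments(script: str) -> list[str]:
    """Split a script into paragraphs, treating blank lines as separators."""
    return [p.strip() for p in script.split("\n\n") if p.strip()]


def segment_key(text: str) -> str:
    """Stable fingerprint for a segment's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def segments_to_regenerate(old_script: str, new_script: str) -> list[str]:
    """Return the segments of new_script whose text was not in old_script."""
    already_rendered = {segment_key(s) for s in split_segments(old_script)}
    return [s for s in split_segments(new_script) if segment_key(s) not in already_rendered]


def regeneration_characters(old_script: str, new_script: str) -> int:
    """Characters you would pay for if only changed segments are regenerated."""
    return sum(len(s) for s in segments_to_regenerate(old_script, new_script))


if __name__ == "__main__":
    old = "Welcome back.\n\nThis week: pricing.\n\nThanks for listening."
    new = "Welcome back.\n\nThis week: pricing math.\n\nThanks for listening."
    print(segments_to_regenerate(old, new))
    print(regeneration_characters(old, new), "of", len(new), "characters")
```

Keying on content rather than position means that inserting or reordering a paragraph does not mark its unchanged neighbors for regeneration.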

Lever three: generate at archive quality once, encode for delivery from the master. The TTS generation is the expensive step; re-encoding to a different format is free. Generate the master in WAV or FLAC, master to your loudness targets once, and encode lossy delivery files (MP3, AAC, Opus) from the master as needed. If you ever need to deliver the same content in a different format or at a different bitrate, you do not pay for regeneration; you encode from the master. The savings compound across long-term projects.
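The encode-from-master step can be scripted so that no delivery format ever triggers a regeneration. The sketch below only builds ffmpeg command lines; it assumes ffmpeg is installed with the libmp3lame, aac, and libopus encoders, and the bitrates and the `delivery` folder name are illustrative choices, not recommendations from this article.

```python
from pathlib import Path

# Hypothetical delivery targets: the encoder names are standard ffmpeg
# encoders; the bitrates are placeholders to adjust per project.
DELIVERY_TARGETS = {
    ".mp3": ["-c:a", "libmp3lame", "-b:a", "128k"],
    ".m4a": ["-c:a", "aac", "-b:a", "128k"],
    ".opus": ["-c:a", "libopus", "-b:a", "64k"],
}


def delivery_commands(master: str, out_dir: str = "delivery") -> list[list[str]]:
    """Build one ffmpeg command per delivery format from a lossless master."""
    stem = Path(master).stem
    commands = []
    for suffix, codec_args in DELIVERY_TARGETS.items():
        target = (Path(out_dir) / (stem + suffix)).as_posix()
        # -y overwrites an existing output file without prompting.
        commands.append(["ffmpeg", "-y", "-i", master, *codec_args, target])
    return commands


if __name__ == "__main__":
    for command in delivery_commands("chapter01.wav"):
        print(" ".join(command))
```

To actually encode, each command can be passed to `subprocess.run(command, check=True)` once the output folder exists.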

Lever four: pick the right voice the first time. Re-generating a project after discovering the wrong voice was picked is the single most expensive avoidable cost in TTS production. A casting worksheet is a fifteen-minute investment that saves hours of regeneration if the cast turns out to be wrong. For projects over five minutes of audio, get this right before generating the bulk.

Where TTS sits against the alternatives

For comparison, the order-of-magnitude alternatives a small team is choosing among.

Hire a human narrator on demand. Professional narration runs roughly $100 to $400 per finished hour for non-celebrity narrators in the U.S., higher for high-profile names or specialized work, lower in lower-cost markets. A 30-minute podcast episode runs $50 to $200 in narrator fees, plus studio time and editing if you do not have those in-house. The narrator delivers the recorded performance; you still do the post-production. The cost is real, the quality is high, and the human-narrator profile fits content where voice quality is part of the product.

Subscribe to a premium TTS service with cloning. Subscription tiers on the higher end of the market start at low double-digit dollars per month and scale into the hundreds for production volume. The per-character cost on subscription services is often competitive at scale; the friction is the subscription itself: you commit to a recurring expense whether you use it heavily that month or not.

Use the per-credit catalog. The per-1,000-character credit pricing scales with use. You pay for what you generate. A 30-minute podcast episode (about 27,000 characters) costs tens of cents for the synthesis itself at typical per-1,000-character rates. A 10-hour audiobook is in the dollar range, not the tens of dollars range. For irregular use (a few projects per quarter rather than continuous output), per-credit pricing typically wins against subscription pricing; for continuous heavy use, the per-credit math often still wins on absolute terms but the predictability of a subscription can be worth the small premium.
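The subscription-versus-credit question comes down to a break-even volume. The sketch below is a deliberate simplification: it treats the subscription as a flat fee and ignores included quotas and overage charges, which real plans usually have. The $15 fee and $0.015 rate in the example are hypothetical inputs, not quotes from any provider.

```python
def per_credit_cost(characters: int, rate_per_1k: float) -> float:
    """Pay-as-you-go cost in dollars for a given character volume."""
    return characters / 1000 * rate_per_1k


def break_even_characters(monthly_fee: float, rate_per_1k: float) -> int:
    """Monthly volume at which pay-as-you-go spend equals a flat subscription fee."""
    return round(monthly_fee / rate_per_1k * 1000)


if __name__ == "__main__":
    fee, rate = 15.0, 0.015  # hypothetical plan price and per-1,000-character rate
    volume = break_even_characters(fee, rate)
    hours = volume / 900 / 60  # 900 characters per minute, the working English estimate
    print(f"Break-even: {volume:,} characters per month (~{hours:.1f} hours of audio)")
```

Below the break-even volume the per-credit path is cheaper under these assumptions; above it, the comparison depends on what the plan actually includes.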

Self-host an open-weight model. Models like Kokoro-82M are Apache-licensed and run on commodity hardware. The infrastructure cost is the GPU or CPU instance you run, plus the developer time to set up and maintain the pipeline. For a small team without a dedicated infra person, this is rarely the right call; for a team with infra capacity and predictable high volume, it can be the cheapest path. The hidden costs are reliability, model updates, and the team time to maintain it.

For most small teams in 2026, the per-credit catalog is the right path: pay for what you use, do not commit to a subscription, do not run infrastructure, get production-quality voices at predictable per-character pricing.

A worked example: a weekly podcast

Take a podcast that publishes weekly with a 30-second TTS intro, a 30-second TTS outro, and (occasionally) a 60-second sponsor read using TTS. Per episode:

Intro: ~450 characters.
Outro: ~450 characters.
Sponsor read: ~900 characters when present (assume present in 30 of 52 weeks).

Per year: 52 × 900 = 46,800 characters for intro/outro + 30 × 900 = 27,000 characters for sponsor reads = ~73,800 characters total.
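The same arithmetic as a short script, so the cadence or segment lengths can be changed without redoing it by hand. The 900-characters-per-minute figure is this article's working estimate; the two rates are the per-1,000-character range quoted in the audiobook example of this article, used here as assumptions rather than a price list.

```python
CHARS_PER_MINUTE = 900  # working English estimate


def chars_for_seconds(seconds: float) -> int:
    """Estimated characters for a segment of the given duration."""
    return round(seconds / 60 * CHARS_PER_MINUTE)


def annual_characters(episodes: int = 52, sponsor_weeks: int = 30) -> int:
    """Yearly volume: intro and outro every episode, sponsor read some weeks."""
    intro = chars_for_seconds(30)
    outro = chars_for_seconds(30)
    sponsor = chars_for_seconds(60)
    return episodes * (intro + outro) + sponsor_weeks * sponsor


if __name__ == "__main__":
    total = annual_characters()
    print(f"Annual volume: {total:,} characters")
    for rate in (0.005, 0.015):  # assumed dollars per 1,000 characters
        print(f"  at ${rate}/1k: ${total / 1000 * rate:.2f} per year")
```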

At per-1,000-character pricing, the annual cost of TTS for the entire podcast intro/outro/sponsor production is somewhere between tens of cents and roughly ten dollars, depending on the exact rate. Compared to the cost of recording these in studio with a voice actor (a few hours of studio per year plus the actor's time, typically several hundred dollars per year for the same volume), the TTS path is dramatically cheaper. Compared to recording yourself (free, plus your time), TTS is more expensive but more consistent and faster to update.

The math here is not surprising. What surprises producers is how much of their production budget on traditional paths went to recording the smallest, most repetitive parts of the show, the parts that TTS handles cleanly. Putting those parts on TTS frees the production budget for the parts that genuinely need a human voice.

A worked example: an indie audiobook

A 10-hour novel is roughly 540,000 characters of narration. At per-1,000-character pricing in the typical OpenAI-compatible-provider range (around $0.005–$0.015 per 1,000 characters), the synthesis cost for the audiobook is roughly $2.70 to $8.10: single dollars, not tens or hundreds. ElevenLabs Flash and similar premium tiers run higher, sometimes 5–10x; even those land under $100 per 10-hour audiobook.

That is not the comparison that matters. The comparison that matters is the cost of the alternative production paths. A human-narrated indie audiobook of the same length runs in the low thousands of dollars at amateur-to-mid-tier rates, higher with experienced narrators, much higher with profile names. The synthesis cost is a small fraction of the human-narrator path.
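To put the two paths side by side, the sketch below combines figures already given in this article: 900 characters per minute, $0.005–$0.015 per 1,000 characters, a 5–10x premium-tier multiplier, and $100 to $400 per finished hour for a narrator. It covers narration fees and synthesis only; editing, mastering, and studio time are left out on both sides.

```python
CHARS_PER_MINUTE = 900                   # working English estimate
SYNTH_RATE_PER_1K = (0.005, 0.015)       # typical range from the text, USD
PREMIUM_MULTIPLIER = (5, 10)             # "sometimes 5-10x" for premium tiers
NARRATOR_PER_FINISHED_HOUR = (100, 400)  # USD, from the human-narrator section


def synthesis_range(hours: float, multiplier: tuple = (1, 1)) -> tuple:
    """Low and high synthesis cost in dollars for the given hours of audio."""
    thousands = hours * 60 * CHARS_PER_MINUTE / 1000
    low = thousands * SYNTH_RATE_PER_1K[0] * multiplier[0]
    high = thousands * SYNTH_RATE_PER_1K[1] * multiplier[1]
    return low, high


def narrator_range(hours: float) -> tuple:
    """Low and high narrator fee in dollars for the given finished hours."""
    return hours * NARRATOR_PER_FINISHED_HOUR[0], hours * NARRATOR_PER_FINISHED_HOUR[1]


if __name__ == "__main__":
    hours = 10
    for label, (low, high) in [
        ("Synthesis, typical", synthesis_range(hours)),
        ("Synthesis, premium", synthesis_range(hours, PREMIUM_MULTIPLIER)),
        ("Human narrator", narrator_range(hours)),
    ]:
        print(f"{label}: ${low:,.2f} to ${high:,.2f}")
```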

The trade-off is in distribution. The synthesis-narrated audiobook may not be approved for distribution on the largest audiobook retailer (per the platform-specific policies discussed elsewhere); the human-narrated audiobook does qualify and reaches a larger audience. Plan the production economics knowing the distribution constraints.

For most indie authors in 2026 with limited budgets and willingness to publish on alternative-distribution channels, the synthesis path is economically viable. For authors targeting Audible-equivalent audiences, the synthesis cost savings are not real if the resulting audiobook cannot reach that audience.

Figure: a two-row comparison. Row one shows the synthesis cost for a 10-hour audiobook (dollars); row two shows the human-narrator cost for the same audiobook (low thousands). Annotations highlight that the cost gap is real but the distribution channels differ: the comparison is not apples-to-apples without naming the channel constraint.

The summary number, briefly

For most small-team projects, TTS at per-credit pricing is the cheapest production path that produces shippable audio. The per-character pricing is small enough to ignore for one-shot work and predictable enough to budget for recurring or long-form work.

The actual cost levers (script length, regeneration discipline, master-and-encode workflow, voice cast quality) matter more than the per-character rate at any reasonable provider. Small teams that internalize those levers ship more audio for less money than teams that chase the lowest per-character rate without a production process behind it.

The pricing math is favorable. The production discipline is what turns favorable pricing into actually-cheaper-per-finished-asset.
