A five-minute casting worksheet for picking your TTS voice
A short worksheet that gets you to a shortlist of TTS voices in five minutes by asking five small questions about the script, the audience, and the medium. Designed for first-time users who do not want to scroll a sixty-voice catalog without a plan.
The TTS catalog has dozens of voices. Scrolling through previews until one feels right works, but it takes longer than it should and the result depends on which voice you happened to listen to first. Voice casting is decision work, and decision work goes better with a worksheet.
This is a five-question worksheet that gets you to a shortlist of two or three voices in about five minutes. Use it before you start scrolling, not after.
Question one: what language and accent does the script need
Pick first by language and accent, not by personality. The catalog is organized by language; voices that fit your script's personality but speak the wrong language are not options.
If the script is in American English, start with the American English bench. If it is in British English, start there. If it is in another language, start in that language's bench. Code-switched scripts (mostly one language with English terms) almost always belong in the dominant language's catalog with the script reformatted to handle the foreign-language terms gracefully.
Eliminate anything not in the right language before doing anything else. The catalog gets smaller fast.
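The elimination step can be sketched in a few lines of Python. This is a minimal illustration, not any provider's API: the `Voice` record, the locale tag format ("en-US", "en-GB"), and the sample catalog entries are all hypothetical stand-ins for whatever metadata your provider exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    """Hypothetical catalog entry; real providers expose different fields."""
    name: str
    locale: str   # assumed tag format, e.g. "en-US" or "en-GB"
    profile: str  # e.g. "long-form narrator"


def filter_by_locale(catalog, locale):
    """Keep only voices whose locale tag matches the script's language and accent."""
    wanted = locale.lower()
    return [v for v in catalog if v.locale.lower() == wanted]


# Made-up sample data for illustration only.
CATALOG = [
    Voice("Voice A", "en-US", "long-form narrator"),
    Voice("Voice B", "en-GB", "crisp announcer"),
    Voice("Voice C", "en-US", "conversational presenter"),
    Voice("Voice D", "de-DE", "warm conversational"),
]

american = filter_by_locale(CATALOG, "en-US")
print([v.name for v in american])
```

Everything after this step runs on the filtered list, which is why language comes first.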
Question two: how long is the audio
The duration of the audio shapes the voice you should pick.
Under thirty seconds. Almost any voice works, because the listener does not have time to grow tired of any voice's quirks. Pick by personality fit. This bucket includes intros, outros, ad reads, button taps, error messages, app announcements.
Thirty seconds to five minutes. The voice's pacing and resting tone start to matter. Voices with strong personality (theatrical, playful, conversational) work well; voices that are too crisp or too neutral can sound flat over a few minutes of continuous narration.
Five to thirty minutes. The voice's stamina matters. Some voices that sound great for a minute become tiring over ten because the resting tone is too perky or the pacing is too uniform. Pick from the long-form narrator profile (warm, steady, comfortable at standard speed) or the warm conversational profile.
Over thirty minutes. The narrator profile is essential. The voice's resting tone is what the listener hears most; if it is too perky or too clipped, listeners drop off. Test on a representative passage of at least three minutes before committing to a long-form project.
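The four duration buckets above amount to a small lookup. The sketch below encodes them as a function. The function name and the handling of exact boundaries (five minutes falls in the shorter bucket, thirty minutes in the five-to-thirty bucket) are assumptions, since the worksheet does not say which side a boundary belongs to.

```python
def guidance_for_duration(seconds: float) -> tuple[str, list[str]]:
    """Map audio length to the worksheet's bucket and the voice traits it suggests.

    Boundary handling is an assumption: exactly 5 minutes counts as the
    shorter bucket, exactly 30 minutes as the 5-to-30-minute bucket.
    """
    if seconds <= 0:
        raise ValueError("duration must be positive")
    if seconds < 30:
        return "under 30 s", ["any voice; pick by personality fit"]
    if seconds <= 5 * 60:
        return "30 s to 5 min", ["theatrical", "playful", "conversational"]
    if seconds <= 30 * 60:
        return "5 to 30 min", ["long-form narrator", "warm conversational"]
    return "over 30 min", ["long-form narrator"]


# A twelve-minute tutorial lands in the 5-to-30-minute bucket.
print(guidance_for_duration(12 * 60))
```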
Question three: what register does the audience expect
The audience has implicit expectations about what the voice should sound like. Match those expectations or have a reason for breaking them.
Marketing video for a B2B audience. Crisp, professional, confident. Lean toward conversational presenter or crisp announcer. Avoid character voices, very warm voices, or anything that sounds casual.
Tutorial or how-to. Conversational, friendly, slightly faster than default. Lean toward warm conversational or conversational presenter. The listener is following along and wants the voice to feel like a competent helper, not a formal speaker.
Audiobook draft or long-form narrator. Warm, steady, comfortable at sustained pace. Lean toward long-form narrator profile. Avoid anything that calls attention to itself; the voice should disappear into the content.
Brand voice for ongoing use. Distinctive enough to remember, neutral enough to read any script the brand might publish. This is a hard cast and worth the time. Pick from warm conversational or long-form narrator. Test on three different sample scripts (a marketing line, a tutorial paragraph, a brand-tone customer message) and pick the voice that holds up across all three.
Ad read or commercial. Energetic, confident, with a clear personality. Conversational presenter or crisp announcer for product copy; warm conversational for lifestyle or service copy.
Accessibility playback or screen-reader-style narration. Neutral, clear, even-paced. Crisp announcer or conversational presenter. Avoid character voices, warm-conversational voices that may add unwanted emotional inflection, and any voice with a strong stylistic signature.
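The register guidance above also reduces to a lookup table. The sketch below is one way to write it down. The dictionary keys are hypothetical labels chosen for this example, and the profile names are the ones the worksheet uses; your catalog may tag voices differently.

```python
# Use case -> profiles to lean toward, transcribed from the list above.
# The keys are illustrative labels, not tags from any real catalog.
REGISTER_PROFILES = {
    "b2b marketing": ["conversational presenter", "crisp announcer"],
    "tutorial": ["warm conversational", "conversational presenter"],
    "audiobook": ["long-form narrator"],
    "brand voice": ["warm conversational", "long-form narrator"],
    "ad read (product)": ["conversational presenter", "crisp announcer"],
    "ad read (lifestyle)": ["warm conversational"],
    "accessibility": ["crisp announcer", "conversational presenter"],
}


def profiles_for_register(use_case: str) -> list[str]:
    """Return the profiles suggested for a use case, ignoring case and padding."""
    key = use_case.strip().lower()
    if key not in REGISTER_PROFILES:
        raise KeyError(f"no register guidance for {use_case!r}")
    return REGISTER_PROFILES[key]


print(profiles_for_register("Tutorial"))
```

An unknown use case raises an error instead of returning a default, on the assumption that a silent fallback would hide a casting decision nobody actually made.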
Question four: what gender and timbre does the script need
This is two related questions, both worth thinking about explicitly.
Gender. The catalog tags voices as male or female. For most projects either works; pick what fits the brand or the script. What matters is consistency: in a series, either keep the same gender throughout or alternate across episodes on a deliberate pattern. Switching mid-series without a reason is jarring. For two-host shows, pick voices with clearly different timbres so listeners can tell the hosts apart without effort.
Timbre and pitch. Even within one gender, voices vary in pitch range, and two voices with the same gender tag can give the same script a very different feel. The metadata does not always tell you the pitch range, so listen to the previews specifically for pitch fit. Higher-pitched voices read younger and more energetic; lower-pitched voices read older and more authoritative. Pick what matches the script's intended energy.
Question five: what is the test passage
This is the only step in the worksheet that consumes credits, and it is not optional.
Pick a representative passage from your actual script, not the opening, not a tagline, not a one-line slogan. The passage should include the typical sentence length the script uses, any proper nouns or technical terms the project relies on, and a sample of the prosodic variety the script will throw at the voice (a question, a longer sentence, an emphatic statement).
A hundred-word passage is enough. Generate it in your two or three top candidates from questions 1–4. Listen to each, ideally on the playback device the audience will use (a phone speaker, a laptop, headphones, whatever fits the project).
Pick the one that holds up. Commit. Generate the rest of the project in that voice with confidence.
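Before spending credits, it can help to sanity-check the passage against the criteria above. The sketch below is a rough heuristic, not a real prosody analysis. The function name, the 80 to 150 word window around the hundred-word target, and the 20-word threshold for a "longer sentence" are all assumptions made for illustration. Emphatic statements are not checked, because punctuation alone is a poor signal for them.

```python
import re


def check_test_passage(passage, required_terms, min_words=80, max_words=150):
    """Return a list of problems; an empty list means the passage looks usable.

    Thresholds are illustrative assumptions, not values from any provider.
    """
    problems = []

    # Length: roughly a hundred words.
    word_count = len(passage.split())
    if not min_words <= word_count <= max_words:
        problems.append(f"word count {word_count} is outside {min_words}-{max_words}")

    # Split into sentences at whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]

    # Prosodic variety: at least one question and one longer sentence.
    if not any(s.endswith("?") for s in sentences):
        problems.append("no question in the passage")
    if not any(len(s.split()) >= 20 for s in sentences):
        problems.append("no sentence of 20 or more words")

    # Proper nouns and technical terms the project relies on.
    lowered = passage.lower()
    for term in required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing term: {term}")

    return problems


# A one-line slogan fails every check.
for problem in check_test_passage("Buy now.", ["Acme Sync"]):
    print(problem)
```

A passage that passes is still only a candidate; the listening test on the audience's playback device decides the cast.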
Common mistakes to avoid
A few patterns that produce regret.
Casting on a five-second preview. The preview is a shortlist tool, not a casting decision. Voices that sound great in five seconds can grate over five minutes. The full casting decision needs the test-passage step.
Casting on the script's opening. The opening is the easiest passage to read. The voice that sounds fine on the opening may stumble on the harder parts. Test on a passage with the script's actual difficulty.
Casting under the wrong playback conditions. Voices on studio monitors sound different from voices on phone speakers. Cast on the device the audience will use, or at least listen on it once before committing.
Casting in a hurry. Spending an extra fifteen minutes on the cast can save hours of regeneration if the first pick turns out to be wrong. The temptation to "just pick one and move on" is real, and it is usually the wrong call for projects longer than a few minutes.
Casting once and never re-evaluating. The catalog adds voices over time. The cast that was right last year may not be the best choice now. For long-running projects (a podcast in season three, a course library that gets new modules quarterly), re-evaluate the cast every few months.
When the worksheet returns nothing
Sometimes the answer to "which voice on this provider fits my script" is "none of them fits well." That is a real outcome. Catalogs are not equally deep in every language, and not every script has a perfect cast in any TTS catalog.
If the worksheet's shortlist of two or three voices all sound wrong on the test passage, the cause is usually one of:
- The script needs a register the catalog does not have (specialized character voice, strong regional accent, very old or very young voice).
- The script needs a language with thin coverage on your chosen provider.
- The script's expectations are pitched at human-narrator quality and no synthetic voice will hit it.
The remedy depends on the cause. If the catalog lacks the register, rewrite the script to fit the voices you have, or try another provider. If language coverage is thin, switch providers: ElevenLabs has broad language coverage, and specialized regional providers exist for Mandarin, Korean, Arabic, and other languages. If the script is pitched at human-narrator quality, hire a human narrator or revise expectations. Part of the worksheet's value is that it surfaces these failure modes early, before you have generated thirty minutes of audio in the wrong voice.
The five questions, in order
- Language and accent of the script.
- Duration of the audio.
- Register the audience expects.
- Gender and timbre fit.
- Test passage at production quality.
Five minutes. A shortlist of two or three voices. A confident cast that holds up across the project. The worksheet is small enough to remember, structured enough to give consistent results, and short enough to actually do every time.