Where xAI Text-to-Speech fits: cheap tags, short scripts, fast experiments

xAI Text-to-Speech ships five voices, twenty languages, and the cheapest per-character price in the AI text-to-speech catalog. The shape of work it wins on is short, prosody-controlled, single-speaker scripts. Here is when to reach for it and when to skip it.


xAI shipped its standalone Grok text-to-speech API in April 2026, alongside the speech-to-text counterpart, and made it available through partners shortly after. Inside the AI text-to-speech tool, it sits at the top of the model picker and almost no one tries it. The naming does not help: most users who have heard of xAI think of Grok the chatbot, not Grok the TTS. The price tag does not help either: at $4.20 per million characters, the model is cheap enough that some users assume "cheap" means "the budget option that is fine for prototypes and bad for production".

That assumption is wrong. xAI Text-to-Speech is genuinely good for the specific shape of work it is built for, and that shape is wider than the price tag suggests. This piece is the case for putting it in the rotation, and the honest read of where it falls short.

What it actually is

xAI Text-to-Speech ships with five named voices, each with a short personality direction baked in. The catalog: Eve (energetic, upbeat), Ara (warm, friendly), Rex (confident, clear), Sal (smooth, balanced), and Leo (authoritative, strong). The voices are the same ones that drive the speech responses in the Grok mobile app, the Tesla in-vehicle assistant, and Starlink customer support. The training data is the same, and the voices have been tested at scale.

The pricing on the underlying xAI API is $4.20 per million characters, which works out to $0.0042 per 1,000 characters. That makes it the cheapest model in the AI text-to-speech tool by a meaningful margin. The next-cheapest tier (Qwen3-TTS, Inworld Mini) is roughly four to six times more expensive per character. The most expensive tier (Eleven v3) is about forty times more.
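The arithmetic above is easy to keep on hand as a helper. This is a minimal sketch: the only number taken from the article is the $4.20 per million characters, the function name is made up for illustration, and it assumes every character in the input is billed (whether inline tags count toward the total is worth confirming against xAI's own pricing page).

```python
# Price quoted in this article for the underlying xAI API.
PRICE_PER_MILLION_CHARS_USD = 4.20


def tts_cost_usd(num_chars: int, generations: int = 1) -> float:
    """Estimated USD cost of `generations` runs of a script `num_chars` long.

    Assumes every input character is billed at the flat per-character rate.
    """
    if num_chars < 0 or generations < 0:
        raise ValueError("num_chars and generations must be non-negative")
    return num_chars * generations * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    # 1,000 characters, one generation.
    print(f"${tts_cost_usd(1_000):.4f}")
    # A hypothetical 450-character spot, rendered twenty times.
    print(f"${tts_cost_usd(450, generations=20):.4f}")
```

Under those assumptions the first line prints $0.0042, matching the per-1,000-character figure above, and twenty takes of a 450-character spot come to just under four cents.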

The character limit on the direct API is 15,000 characters per request. The AI text-to-speech tool exposes 8,000 characters per request in its registry, which is still ample for most short-form work. Output formats: MP3, WAV, PCM, μ-law, A-law. The μ-law and A-law options are the giveaway that xAI built this for telephony as well as media.
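If a script does run past the cap, it has to be split before it is submitted. The sketch below is one way to do that, not anything xAI or the tool provides: the two limits are the ones quoted above, the function name is hypothetical, and the sentence splitter is deliberately naive. It does not know about inline tags, so a wrapped span that crosses a chunk boundary would need to be closed and reopened by hand.

```python
import re

TOOL_CHAR_LIMIT = 8_000         # per-request cap in the tool, per this article
DIRECT_API_CHAR_LIMIT = 15_000  # per-request cap on the direct API, per this article


def split_for_tts(text: str, limit: int = TOOL_CHAR_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible. A single sentence
    longer than `limit` is hard-wrapped, which may cut through a word or tag.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any sentence that cannot fit in a chunk on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_for_tts("One. Two! Three? Four.", limit=10):
        print(repr(chunk))
```

Each chunk would then be generated separately and the audio joined afterwards, which is exactly the kind of stitching that makes long-form work awkward at this cap.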

The inline-tag system is where xAI shows its hand. Discrete inline tags inside square brackets handle short events: pauses, breath, laughter. Wrapping tags inside angle brackets style spans of text: whisper, slow, soft, emphasis. The two compose. You can pause for half a second, then deliver the next clause in whisper voice, then pause again, all within one input.
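The two tag styles can be sketched as a pair of small string builders. This is illustrative only: the helper names are invented, the tag names are limited to the ones mentioned in this article, and the closing-tag form follows the worked example later in the piece. Check the exact tag spellings against xAI's documentation before relying on them.

```python
# Wrapping tags named in this article; the real library may include more.
WRAPPING_TAGS = {"whisper", "slow", "soft", "emphasis"}


def pause() -> str:
    """Discrete tag: a short event, written in square brackets."""
    return "[pause]"


def wrap(tag: str, text: str) -> str:
    """Wrapping tag: styles a span of text, written in angle brackets."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


# Pause, deliver a clause in a whisper, pause again: all in one input.
line = " ".join([
    "Here is the thing.",
    pause(),
    wrap("whisper", "Nobody else has noticed yet."),
    pause(),
    "But they will.",
])

if __name__ == "__main__":
    print(line)
    # Here is the thing. [pause] <whisper>Nobody else has noticed yet.</whisper> [pause] But they will.
```

Building the string from parts like this keeps every wrapping tag closed by construction, which matters once a script carries more than a handful of them.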

Where xAI Text-to-Speech actually wins

The shape of project that wins on xAI:

Short-form ad reads, social-media voiceover, and burst content. A 30-second product spot, a 60-second YouTube intro, a Reels or TikTok narration, a podcast bumper. Anything that fits in 8,000 characters and benefits from inline prosody control. The five voices cover the most common ad-read profiles (warm narrator, confident announcer, friendly conversational), and the inline tags let you direct the read precisely on a few key beats.

Storytelling that wants pauses and emphasis without theatrics. A short fiction read, a personal essay narrated for a podcast, an audio postcard. xAI is not Eleven v3 for emotional acting, and it is not Dia for dialogue, but for narration that wants timing and stress placed deliberately, the wrapping tags give you control that is hard to achieve in a model that does not have them.

Fast experiments and iteration. The price means you can generate twenty variations of a script in different voices for a couple of cents. For brainstorming, casting tests, or rough cuts, the cost-per-iteration is low enough that the friction disappears. Most of the time, two or three iterations are all you need to find the voice that fits, and xAI's price makes those iterations basically free.

Telephony and IVR work. The μ-law and A-law output formats are not in every model in the catalog. For phone-tree systems, voice agents that cross the public phone network, or any project that needs to feed audio into telecom infrastructure, xAI is one of the few models that produces the right format directly without a transcode step.

Multilingual short copy in well-supported languages. xAI's 20-language list covers most of the dominant global languages. For a short ad in Spanish, French, German, Japanese, Korean, or Mandarin, xAI generates competent output cheaply. It is not the leader on any single language, but for short scripts in well-supported languages, the price-per-quality ratio is hard to beat.

The framing: when the script is short, the budget matters, and you want prosody control more than emotional theatrics, xAI is the right pick.

[Figure: two-column fit-card, "When xAI Text-to-Speech wins". Left column: short ad reads, storytelling with pauses, fast experiments, telephony output formats. Right column: not for long-form narration, not for emotional acting, not for voice cloning.]

Where xAI Text-to-Speech does not win

The honest read of the boundaries:

Long-form narration. The 8,000-character cap on this tool is fine for short copy and uncomfortable for long-form. At a typical narration pace of around 150 words per minute, a 30-minute audiobook chapter runs to roughly 27,000 characters, which means at least four separate requests stitched together. That is not what xAI is built for, even ignoring the cap. The voice library is small (five voices) compared to the alternatives, and the model does not produce the same kind of sustained narrator presence that Eleven Multilingual v2 or Inworld 1.5 Max give you.

Emotional voice acting. The inline-tag system is built for prosody (timing, emphasis, register), not for emotional state. The discrete tags can drop in a short event such as a breath or a laugh, but nothing in the tag library directs a sustained emotional performance. If your script depends on the model producing a sob or a sarcastic line read, you want Eleven v3.

Multi-speaker dialogue. xAI is a single-speaker model. The five voices are picked individually; there is no convention for marking speaker changes mid-input. For dialogue, reach for Dia.

Voice cloning. Not supported. If your project needs a clone of a specific voice, the cloning models in the tool (Qwen3-TTS Base, MiniMax Speech 2.8) are the only paths.

Highest-quality multilingual broadcast. xAI is good at the major languages. For broadcast-grade Mandarin, MiniMax Speech 2.8 HD wins. For 70-language coverage, Eleven v3 wins. For dialect-aware Chinese, Qwen3 wins. xAI's strength on multilingual work is "competent and cheap", not "best in class".

The pattern: xAI is the model for short, prosody-controlled, single-speaker work in major languages, at the lowest price in the catalog. It is not the model for everything else.

A worked example: a 200-word product spot

To make the case concrete, take a 200-word product spot. Standard ad-read content, conversational tone, one beat where the script lands a key product claim, one closing line.

In xAI Text-to-Speech, written with inline tags:

Welcome back. [pause] Today we are talking about something that, honestly, has changed how I work. <emphasis>The new noise-cancelling headphones from Acme.</emphasis> [pause] They cancel ambient sound <slow>better than anything I have tested.</slow> And at this price, that is genuinely surprising.

Generated with the Sal voice (smooth, balanced) at default settings, the output lands the brand mention with emphasis, slows the technical claim for clarity, and uses the pauses to let the listener absorb each beat. Total cost for that 200-word generation: under one cent. Total time from opening the tool to finished audio: under three minutes, including the script tweaks.
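Because the tags are hand-typed, a quick lint before generating saves a wasted take. The following is a rough sketch, not part of any xAI tooling: the function names are hypothetical, and the regular expressions assume tags are lowercase words in square or angle brackets, as in the script above.

```python
import re

WRAP_TAG = re.compile(r"<(/?)([a-z-]+)>")


def check_wrapping_tags(script: str) -> list[str]:
    """Return problems found with angle-bracket tags; an empty list means balanced."""
    problems: list[str] = []
    stack: list[str] = []
    for match in WRAP_TAG.finditer(script):
        is_closing, name = match.group(1) == "/", match.group(2)
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems


def spoken_text(script: str) -> str:
    """Strip both tag styles, leaving only the words the listener hears."""
    without_tags = re.sub(r"\[[a-z-]+\]|</?[a-z-]+>", "", script)
    return re.sub(r"\s+", " ", without_tags).strip()


SCRIPT = (
    "Welcome back. [pause] Today we are talking about something that, honestly, "
    "has changed how I work. <emphasis>The new noise-cancelling headphones from "
    "Acme.</emphasis> [pause] They cancel ambient sound <slow>better than anything "
    "I have tested.</slow> And at this price, that is genuinely surprising."
)

if __name__ == "__main__":
    print(check_wrapping_tags(SCRIPT))      # empty list when every tag is closed
    print(len(spoken_text(SCRIPT).split()), "spoken words")
```

The second helper is useful for timing a read, since the tags add characters to the input without adding words to the audio.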

The same script in Eleven v3 with audio tags would produce a more emotionally inflected read at roughly forty times the cost. The same script in Eleven Multilingual v2 would produce a flatter, more neutral read at roughly thirty-five times the cost. The same script in Inworld 1.5 Max would produce a more naturalistic read at roughly six times the cost. For some projects (a high-budget brand ad, an audiobook chapter, a flagship podcast intro) those alternatives are worth the upgrade. For most short ad-read work, xAI produces output that is good enough at a price that lets you iterate without thinking about it.
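To put rough numbers on that comparison, here is a back-of-envelope sketch. The xAI price and the multipliers are the approximate figures quoted in this article, not published rates for the other models, and the six-characters-per-word figure is an assumption about typical English copy.

```python
XAI_PRICE_PER_MILLION_CHARS_USD = 4.20  # quoted in this article

# Approximate cost relative to xAI, as stated in this article.
COST_MULTIPLIER = {
    "xAI Text-to-Speech": 1,
    "Inworld 1.5 Max": 6,
    "Eleven Multilingual v2": 35,
    "Eleven v3": 40,
}

ASSUMED_CHARS_PER_WORD = 6  # assumption: about five letters plus a space


def spot_cost_usd(words: int, model: str, takes: int = 1) -> float:
    """Estimated USD cost of `takes` generations of a `words`-long script."""
    chars = words * ASSUMED_CHARS_PER_WORD
    xai_cost = chars * XAI_PRICE_PER_MILLION_CHARS_USD / 1_000_000
    return xai_cost * COST_MULTIPLIER[model] * takes


if __name__ == "__main__":
    for model in COST_MULTIPLIER:
        cost = spot_cost_usd(200, model, takes=3)
        print(f"{model:<24} ${cost:.4f} for 3 takes")
```

Under these assumptions a single 200-word take is about half a cent on xAI and about twenty cents on Eleven v3. Neither number is large, but the gap compounds quickly across casting tests and revisions.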

Why it deserves a slot in the rotation

The argument for keeping xAI Text-to-Speech in your default rotation, even if it is not your primary model:

The price disappears as a constraint. When iteration is close to free, the friction between "draft" and "ship" goes down. Concept tests, voice casting tests, and quick variations all become feasible in a way they are not on the more expensive models.

The inline tags compose more cleanly than most. The combination of bracket-style discrete events and angle-bracket-style wrapping spans is more flexible than either system alone. For prosody-driven content, this is the easiest catalog model to direct precisely.

The five voices cover the most common job profiles. Warm narrator, friendly conversational, confident announcer, smooth presenter, authoritative speaker: that is the spine of most short-form voice work. If you are not casting for a deep-bench library, the five voices are enough.

The telephony output formats matter for projects that need them and are missing on most alternatives.

xAI is not the right model for everything. It is the right model for more things than the price suggests, and most multi-model rotations underweight it. That is the gap this article is trying to close.

What I do

When a script lands on my desk, the first question is "what shape of work is this?" If the answer is "short, single-speaker, major language, prosody-driven", xAI is the first model I reach for. Generate, listen, iterate twice if needed, ship. Time from script to finished audio: usually under five minutes.

If the script is anything outside that shape (long-form, emotional, multi-speaker, dialect, broadcast), xAI is not the answer and I move on to the right model in the catalog. The five-voice limit and the price tag both stop being the relevant variables.
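That triage can be written down as a checklist. The sketch below is only a way of making the rule of thumb explicit: the class, field names, and function are invented for illustration, and the one hard number is the 8,000-character cap mentioned earlier.

```python
from dataclasses import dataclass

TOOL_CHAR_LIMIT = 8_000  # per-request cap in the tool, per this article


@dataclass
class Script:
    """Hypothetical description of a job; field names are illustrative."""
    chars: int
    speakers: int = 1
    major_language: bool = True
    needs_emotional_acting: bool = False
    needs_voice_clone: bool = False


def fits_xai(script: Script) -> tuple[bool, list[str]]:
    """Return (fits, reasons). `reasons` lists why the script falls outside the shape."""
    reasons: list[str] = []
    if script.chars > TOOL_CHAR_LIMIT:
        reasons.append("long-form: over the 8,000-character cap")
    if script.speakers > 1:
        reasons.append("multi-speaker dialogue")
    if not script.major_language:
        reasons.append("language or dialect outside the well-supported set")
    if script.needs_emotional_acting:
        reasons.append("emotional voice acting")
    if script.needs_voice_clone:
        reasons.append("voice cloning")
    return (not reasons, reasons)


if __name__ == "__main__":
    print(fits_xai(Script(chars=1_200)))
    print(fits_xai(Script(chars=30_000, speakers=2)))
```

An empty reasons list means xAI is the first model to try. Anything in the list points to one of the alternatives named earlier.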

The takeaway is small but useful: do not skip the cheapest model in the catalog. For a specific shape of work, it is the right pick, and skipping it means paying several times more per character for reads that did not need the upgrade.
