Audio tags across Eleven v3, xAI Text-to-Speech, and Dia 1.6B: three syntaxes, three results

Three of the AI text-to-speech models on Z.Tools ship with their own inline tag syntax: xAI uses square-bracket tags plus angle-bracket spans, Eleven v3 uses square-bracket emotion tags, and Dia uses speaker tags plus parenthetical non-verbals. Here is when each one wins.


Three of the eleven models in the AI text-to-speech tool let you do more than read text. Eleven v3, xAI Text-to-Speech, and Dia 1.6B each ship with their own inline tag system: special markers you mix into the text that the model interprets as instructions for how to deliver it. Whisper this, laugh here, change speakers, slow that segment down, swap to a soft voice for the next sentence.

The reason this article exists is that the three syntaxes do not interchange. Square-bracket emotion tags from one model fall through silently in another. Parenthetical non-verbal cues from a third get read aloud as literal text by a model that does not expect them. The official tag lists are buried in different docs, scattered across help centers and GitHub READMEs, and most third-party tutorials lean on one model and ignore the others. This piece is the working reference: what each system supports, what each one is good at, and what to write when you want a sigh.

Why three syntaxes exist

The simple answer is that each model was developed independently by a different team and they each made a syntax choice without coordinating. The deeper answer is that each syntax reflects what the model was actually built for.

Eleven v3 was built for emotional voice acting and audiobook drama, so its tag system is heavy on emotional states (laughs, whispers, sighs) and short sound effects (gunshot, applause). The square-bracket tags are inline metadata the model reads as performance directions.

xAI Text-to-Speech was built for storytelling and inline prosody control, so its tag system mixes square-bracket short tags with angle-bracket wrapping tags that style longer spans of text. You can pause for half a second and then wrap a clause in whisper voice, all within one input.

Dia 1.6B was built for multi-speaker dialogue with realistic non-verbal cues, so its tag system is split: square brackets [S1] and [S2] mark speaker turns, and parentheses around words like (laughs) and (coughs) mark non-verbal events that produce actual audio rather than spoken text.

The three teams ended up at three different conventions. The result for a writer is that the syntax that works in one model produces silence (or the wrong thing) in another.

xAI Text-to-Speech: inline brackets plus wrapping spans

xAI's text-to-speech API uses two layers of markup. Short inline tags inside square brackets handle discrete events at a position in the text. Wrapping tags inside angle brackets style a span of text. They compose: you can pause, then wrap the next clause in a whisper, then pause again, all within a single input.

Documented inline tags include [pause], [long-pause], [laugh], [sigh], and [breath]. Each tag is consumed at the position where it appears, with a duration determined by the model. If you want explicit timing, the docs also list <pause time="600ms"/> for sub-second control of pause length.

Documented wrapping tags include <whisper>...</whisper> for hushed delivery, <slow>...</slow> for slowed pace, <soft>...</soft> for reduced volume and intensity, and <emphasis>...</emphasis> for stressed delivery. The wrapping tags work at clause and sentence scale: you tag a span, the model styles the whole span in that delivery mode.
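Because the two layers use different delimiters, a pre-flight check is easy to write before a script goes near the model. Below is a minimal Python sketch, standard library only, that inventories the inline bracket tags and confirms the wrapping spans are balanced; the tag sets simply copy the documented lists above and are not exhaustive.

```python
import re

# Documented xAI inline and wrapping tags from this section; extend if the docs add more.
INLINE_TAGS = {"pause", "long-pause", "laugh", "sigh", "breath"}
WRAP_TAGS = {"whisper", "slow", "soft", "emphasis"}

def check_xai_markup(script: str) -> list[str]:
    """Return warnings about unknown inline tags or unbalanced wrapping spans."""
    warnings = []

    # Inline tags like [pause] are consumed at the position where they appear.
    for tag in re.findall(r"\[([a-z-]+)\]", script):
        if tag not in INLINE_TAGS:
            warnings.append(f"unknown inline tag: [{tag}]")

    # Wrapping tags like <slow>...</slow> must open and close in pairs.
    # Self-closing markers such as <pause time="600ms"/> are not counted here.
    for tag in WRAP_TAGS:
        opens = len(re.findall(f"<{tag}>", script))
        closes = len(re.findall(f"</{tag}>", script))
        if opens != closes:
            warnings.append(f"unbalanced <{tag}> span: {opens} open, {closes} close")

    return warnings

print(check_xai_markup(
    'Welcome back. <pause time="600ms"/> <whisper>Stay close,</whisper> he said. [breath]'
))  # -> []
```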

Working example, the kind of input that produces useful audio in xAI:

Welcome to the observatory. <pause time="600ms"/> The comet streaks across the sky like a silver flame, <emphasis>brilliant</emphasis> and brief.

Read it as a script and the meaning is clear. Generate it through the model and the result is a narrator delivering the observatory line, holding silence for 600 milliseconds, then continuing in a slightly emphasized read on the word "brilliant". Mix in five named voices (Eve, Ara, Rex, Sal, Leo), 20 supported languages, and the cheapest per-character price in the catalog at around $0.0042 per 1,000 characters, and xAI is the model I reach for when a short script needs prosody control without ceremony.

What it does not do: emotional acting the way Eleven v3 does, or multi-speaker dialogue the way Dia does. The tag set is small, deliberately. The model is for inline prosody, not for theater.

Eleven v3: square-bracket emotional acting

Eleven v3 launched as alpha in mid-2025 and reached general availability in February of this year. The signature feature, beyond a 70-language model and a measurable accuracy improvement on complex text, is the audio-tag system: short square-bracket tags that direct the model to perform an emotion or produce a discrete sound.

The documented tag categories cover three areas. First, vocal delivery and emotional state: [laughs], [laughs harder], [starts laughing], [whispers], [sighs], [exhales], [sarcastic], [curious], [excited], [crying], [snorts], [mischievously]. Second, environmental sounds: [gunshot], [applause], [clapping], [explosion], [swallows], [gulps]. Third, experimental tags that ElevenLabs marks as variable in quality: [strong X accent] (replace X with the desired accent), [sings], and a few others the docs warn about.
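The categories matter in practice, because the experimental tags behave less predictably than the core delivery tags. Here is a small sketch that buckets the tags found in a script into the categories above; the sets copy the lists in this section and are not the full documented catalog.

```python
import re

DELIVERY = {"laughs", "laughs harder", "starts laughing", "whispers", "sighs",
            "exhales", "sarcastic", "curious", "excited", "crying", "snorts",
            "mischievously"}
ENVIRONMENT = {"gunshot", "applause", "clapping", "explosion", "swallows", "gulps"}
EXPERIMENTAL_PATTERNS = [r"strong .+ accent", r"sings"]  # marked variable-quality in the docs

def classify_tags(script: str) -> dict[str, list[str]]:
    """Bucket each bracketed tag into delivery / environment / experimental / unknown."""
    buckets = {"delivery": [], "environment": [], "experimental": [], "unknown": []}
    for tag in re.findall(r"\[([^\]]+)\]", script):
        if tag in DELIVERY:
            buckets["delivery"].append(tag)
        elif tag in ENVIRONMENT:
            buckets["environment"].append(tag)
        elif any(re.fullmatch(p, tag) for p in EXPERIMENTAL_PATTERNS):
            buckets["experimental"].append(tag)
        else:
            buckets["unknown"].append(tag)
    return buckets

print(classify_tags("[whispers] Quiet now. [gunshot] [strong French accent] Bonjour."))
# {'delivery': ['whispers'], 'environment': ['gunshot'],
#  'experimental': ['strong French accent'], 'unknown': []}
```

Anything that lands in the unknown bucket is worth a test generation before it goes into a long script.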

Working example for an audiobook scene:

"I should have stayed in the car," she muttered, [sighs] glancing at the dim hallway. "But here we are."

Generated through Eleven v3, the tag produces an audible sigh between the muttered line and the resigned follow-up, with the surrounding prose delivered in the voice's resting register. The same input through xAI produces the words "she muttered, sighs, glancing" because [sighs] is not in xAI's documented tag set (xAI lists [sigh]), so the bracketed word is treated as text rather than a cue.

Two important caveats from the docs and from real use. First, tag effectiveness depends heavily on the voice the model is using. Eleven's documentation notes explicitly that some tags work well with certain voices and not others, and the experimental tags are marked as such because their behavior is voice-dependent. Test the tag with the voice you intend to ship before assuming it works. Second, the per-request character cap on Eleven v3 in the AI text-to-speech tool is 3,000 characters (the underlying ElevenLabs API allows 5,000), so long scripts with frequent tags run out of space faster than they would on the multilingual sibling.
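The cap counts the tags themselves, so a heavily tagged script hits 3,000 characters sooner than its word count suggests. A minimal splitter sketch, assuming the caps quoted above and breaking only at sentence boundaries:

```python
import re

ELEVEN_V3_TOOL_CAP = 3_000   # per-request cap on this tool
ELEVEN_V3_API_CAP = 5_000    # cap on the underlying ElevenLabs API

def split_for_cap(script: str, cap: int = ELEVEN_V3_TOOL_CAP) -> list[str]:
    """Split a script into chunks under the cap, breaking at sentence ends.

    Tags count toward the cap, so tag-dense scripts produce more chunks.
    A single sentence longer than the cap passes through as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > cap:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```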

The model that wins on emotional voice acting also has the strictest length budget. Plan accordingly.

Dia 1.6B: speaker tags plus parenthetical non-verbals

Dia is the third syntax, and the most distinct of the three. The model was released by Nari Labs in April 2025 under the Apache 2.0 license and is designed specifically for dialogue. Its tag system reflects that focus.

Speaker tags are square-bracketed identifiers: [S1], [S2], [S3], and so on. The convention from Nari Labs's docs is to always begin the input with [S1], alternate consistently between speakers, avoid two consecutive tags from the same speaker, and close the input with the tag of the speaker who does not deliver the final line (the docs note this last step improves audio quality at the end of the clip).

Non-verbal tags are parenthesized and produce actual audio events when generated. The documented set is long: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles). The Nari Labs README warns that the recognized list is broader than the list above but tag effectiveness varies, and some tags can produce unexpected output. The practical floor is that the documented tags work; anything outside the list is at your own risk.
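The conventions are mechanical enough to check before generating. Here is a sketch that verifies the speaker-tag rules described above and flags parenthetical cues outside the documented list; both lists are copied from this section.

```python
import re

NON_VERBALS = {
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing", "sings",
    "mumbles", "beep", "groans", "sniffs", "claps", "screams", "inhales",
    "exhales", "applause", "burps", "humming", "sneezes", "chuckle", "whistles",
}

def check_dia_script(script: str) -> list[str]:
    """Check a Dia input against the conventions described above."""
    problems = []

    speakers = re.findall(r"\[(S\d+)\]", script)
    if not speakers or not script.lstrip().startswith("[S1]"):
        problems.append("input should begin with [S1]")
    # No two consecutive turns from the same speaker.
    for a, b in zip(speakers, speakers[1:]):
        if a == b:
            problems.append(f"consecutive same-speaker tags: [{a}] [{b}]")

    # Parenthetical cues outside the documented list are at your own risk.
    for cue in re.findall(r"\(([^)]+)\)", script):
        if cue.lower() not in NON_VERBALS:
            problems.append(f"undocumented non-verbal: ({cue})")

    return problems

print(check_dia_script(
    "[S1] Did you actually outrun three drones? (laughs) "
    "[S2] They outran us, they just got bored. (chuckle)"
))  # -> []
```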

Working example for a podcast scene:

[S1] Did you actually outrun three drones? (laughs) [S2] They outran us, they just got bored. (chuckle)

What Dia does that Eleven v3 and xAI do not is render (laughs) and (chuckle) as audible laughter rather than read-aloud text or a flat tag interpretation. In a side-by-side test on a script ending with (laughs), Dia produces actual laughter audio while Eleven v3 produces what listeners describe as a textual substitution. This is the single feature that justifies reaching for Dia over the alternatives when your script depends on real non-verbal sound.
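Because Dia is open weights, the same script can also be run locally. The sketch below follows the usage pattern shown in the Nari Labs README at release; treat the import path, the checkpoint name, and the generate call as assumptions to verify against the current repo before relying on them.

```python
# Sketch only: names follow the Nari Labs README and may have changed since release.
import soundfile as sf
from dia.model import Dia  # assumed import path from the nari-labs/dia repo

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # assumed checkpoint id

script = (
    "[S1] Did you actually outrun three drones? (laughs) "
    "[S2] They outran us, they just got bored. (chuckle)"
)

audio = model.generate(script)        # returns a waveform array
sf.write("drones.mp3", audio, 44100)  # 44.1 kHz, matching the README example
```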

The constraints: English only, a 3,000-character cap on this tool, and a sweet spot of two-to-three-person dialogue scenes. Long monologue is not Dia's job. Multilingual work is not Dia's job. Sound-effect-heavy storytelling is.

A side-by-side worked example

Imagine the same micro-scene rendered through all three syntaxes. The line is:

She paused before the answer, then admitted what she had been hiding for two years.

In xAI Text-to-Speech, you would write:

She paused [pause] before the answer, <slow>then admitted</slow> what she had been hiding for two years.

The pause is consumed at its position; the admission is delivered at a slowed pace, audibly different from the surrounding narration.

In Eleven v3, you would write:

She paused before the answer, [sighs] then admitted what she had been hiding for two years.

The sigh is rendered as audible breath; the surrounding narration is delivered in the voice's default register, with [sighs] providing the emotional beat.

In Dia 1.6B, you would write:

[S1] She paused before the answer, (sighs) then admitted what she had been hiding for two years.

Dia treats the line as a single speaker ([S1]) with a parenthetical non-verbal event. The sigh is audible. The line is delivered in Dia's default speaker-1 voice, which the model picks based on the input pattern.

Three different syntaxes, three different but acceptable results. Mixing them across models produces dropped tags at best and tags read aloud as text at worst.

Practical guidance: when to reach for which

A short version that holds up:

Reach for xAI Text-to-Speech when your script is short, you want inline prosody control more than emotional performance, and you care about cost. The five voices are competent, the inline tags compose well, and the per-character price is the lowest in the catalog. Good for ads, short-form video, social-media voiceover, and any narration where you want timing and emphasis without theatrics.

Reach for Eleven v3 when your script needs emotional range, one voice performing multiple characters, or the ability to produce specific non-verbal sounds woven through narration. The audio tag library is the largest, the voice quality is studio-grade, and the model handles the surrounding prose with measurable improvement over its predecessor on complex text. Good for audiobooks, advertisements with character voice acting, e-learning narrators with emotional warmth, and short-form fiction.

Reach for Dia 1.6B when your script is multi-speaker English dialogue with non-verbal cues that have to land as real audio. The speaker-tag convention is rigid (alternate, never repeat), the non-verbal list is the most expressive in the catalog, and the model produces actual laughter where competitors produce a text-like approximation. Good for podcast-style scenes, comedy bits, scripted dialogue trailers, and any English-language content where the silence and sound between words are the point.

Do not mix syntaxes. A single input written for xAI does not work in Eleven v3, and a Dia input passed to either of the others produces broken output. If you are A/B testing the same script across models, write three versions of the script.
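A cheap guard against the copy-paste mistake is to guess which syntax a script was written in before submitting it. A heuristic sketch keyed on the distinguishing markers described in this article, not a parser:

```python
import re

def guess_target_model(script: str) -> str:
    """Heuristic: which of the three tag syntaxes does this script use?"""
    if re.search(r"\[S\d+\]", script):
        return "Dia 1.6B"            # speaker turns are Dia's signature
    if re.search(r"</?(whisper|slow|soft|emphasis)>|<pause\b", script):
        return "xAI Text-to-Speech"  # angle-bracket spans exist only in xAI
    if re.search(r"\[[a-z][^\]]*\]", script):
        return "Eleven v3 or xAI"    # bare bracket tags: check the tag vocabulary
    return "plain text"

print(guess_target_model("[S1] Hello. [S2] Hi. (laughs)"))  # Dia 1.6B
print(guess_target_model("Wait. <slow>Listen.</slow>"))     # xAI Text-to-Speech
print(guess_target_model("Wait. [whispers] Listen."))       # Eleven v3 or xAI
```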

What I take from working with all three

The three syntaxes feel like a missed opportunity for an industry standard, but the lack of a standard is also why each model can produce something the others cannot. xAI's wrapping tags are not in Eleven v3 because Eleven v3's emotional model is built around discrete bracketed events rather than span-styled delivery. Dia's parenthetical non-verbals are not in xAI because xAI's model is built for prosody control, not non-verbal generation. The syntax difference is a tell about the model architecture difference.

For working writers and producers, this means three things. Keep a reference card open. Write your inputs once per model rather than copy-pasting between them. Test the specific tag with the specific voice before assuming it works.

Three small disciplines beat one frustrated regeneration after another.
