Dia 1.6B and the case for dialogue-first text-to-speech
Dia 1.6B from Nari Labs is the only dialogue-first text-to-speech model in the AI text-to-speech tool's catalog. The architectural difference shows up most in non-verbal cues: real laughter and coughs as audio events, not as read-aloud text. Here is when it wins, and when it does not.
If you sit down with the eleven AI text-to-speech models in this tool and run the same script through all of them, ten will produce a narrator. One produces a scene. Dia 1.6B from Nari Labs is the dialogue-first model in the catalog, and the shape of what it generates is different from everything around it. Two voices alternating across a conversation. Real laughter when you ask for it instead of the word "haha". Coughs and sighs and gasps that show up as audio events instead of as text the model awkwardly says aloud. The output sounds like a recording of two people, not a recording of one person reading a script that has two characters.
This piece is the case for reaching for Dia when your project needs that, and an honest reading of where it falls short for everything else.
What makes Dia different from a narrator-style TTS
The standard architectural choice in most modern text-to-speech is: train one or many voices, give the user a way to pick which voice they want, and read the text in that voice. Multi-speaker support, when it exists, is bolted on through tag conventions or speaker IDs.
Dia inverts that. The model is dialogue-native. The training signal was multi-speaker conversation, not single-voice narration. The speaker tags [S1], [S2], and so on are first-class citizens of the input format, not metadata layered on top of a narrator. Pass it a script with two speakers and the model produces two voices that sound related to one another (like a real podcast hosting pair) rather than two arbitrary samples from the same voice library glued together.
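A minimal script in that format looks like this (the [S1]/[S2] tag convention is Dia's documented input format; the dialogue itself is invented for illustration):

```
[S1] Did you catch the launch stream last night? (laughs)
[S2] I did. I was not ready for that landing. (gasps)
[S1] Nobody was.
```

One generation produces two distinct voices from that single block of text.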
The other architectural choice that distinguishes Dia is its handling of non-verbal cues. In Dia's training data, parenthetical events like laughter and coughing were paired with actual audio of those events, not with the spoken text. The model learns that (laughs) is a cue to produce laughter audio. Most narrator-style models, by contrast, were trained on text that described those events, so when you write (laughs) they tend to read it aloud or substitute a flat audio approximation.
Two architectural choices, one big behavioral difference. Anything that sounds like a recording of two real people having a conversation has a much better chance of working in Dia than in any of its narrator siblings.
The model itself is a 1.6-billion-parameter open-source release under Apache 2.0 license, shipped by Nari Labs (a small team founded by two South Korean undergraduates) in April 2025. The full model needs roughly 10 GB of VRAM to run if you self-host. In the AI text-to-speech tool it runs on managed infrastructure at around $0.015 per 1,000 characters with a 3,000-character cap per request.
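At that price, the math for a scene is trivial. A quick sketch using the figures above (the helper is mine, not part of the tool):

```python
RATE_USD_PER_1K_CHARS = 0.015  # managed-infrastructure rate quoted above
MAX_CHARS_PER_REQUEST = 3_000  # per-request cap in the tool

def estimate_cost(script: str) -> float:
    """Rough cost of one generation at the managed rate."""
    if len(script) > MAX_CHARS_PER_REQUEST:
        raise ValueError("script exceeds the 3,000-character cap; split into scenes")
    return len(script) / 1_000 * RATE_USD_PER_1K_CHARS

print(f"${estimate_cost('x' * 3_000):.3f}")  # a max-length scene: $0.045
```

A full-length scene costs under a nickel, which is what makes the iteration-heavy workflows later in this piece viable.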
The "actually laughs" thing
The most concrete way to see the difference between Dia and a narrator model is to write a script that ends in a non-verbal beat and run it through both.
Take a line that ends:

```
Did you really just outrun three drones?
(laughs)
```
Run that through Eleven v3 with its emotional voice library, and the output is impressive but the closing beat is approximated. Listeners describe it as the voice doing a quick exhale-with-amusement that sounds like the model trying to render the spirit of laughter without ever crossing into actual laughter.
Run the same line through Dia, and the closing beat is laughter. Real, audible, slightly imperfect-in-a-human-way laughter. The line ends, the speaker laughs for half a second, the audio fades out.
This is not a small distinction. For a comedy podcast, an audio sketch, or any project where the audience listens for the ad-libbed reaction, the difference between "rendered approximation" and "actual laughter" is the difference between sounding like an over-engineered narrator and sounding like a real recording. The Nari Labs team has made the comparison explicit on their own demo page, pitting Dia against ElevenLabs Studio and Sesame's open CSM-1B model, and the gap on non-verbal beats is the most consistently visible delta.
The same holds for the rest of the documented Dia non-verbal vocabulary: (sighs), (coughs), (gasps), (clears throat), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (humming), (chuckle), (whistles). The reliability varies by tag (the model card warns that some tags produce unexpected output), but the documented set covers most of what a podcast script needs.
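If you want to catch a cue that is not in that vocabulary before you burn a generation on it, a simple scan works. A sketch, not an official validator; the tag set below is just the list quoted above:

```python
import re

# The documented tags listed above (Dia's full vocabulary may be larger).
DOCUMENTED_TAGS = {
    "laughs", "sighs", "coughs", "gasps", "clears throat", "groans",
    "sniffs", "claps", "screams", "inhales", "exhales", "applause",
    "humming", "chuckle", "whistles",
}

def undocumented_cues(script: str) -> list[str]:
    """Return parenthetical cues that are not in the documented tag set."""
    cues = re.findall(r"\(([^)]+)\)", script)
    return [c for c in cues if c.strip().lower() not in DOCUMENTED_TAGS]

print(undocumented_cues("[S1] Nice one. (laughs) [S2] Stop. (giggles)"))
# -> ['giggles']
```

Anything the scan flags is a candidate for rephrasing into a documented tag before you generate.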
When Dia is the right answer
The model is genuinely good for a specific shape of project, and if your project fits that shape, nothing else in the catalog matches it. The shapes:
- Podcast scenes with two or three hosts. Especially conversational shows where the laughter, the back-and-forth interjections, and the timing between speakers carry the scene. A script written as [S1] and [S2] exchanges with parenthetical reactions reads like a transcript and generates like a recording.
- Audio drama and radio plays. Short scenes where the dialogue is the medium. Dia's character cap (3,000 characters per request) is fine for most scene lengths, and the model generates both speakers with consistent voice identity across a single generation.
- Comedy bits and sketches. The non-verbal cues are the joke a lot of the time. A model that can render (coughs) as actual coughing makes a much better foil for written comedy than a model that has to write around it.
- Audiobook dialogue scenes. Not whole audiobooks (the character cap is too low and the model is English-only) but the dialogue-heavy sections that an Eleven Multilingual v2 narrator handles awkwardly. Hybrid workflows, where Dia generates the conversation scenes and Eleven generates the surrounding narration, produce better-sounding audiobooks than either model alone.
- Animation and game prototyping. Cheap, fast, multi-character voice generation for storyboards, animatics, and early-iteration scene work. The output is good enough for a prototype, the cost is negligible at iteration scale, and the speaker tags translate naturally to the speaker structure of any animation script.
The pattern: any project where two or more characters are talking, where the silence and reactions between lines carry the scene, and where the script lives in English. If your project is any of those, Dia is not just a sibling in the catalog. It is the only model that does the job.
When Dia is the wrong answer
The boundaries are also clear, and Dia is not the right pick for most of what people use a TTS for.
Long-form narration in a single voice. The 3,000-character cap and the dialogue-first training make Dia genuinely bad for sustained narrator work. A 90-minute audiobook, a 30-minute course module, a 20-minute corporate explainer: those are jobs for Eleven Multilingual v2 or Inworld 1.5 Max, not Dia.
Anything in a language other than English. This is the most common reason a project that would otherwise fit Dia ends up using a different model. The Nari Labs team has not announced multi-language support and has been candid that English-only is the current limit. If your script is in Mandarin, Spanish, French, Hindi, Japanese, or any other language, Dia is out of the running before you start.
Real-time voice agents. Dia is not a low-latency model. The training, the architecture, and the use case are all built around scene generation, not turn-by-turn conversation. For chatbots and voice agents, reach for Eleven Flash or Inworld Mini.
Voice cloning of a specific person. Dia supports voice cloning through audio prompting, but the workflow is rougher than the dedicated cloning models in the catalog. If you need to clone an individual voice from a short reference, Qwen3-TTS Base is more direct.
Production scripts that need a specific brand voice. Dia picks the voices for [S1] and [S2] based on the input pattern; it does not give you fine-grained control over which voice it uses. If your project requires a specific licensed voice or a consistent brand voice across episodes, Eleven v3's voice library gives you control that Dia does not.
The framing that has worked for me: Dia is the right answer when the shape of the script is "two or more people talking" and the wrong answer when the shape is "one person reading text".
A worked podcast scene workflow
Once you have a project that fits Dia, the workflow that produces good audio looks like this:
- Write the scene as a conversation script. Use [S1] and [S2] (and [S3] if a third speaker enters) consistently. The Nari Labs docs are clear: always start with [S1], alternate, and avoid two consecutive same-speaker tags. Ending the script with the second-to-last speaker tag improves the closing audio quality, per the docs.
- Add non-verbal cues where they belong, naturally, in parentheses inline with the dialogue. Resist the urge to over-tag. A laugh on every other line is a tell.
- Generate the scene and listen end-to-end. That first pass will surface most of the issues.
- If a non-verbal beat does not land (a (coughs) that comes out flat, a (gasps) that gets skipped), regenerate with that line slightly rephrased. Some non-verbal tags are more reliable than others, and small tweaks to the surrounding context often shift the output.
- For longer projects, generate scene by scene rather than as a single 3,000-character block; a pre-flight sketch for this step follows the list. Voice identity stays reasonably consistent across requests, and shorter generations produce cleaner output than scripts that approach the cap.
- If you need the surrounding narration in a different voice (a narrator's intro, scene-setting voiceover, an outro), generate those separately in Eleven v3 or Eleven Multilingual v2 and edit them together in Audacity, Reaper, or any digital audio workstation. The hybrid workflow produces audio that sounds polished without relying on any single model to do everything.
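For that scene-by-scene step, a small pre-flight check catches tag-rule and cap violations before they cost a generation. A sketch, assuming scenes are separated by blank lines; the helpers are mine, not part of any Dia tooling:

```python
import re

MAX_CHARS = 3_000  # per-request cap

def split_scenes(script: str) -> list[str]:
    """One generation per scene: split the full script on blank lines."""
    return [s.strip() for s in script.split("\n\n") if s.strip()]

def check_scene(scene: str) -> list[str]:
    """Flag violations of the documented tag rules and the character cap."""
    problems = []
    speakers = re.findall(r"\[S(\d)\]", scene)
    if not speakers or speakers[0] != "1":
        problems.append("does not start with [S1]")
    if any(a == b for a, b in zip(speakers, speakers[1:])):
        problems.append("two consecutive same-speaker tags")
    if len(scene) > MAX_CHARS:
        problems.append(f"{len(scene)} chars, over the {MAX_CHARS} cap")
    return problems

script = open("script.txt").read()
for i, scene in enumerate(split_scenes(script), start=1):
    for problem in check_scene(scene):
        print(f"scene {i}: {problem}")
```

It does not check the end-with-the-second-to-last-speaker-tag rule, which is easier to apply by eye while writing.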
This is roughly the workflow that turns a script into a finished podcast scene in about thirty minutes.
What open-source means in this context
Apache 2.0 is the license. The model weights are on Hugging Face. The inference code is on GitHub. The team publishes its own demo page comparing Dia to ElevenLabs and Sesame, and the comparison is open enough that you can verify it yourself.
For most users of the AI text-to-speech tool, the open-source license is interesting but not load-bearing. You are running Dia through managed infrastructure inside the tool, and the experience is identical to using any of the other models. The open-source nature matters in three specific cases:
First, you can self-host. If your project requires that audio never leave your infrastructure (regulated industries, sensitive content, contract requirements), Dia is the only model in the catalog you can pull down and run on your own GPU, with no vendor in the loop.
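If you go the self-host route, the pieces fit together in a few lines. This is a sketch based on the nari-labs/dia repository's README at the time of writing; check the repo for the current entry points before relying on the exact names:

```python
# Self-hosted inference sketch; entry points per the nari-labs/dia README
# at the time of writing. Expect roughly 10 GB of VRAM for the full model.
import soundfile as sf
from dia.model import Dia

# Pulls the Apache 2.0 weights from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

script = "[S1] Did you really just outrun three drones? [S2] I did. (laughs)"
audio = model.generate(script)

# 44.1 kHz output, per the repo's examples.
sf.write("scene.wav", audio, 44100)
```

No vendor in the loop: the script, the weights, and the audio all stay on your machine.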
Second, the model will not be deprecated by a vendor decision. ElevenLabs has already deprecated Turbo v2.5 in favor of Flash v2.5; that kind of model lifecycle is normal for managed APIs. Dia is permanent in the sense that the weights you have today will still run identically in three years.
Third, you can fine-tune. The training code is not fully open in the same way the weights are, but the inference code and model architecture allow advanced users to extend or specialize the model. For a small studio that wants its own permanent voice catalog, this is unique among the eleven.
For the bulk of projects, none of this matters. For the projects where it matters, it matters a lot.
What I take from working with Dia
The eleven-model catalog is not a contest where one model wins. Dia is the clearest example. By every conventional benchmark (language coverage, voice library, character cap, latency, leaderboard naturalness) Dia is not the top model. By the specific test of "does this generate a scene that sounds like two people having a conversation, with real laughter when the script asks for it", Dia is alone.
That kind of model-shaped specialization is the right reading of the multi-model catalog in general. You are not picking the best model. You are picking the model whose shape matches the script's shape. For dialogue, scene work, and non-verbal-rich English audio, the shape that matches is Dia. For everything else, look elsewhere in the catalog.
The good news: you can write the scene in your own voice, generate it in Dia, listen back in three minutes, and have a finished bit of audio that no other model in this catalog could have produced.