Speaker diarization without losing your mind: a practical guide
Speaker labels look magical and fail in predictable ways. Here is a practical guide to when to enable them, how to set the min and max speaker counts, and how to read the results.
Speaker diarization is the part of transcription where the model is asked to figure out, from the audio alone, how many people are talking and label each segment with which person said it. When it works, the transcript reads like a play script with named voices. When it fails, you get four "speakers" for two people, or two "speakers" for a four-person panel, with the labels swapped at random across the file.
This is the practical guide. Less theory, more "what to set when, and what to do when the output is wrong."

When to enable diarization at all
Default off. Turn it on only when:
- You have at least two distinct voices in the audio.
- The voices are distinguishable (different gender, different age, different accent, or simply distinct voice fingerprints).
- You need to know who said what in the downstream use (interview transcript, panel discussion, multi-host podcast, court deposition, conversational research).
If you are transcribing a single-speaker recording (a lecture, a monologue, a voicemail), diarization is overhead and gives you nothing useful. The model will either correctly label everything as one speaker (best case) or invent a phantom second speaker for an echo or a moment of background noise (worst case).
The min and max speaker count knobs
This is the part nobody explains well. The system has two optional hints: minimum number of speakers and maximum number of speakers. Both default to "let the model decide."
The hints are constraints, not commands. They tell the model: "I am confident the answer is in this range; don't go outside it." The model still does the actual segmentation; the hints just narrow the search space.
How to set them:
- Two-person interview, podcast, or 1-on-1 meeting: min 2, max 2. The hint is rigid because you actually know there are exactly two voices. The output is dramatically better than letting the model guess, because the most common error mode without the hint is the model splitting one speaker across two labels.
- Three- to five-person podcast or roundtable: min equals the actual number, max equals the actual number plus one. The "plus one" gives the model headroom for cases where a guest joins briefly or where there is an interviewer asking short questions in a different voice.
- Panel discussion, conference Q&A, or anything with audience participation: min equals the panel size, max equals the panel size plus three to five. Audience members count as speakers when the model hears them; you usually want them captured but separated from the panelists.
- Courtroom or deposition with a known cast: min equals the cast size, max equals the cast size. The court reporter knows there are four named participants; do not let the model invent a fifth.
- Conference call or meeting with unknown attendance: set min to 2, leave max open or set it generously (10+). The model is in detection mode rather than constraint mode.
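If you are driving a transcription tool from a script, the presets above collapse to a small table. This is a sketch, not any particular tool's API: the flag names (--min-speakers, --max-speakers, --diarize) are invented for illustration, so map them to whatever min/max hints your tool actually exposes.

```python
# Presets from the list above. None means "let the model decide."
# The flag names below are assumptions, not a real tool's interface.
PRESETS = {
    "two_person_interview":     {"min": 2, "max": 2},
    "roundtable_of_4":          {"min": 4, "max": 5},     # actual count, plus one
    "panel_of_3_plus_audience": {"min": 3, "max": 7},     # panel size, plus three to five
    "deposition_of_4":          {"min": 4, "max": 4},     # known cast, no headroom
    "unknown_meeting":          {"min": 2, "max": None},  # detection mode
}

def hint_flags(preset: str) -> list[str]:
    """Translate a preset into hypothetical command-line flags."""
    hints = PRESETS[preset]
    flags = ["--diarize", "--min-speakers", str(hints["min"])]
    if hints["max"] is not None:
        flags += ["--max-speakers", str(hints["max"])]
    return flags

print(hint_flags("two_person_interview"))
# ['--diarize', '--min-speakers', '2', '--max-speakers', '2']
```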
Recording conditions that affect diarization quality
Diarization is more sensitive to audio quality than transcription itself. The model needs to hear voice fingerprints clearly to distinguish them, and the fingerprints get muddy faster than the words do.
Reliable diarization requires:
- Each speaker captured by their own microphone OR a single high-quality microphone in a quiet space with non-overlapping speakers.
- Speakers who do not interrupt each other constantly. Cross-talk degrades labels fast.
- Voices that sound different. Two adult male voices with similar accents and pitch are the hardest case; different ages, genders, or accents make the model's job easier.
- Recordings without significant background noise, especially without other voices in the background (cafe recordings are nearly impossible to diarize cleanly).
Diarization fails or degrades when:
- Multiple speakers are on a single shared phone or speaker microphone.
- Heavy reverb or room echo causes the same voice to sound like two voices.
- Speakers regularly interrupt or talk over each other.
- One speaker dominates 80%+ of the audio and the others are short interjections (the model often misses the short interjections entirely).
- The audio is heavily compressed or low-bitrate (voice memos at 32 kbps).
Reading the output
Diarization annotations only round-trip cleanly in the detailed JSON output format. In SRT, VTT, or plain JSON, the speaker information is either dropped or smashed into a per-cue tag that varies by tool.
When you read the detailed JSON, each segment will have a speaker label like "Speaker 1," "Speaker 2," etc. The labels are stable within a file (Speaker 1 is always the same person across the file) but not stable across files (the same person in two different files might be Speaker 2 in one and Speaker 1 in the other).
If you need named speakers ("Alice" and "Bob" instead of "Speaker 1" and "Speaker 2"), do the rename as a post-processing step. Listen to the first 30 seconds, identify which numbered speaker is which person, then run a find-and-replace.
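As a sketch of that post-processing step, assuming the detailed JSON holds a top-level "segments" list with "start", "speaker", and "text" fields (your tool's schema may differ; check it before trusting the field names), the rename is a few lines of Python:

```python
import json

# Fill this in after listening to the first 30 seconds.
SPEAKER_NAMES = {"Speaker 1": "Alice", "Speaker 2": "Bob"}

def to_script(detailed_json_path: str) -> str:
    """Turn the detailed JSON into a named, script-style transcript.
    The field names ("segments", "start", "speaker", "text") are
    assumptions -- adjust to your tool's actual schema."""
    with open(detailed_json_path) as f:
        segments = json.load(f)["segments"]
    lines = []
    for seg in segments:
        who = SPEAKER_NAMES.get(seg["speaker"], seg["speaker"])
        lines.append(f"[{seg['start']:.1f}s] {who}: {seg['text'].strip()}")
    return "\n".join(lines)
```

Any label you forgot to map falls through unchanged, so an unexpected "Speaker 3" in the output is also a cheap way to spot a phantom speaker.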
When diarization is wrong, what to do
The model gets it wrong sometimes. Here are the common failure modes and the typical fix for each.
Two speakers becoming three or four labels. Usually because one person's voice changes register (a soft moment, a loud moment) or because there is room reverb that the model is treating as a second voice. Fix: set max to the actual count and re-run. The constraint forces the model to merge.
Two speakers becoming one label. The voices are too similar for the model to separate. Fix: usually unfixable without re-recording with separate microphones. Sometimes setting min to 2 helps; often it does not.
Speaker labels swapping mid-file. Common when audio quality changes (a guest's microphone level shifts, or one speaker moves around the room). Fix: hardest case. Either accept the labels are imprecise and post-process by listening, or re-record with consistent setups.
Phantom speakers from background noise. The model treats a TV in another room or a notification chime as a speaker. Fix: clean up the source audio (remove non-voice content) before re-running, or set a max constraint that excludes the phantom.
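For the "set max and re-run" fixes, the re-run is just the original command with both hints clamped to the count you actually hear. A minimal sketch, reusing the invented flag names from earlier; "transcribe-tool" is a placeholder for whatever binary you actually invoke:

```python
import subprocess

def rerun_constrained(audio_path: str, actual_speakers: int) -> None:
    """Re-run with min and max pinned to the real speaker count, forcing
    the model to merge split or phantom labels. The tool name and flags
    are placeholders for whatever your tool provides."""
    subprocess.run(
        ["transcribe-tool", audio_path, "--diarize",
         "--min-speakers", str(actual_speakers),
         "--max-speakers", str(actual_speakers),
         "--output-format", "detailed-json"],
        check=True,
    )
```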
A workflow that holds up
For a typical multi-speaker recording:
- Confirm the audio is reasonably clean (single-room or multi-mic recording, not a cafe).
- Count the speakers you actually hear in the first minute.
- Enable speaker diarization.
- Set min and max according to the presets above.
- Set output format to detailed JSON.
- Run.
- Spot-check the first three minutes of output: are the speaker labels consistent and roughly right?
- If yes, batch the rest. If no, adjust min/max and re-run on the same sample before committing to the full file.
The pattern that produces the best results is "constrain the model with what you know" plus "verify on a small sample before processing the full hour." Trying to brute-force a 90-minute panel with no constraints almost always produces a transcript you cannot use.
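The spot-check step is easy to script, too. A sketch that dumps the first three minutes of labeled segments for eyeballing, using the same assumed JSON schema as before:

```python
import json

def spot_check(detailed_json_path: str, minutes: float = 3.0) -> None:
    """Print the first few minutes of labeled segments so you can check
    label consistency before batching the full file."""
    with open(detailed_json_path) as f:
        segments = json.load(f)["segments"]
    for seg in segments:
        if seg["start"] > minutes * 60:
            break
        print(f"{seg['start']:7.1f}s  {seg['speaker']:>10}: {seg['text'].strip()}")
```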
What this is not
Speaker diarization is not voice identification. It tells you "this segment was a different voice from that one," not "this voice belongs to Senator Smith." Identifying named speakers requires either a voice-print database (which the available tools do not include) or a manual pass after transcription. The diarization step gives you anonymous numbered speakers; you do the named-person mapping yourself.
Treat the speaker labels as a useful starting point for a script-style transcript, not as a finished cast list. The model is good at "different voices"; it is not psychic.