Pick the right transcription output format: TXT, JSON, SRT, VTT

The output-format dropdown on a transcription tool looks like a small UI choice. It is not. It locks you into what you can do downstream. Pick text when you needed timestamps, and you are running the whole transcription again. Pick detailed JSON when you needed something an editor can paste into a document, and you are running a JSON parser. The right format is the one whose downstream tools you actually have.

This is a short, opinionated guide to picking. Five formats, five jobs.

What each format actually contains

Plain text is the transcript with no timing, no speaker labels, no structure. Just words and paragraph breaks. Smallest output, easiest to read, useless for anything that needs to know when something was said.

JSON is the transcript split into segments (chunks of a few seconds each), with start and end timestamps per segment. Useful when you want a script-readable structure and timing at the segment level. No per-word timing. No speaker info beyond what the segmenter inferred.

SRT is the universal subtitle format. Plain text segments with start/end timecodes formatted as 00:01:23,456 --> 00:01:27,890, with each cue numbered. Virtually every video editor imports SRT cleanly. No styling, no metadata, no speaker labels. Comma as the decimal separator (this matters; see below).

VTT (WebVTT) is the web-friendly subtitle format. Same structural idea as SRT but uses a dot decimal separator (00:01:23.456), starts with a required WEBVTT header line, and supports inline styling, positioning, and metadata. Designed to drop into an HTML5 <track> element on a <video> tag.

Detailed JSON (sometimes called verbose JSON in API documentation) is the maximalist option. Segments, words with their own timestamps, optional speaker annotations, plus the same metadata as plain JSON. Largest payload. Required if you want word-level timing or speaker diarization data in any usable form.
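To make the SRT and VTT differences concrete, here is a minimal Python sketch that renders the same segments both ways. The segment shape (a list of dicts with `start`, `end`, and `text`, times in seconds) is an assumption for illustration; check the field names in your tool's actual JSON output before relying on it.

```python
def format_timecode(seconds: float, decimal: str) -> str:
    """Format seconds as HH:MM:SS<decimal>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal}{ms:03d}"


def to_srt(segments: list) -> str:
    """Numbered cues, comma as the decimal separator, no header."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timecode(seg["start"], ",")
        end = format_timecode(seg["end"], ",")
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


def to_vtt(segments: list) -> str:
    """WEBVTT header first, dot as the decimal separator, cue numbers optional."""
    cues = ["WEBVTT\n"]
    for seg in segments:
        start = format_timecode(seg["start"], ".")
        end = format_timecode(seg["end"], ".")
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


# Hypothetical segments, shaped like the assumed JSON output.
sample = [
    {"start": 83.456, "end": 87.89, "text": " So what happened next was"},
    {"start": 87.89, "end": 91.0, "text": " nobody expected it."},
]
print(to_srt(sample))
print(to_vtt(sample))
```

The only differences between the two renderers are the header line, the cue numbers, and the decimal separator, which is exactly why the formats are easy to confuse and easy to convert between.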

Figure: Decision tree mapping each format to its primary downstream job, from "I want to read it" through "I want web-embedded captions" to "I want word-level timing for a custom highlighter".

A decision tree, in five branches

Branch one, you want to read or paste the transcript. TXT. Stop. Anything more structured is overkill and you will spend longer cleaning the JSON than you saved.

Branch two, you want to upload subtitles to YouTube, TikTok, Instagram, Vimeo, or any video editor (Premiere, Final Cut, CapCut, DaVinci Resolve). SRT. YouTube accepts VTT uploads as well, but advanced styling is generally not preserved, so even if you generate VTT, you usually end up with the equivalent of SRT.

Branch three, you want to embed captions on a website with a <video> element. VTT. The HTML5 <track> element is built around it. SRT will not load directly in a browser caption track without conversion.
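If all you have is an SRT file, the conversion to VTT is small. The sketch below is a minimal Python version that assumes a well-formed SRT input; the function name `srt_to_vtt` is made up for this example. It adds the required header and swaps the decimal separator on timing lines only, leaving commas in the caption text alone. SRT cue numbers are kept, since WebVTT allows an identifier line above each cue.

```python
import re

# Matches an SRT timecode such as 00:01:23,456 and captures both halves.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert well-formed SRT text to WebVTT text."""
    lines = ["WEBVTT", ""]
    # Normalize Windows line endings and drop a leading byte-order mark.
    cleaned = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    for line in cleaned.split("\n"):
        if "-->" in line:
            # Only timing lines get the comma-to-dot swap.
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "\n".join(lines) + "\n"


srt = "1\n00:01:23,456 --> 00:01:27,890\nHello, world\n"
print(srt_to_vtt(srt))
```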

Branch four, you want word-level timestamps for a custom feature (a karaoke-style highlighter, a search-into-audio interface, a podcast jump-to-quote button). Detailed JSON, with the word-level timestamp option enabled. There is no other route to per-word timing in the available formats.
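As a rough illustration of what per-word timing buys you, here is a Python sketch of the lookup a karaoke-style highlighter needs: given the playback position, which word is being spoken? It assumes the detailed JSON exposes a list of words sorted by start time, each with `word`, `start`, and `end` fields in seconds. Those field names are an assumption, not a documented schema.

```python
from bisect import bisect_right
from typing import Optional


def word_at(words: list, t: float) -> Optional[dict]:
    """Return the word being spoken at time t (seconds), or None during a gap."""
    starts = [w["start"] for w in words]
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["end"]:
        return words[i]
    return None


# Hypothetical word list, shaped like the assumed detailed JSON output.
words = [
    {"word": "So", "start": 0.0, "end": 0.2},
    {"word": "what", "start": 0.25, "end": 0.5},
    {"word": "happened", "start": 0.5, "end": 1.1},
]
print(word_at(words, 0.3))
```

In a real player you would build the `starts` list once rather than on every call; it is rebuilt here only to keep the sketch short.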

Branch five, you have multiple speakers and you want them labeled. Detailed JSON. The speaker diarization annotations only round-trip cleanly in detailed JSON; in SRT/VTT/JSON they are either dropped or smashed into the segment-level cue.
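Once the labels are in hand, turning them into something readable is a short post-processing step. The Python sketch below merges consecutive words from the same speaker into turns and prefixes each turn with its label. It assumes every word entry in the detailed JSON carries a `speaker` field alongside `word`, `start`, and `end`; that shape is hypothetical, so adapt the keys to what your output actually contains.

```python
def speaker_turns(words: list) -> list:
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        text = w["word"].strip()
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = w["end"]
        else:
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": text,
            })
    return turns


def label(turn: dict) -> str:
    """Render a turn with the speaker as a bracketed prefix."""
    return f"[{turn['speaker']}] {turn['text']}"


# Hypothetical diarized words, shaped like the assumed detailed JSON output.
words = [
    {"word": "hello", "start": 0.0, "end": 0.4, "speaker": "Speaker 1"},
    {"word": "there", "start": 0.4, "end": 0.8, "speaker": "Speaker 1"},
    {"word": "hi", "start": 1.0, "end": 1.2, "speaker": "Speaker 2"},
]
for turn in speaker_turns(words):
    print(label(turn))
```

Each turn keeps its start and end time, so the same list can feed a subtitle writer or a plain-text transcript.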

Figure: Side-by-side sample of the same 30 seconds of audio rendered as SRT, VTT, JSON, and detailed JSON, with annotations pointing to where each format gains or loses information.

The traps

A few things people get wrong on first use.

"I'll generate detailed JSON, then convert to SRT later." Workable but more work than just generating SRT in the first place. The conversion strips the per-word data anyway, so you are paying for the largest format and throwing away the only thing that justifies it.

"VTT is just SRT with a different decimal." Almost true, but the WEBVTT header line is required, and some video editors silently fail to import VTT because they expect SRT. If you are uploading to a CMS or social platform and not certain it accepts VTT, default to SRT.

"I'll use JSON because it's the most flexible." Standard JSON gives you segment-level timing and nothing else. If you wanted "the most flexible" you wanted detailed JSON. The plain JSON option is a half-measure that mostly serves bots that hate parsing SRT.

"Can I get speaker labels in SRT?" No, not in any standard implementation. Some tools encode speaker IDs as a prefix on the cue text ([Speaker 1] hello), but the SRT format itself has no speaker concept. If labels matter, generate detailed JSON, then process it into whatever format you need.

The under-discussed format: timestamped text

Worth mentioning because it does not exist as a first-class option but is what a lot of people actually want: a plain-text transcript with periodic timestamps inline, like [00:12:34] So what happened next was…. None of SRT, VTT, JSON, or detailed JSON gives you exactly this.

The way to get it: generate detailed JSON, then run a small script that walks the segments and emits text with a timestamp every N seconds (or at every paragraph break). It is twenty lines of code and lets you produce show notes, blog drafts, and meeting summaries that read naturally and link back to specific moments.
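A minimal Python sketch of that script, assuming the JSON contains a list of segments with `start` (in seconds) and `text` fields. The field names and the function names are illustrative, not the tool's documented schema. It starts a new paragraph with a timestamp whenever at least `every` seconds have passed since the last marker.

```python
def stamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def timestamped_text(segments: list, every: float = 60.0) -> str:
    """Join segment text into paragraphs, each opened by an inline timestamp."""
    paragraphs = []
    current = []
    next_mark = 0.0
    for seg in segments:
        if seg["start"] >= next_mark:
            if current:
                paragraphs.append(" ".join(current))
            current = [f"[{stamp(seg['start'])}]"]
            # Schedule the next marker relative to this segment's start.
            next_mark = seg["start"] + every
        current.append(seg["text"].strip())
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


# Hypothetical segments, shaped like the assumed JSON output.
segments = [
    {"start": 0.0, "text": " Welcome back."},
    {"start": 20.0, "text": " Today we talk formats."},
    {"start": 754.0, "text": " So what happened next was"},
]
print(timestamped_text(segments, every=60.0))
```

Timestamps always land on segment boundaries, so a marker never splits a sentence fragment in the middle of a cue.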

If you need this format often, do the conversion once and save the script. The base transcription only needs to run once.

A second opinion on the language hint

Adjacent to format choice but worth flagging in the same conversation: the language hint setting. Auto-detect is good. It is not infallible. If your audio is short (under 30 seconds), code-switches between languages, or starts with non-speech audio (music, silence, room noise), auto-detect can land on the wrong language and produce plausible-looking words from a related language that do not match what was actually said.

Setting the language explicitly when you know it takes ten seconds and is worth it. The tool supports the full Whisper-style language list, roughly 99 languages, so the right answer is almost always available.

What this means in practice

Pick the format before you transcribe, not after. If you are not sure, pick the most-compatible option for your downstream:

Reading or summarizing later: TXT.
Subtitles for any video platform: SRT.
Captions for a website video player: VTT.
Anything needing per-word timing or speaker labels: detailed JSON.

Converting down from a richer format is a scripting job, but going the other way (TXT to SRT, or SRT to word-level timing) means re-running the transcription. The transcription costs credits. Get it right the first time.

Z.Tools (z.tools)

Audio Transcription · Z.Tools

Convert audio and video to text with subtitles