Live captions are not the same as accessible captions
Auto-generated captions look like an accessibility solution and usually aren't. Here is the difference between live captions and accessible captions, why the ADA and WCAG care, and what good practice looks like in 2026.
There is a comfortable assumption in product teams that "we added auto-captions" is the same as "we made our content accessible." It is not. The first is a technology feature; the second is a legal and ethical commitment with a specific quality bar. The two overlap, but treating auto-captions as the finished accessibility story is how organizations end up in front of the Department of Justice or on the wrong side of a Section 508 audit.
This is the explainer I wish I had read three years ago.

What "accessible" actually means here
The Americans with Disabilities Act (ADA) requires "auxiliary aids and services" to ensure effective communication for people with disabilities. For audio and video content, that includes captions for deaf and hard-of-hearing users. The exact obligation depends on the content type and the entity, but the principle is durable: the experience for someone who cannot hear should be equivalent to the experience for someone who can.
The W3C's Web Content Accessibility Guidelines (WCAG) operationalize this. Level A requires a transcript or equivalent alternative for prerecorded audio-only content and captions for prerecorded video. Level AA adds captions for live video. Section 508 of the Rehabilitation Act, which applies to federal agencies and many of their contractors, references WCAG 2.0 Level A and AA as the compliance baseline.
Three things follow:
- The legal frame applies to far more content than people assume. Conference live streams, recorded webinars, marketing videos, training content, customer support videos, even internal all-hands recordings if there are deaf employees on the team.
- The standard is not "we tried." It is "the captions are accurate enough to convey the same information."
- Auto-generated captions, by themselves, almost never meet the standard.
Where auto-captions break
Modern transcription models post impressively low Word Error Rates (WER) on benchmarks. The numbers on marketing pages range from 2.5 percent to 8.5 percent depending on the system. Sounds great. The problem is that "97 percent accurate" still means roughly 3 wrong words per 100, and the wrong words are not random.
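A quick way to ground that arithmetic: word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. A minimal sketch in plain Python, with no external dependencies:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# "97 percent accurate" is a WER of 0.03: about 3 edits per 100 reference words.
```

The metric weighs every word equally, which is exactly why a low score can coexist with the clustering problem described next.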
The errors that auto-captions make tend to cluster on:
- Proper nouns (person names, product names, company names).
- Domain jargon (medical terms, legal terms, technical product features).
- Numbers and dates, which auto-captions render inconsistently, sometimes as words and sometimes as digits.
- Speaker turns that overlap or interrupt each other.
- Anything spoken with an accent the model was under-trained on.
- Anything spoken in a non-dominant language.
In other words: auto-caption errors cluster on exactly the content that matters most for comprehension. A 3 percent error rate is fine if the errors are randomly distributed across "the" and "and." A 3 percent error rate where the errors are "the CEO's name," "the product code," "the dollar amount," and "the technical term that the entire next sentence depends on" is not fine.
What "accessible captions" looks like
Three things separate accessible captions from auto-generated captions:
Accuracy verified by a human. Someone reads the captions while listening to the audio and corrects errors. For prerecorded content, this is the standard. There is no automation step that replaces this for compliance.
Speaker identification. When more than one person is speaking, the captions identify the speaker. "Speaker 1" is fine if speakers are anonymous; otherwise the captions name them.
Non-speech information. A door slams in the background, music starts, the speaker laughs, applause begins. Captions for accessibility include these ([door slams], [applause], [laughter]) because they are part of the experience.
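In file terms, the second and third items usually live inline in the cue text. A hypothetical SRT excerpt (names, wording, and timings invented for illustration) with a speaker label and a non-speech annotation:

```
14
00:06:12,400 --> 00:06:15,100
MARIA: The retention number is the one
the board actually asked about.

15
00:06:15,100 --> 00:06:16,300
[applause]
```

WebVTT uses the same conventions for cue text; the main formatting differences are the file header and a period instead of a comma in the timestamps.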
For live content, the equivalent is CART (Communication Access Realtime Translation): a trained human captioner listening live and producing captions in real time. The accuracy is roughly equivalent to a corrected post-recording transcript, with the latency of a human typing.
Where the tool on this site fits
Auto-transcription tools, including the one on this site, speed up the first pass. They are not accessibility solutions on their own. The honest framing is:
- For prerecorded audio: generate the transcript, then have a human review and correct it. The corrected transcript becomes the accessible captions, exported as SRT or VTT (a conversion sketch appears at the end of this section).
- For live audio: do not rely on the auto-transcription tool. Hire a CART captioner for events where accessibility matters. Auto-tools are not yet accurate enough at the latency required for live use.
- For internal-only content where no compliance obligation exists: auto-captions are fine, with the understanding that some errors will slip through.
Naming this gap is more honest than implying the tool is a turnkey accessibility solution. It is not. It is a useful starting point for a workflow that ends with a human reviewer.
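As a concrete illustration of the prerecorded path (the conversion sketch promised above): once a human has corrected the SRT, producing the WebVTT file most web players expect is mechanical. A minimal sketch in Python; the filenames are invented, and the code assumes a well-formed SRT with comma decimal separators in its timestamps:

```python
from pathlib import Path

def srt_to_vtt(srt_path: str, vtt_path: str) -> None:
    """Convert a corrected SRT caption file to WebVTT.

    The formats differ mainly in the header line and the decimal
    separator in timestamps; cue text passes through unchanged."""
    lines = Path(srt_path).read_text(encoding="utf-8").splitlines()
    out = ["WEBVTT", ""]  # every WebVTT file starts with this header
    for line in lines:
        if "-->" in line:
            # 00:06:12,400 --> 00:06:15,100  becomes  00:06:12.400 --> 00:06:15.100
            line = line.replace(",", ".")
        out.append(line)
    Path(vtt_path).write_text("\n".join(out) + "\n", encoding="utf-8")

srt_to_vtt("webinar_corrected.srt", "webinar_corrected.vtt")
```

The hard part is the human correction pass; the format conversion is the easy part.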
A workflow that meets the bar
For a typical recorded webinar or training video, the path that hits WCAG Level AA looks like this:
- Generate the auto-transcript. Use the highest-accuracy model available, with speaker diarization enabled and the source language explicitly set. Output as SRT or detailed JSON if you want to programmatically post-process.
- Human review pass. A reviewer plays the video alongside the transcript at slightly above normal speed, correcting errors, fixing proper nouns, normalizing numbers, adding non-speech annotations, and fixing speaker labels.
- Spot-check on the corrections. A second reviewer (or the original creator) spot-checks 10 percent of the captions to catch missed errors.
- Sync verification. The reviewer plays the captioned video at normal speed, confirms the captions appear when the corresponding words are spoken, and adjusts timestamps where the auto-transcription drifted. A short script can pre-flag likely problem cues; see the sketch after this list.
- Publish with the captions enabled by default in the player.
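Parts of the review and sync passes can be pre-flagged by a script so the reviewer's time goes where it matters (this is the sketch referenced in the sync-verification step). A sketch in plain Python, assuming the corrected SRT from step 1; the 20 characters-per-second ceiling is a common readability rule of thumb, not a WCAG requirement:

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")
MAX_CHARS_PER_SECOND = 20  # assumption: a common readability rule of thumb

def to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, TIMESTAMP.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def flag_suspect_cues(srt_text: str) -> list[str]:
    """Flag cues that are probably too fast to read or out of order."""
    warnings, prev_end = [], 0.0
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks; a human still reviews the file
        start_raw, end_raw = (part.strip() for part in lines[1].split("-->"))
        start, end = to_seconds(start_raw), to_seconds(end_raw)
        text = " ".join(lines[2:])
        duration = end - start
        if duration <= 0:
            warnings.append(f"cue {lines[0]}: zero or negative duration")
        elif len(text) / duration > MAX_CHARS_PER_SECOND:
            warnings.append(f"cue {lines[0]}: {len(text) / duration:.0f} chars/sec")
        if start < prev_end:
            warnings.append(f"cue {lines[0]}: overlaps the previous cue")
        prev_end = end
    return warnings

# Example: print(flag_suspect_cues(open("webinar_corrected.srt", encoding="utf-8").read()))
```

A script like this catches the mechanical problems; it does not replace the human listening pass.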
Time cost: for a 30-minute video, the full workflow takes 60-90 minutes of human time. The auto-transcript step takes a few minutes; the rest is the human review and verification. Compared to typing from scratch (which takes 4-6 hours for the same video), this is dramatically faster.
For live events: book a professional CART captioner. The cost varies but the path is straightforward: a credentialed human listens live, types the captions, the captions stream to viewers via the event platform.
The argument I want to make explicitly
It is tempting to treat accessibility as a checkbox: "we have captions" gets ticked because the platform shows captions on the video. The captions exist. The accessibility commitment is not honored.
The opposite framing is harder but more honest. Accessibility is not a feature you ship; it is a commitment to a quality bar. Auto-captions help meet that bar faster, but they do not meet it on their own. The right metric is not "do captions exist" but "are the captions good enough that a deaf user gets the same information as a hearing user."
For the kinds of content most teams produce (recorded talks, training videos, marketing content), the corrected-auto-caption workflow above is the right answer. It uses modern transcription tools where they are good and humans where they are still required. The path is not new; it is just under-promoted because "use a human reviewer" is not a SaaS marketing line.
For the audio-transcription tool on this site, the appropriate framing is "this saves you time on the first pass; the second pass is still your job." That framing is less impressive than "we made your content accessible." It is also true.