Word-level timestamps and what to actually build with them

The word-timestamp option on the text-to-speech tool returns the start and end of every word alongside the audio. That metadata is the difference between a player that just plays audio and a player that does follow-along reading, language learning, jump-to-paragraph navigation, and accessibility narration that earns its place in the page.


The word-timestamp checkbox on a TTS tool looks like a debugging feature. Turn it on, get a JSON array, see the start and end of every word. Most users tick it once, look at the output, and move on. The interesting story is what you can build with that array, because the answer is "almost any reading-style audio experience that justifies its place on a page."

This is the field guide to what word timestamps actually enable and how to use them in production.

What the timestamps actually contain

The output is an array of objects with a word, a start time in seconds, and an end time in seconds. A short example, abbreviated:

[
  { "word": "The", "start": 0.04, "end": 0.18 },
  { "word": "quick", "start": 0.18, "end": 0.49 },
  { "word": "brown", "start": 0.49, "end": 0.81 },
  { "word": "fox", "start": 0.81, "end": 1.12 }
]
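
In code, that shape is easy to model. A minimal sketch in TypeScript, assuming the timestamps were saved as a JSON file next to the audio (the loader function and its URL are illustrative, not part of the tool's output):

// Shape of one entry in the word-timestamp array.
interface WordTimestamp {
  word: string;
  start: number; // seconds from the beginning of the audio
  end: number;   // seconds from the beginning of the audio
}

// Illustrative loader: fetch the JSON that was saved alongside the audio file.
async function loadTimestamps(url: string): Promise<WordTimestamp[]> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to load timestamps: ${res.status}`);
  return (await res.json()) as WordTimestamp[];
}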

A few non-obvious properties:

  • The end of one word is usually the start of the next, with very small gaps. The model produces continuous narration, so the timestamps reflect the actual acoustic boundaries.
  • Punctuation is mostly absorbed into the surrounding words rather than appearing as separate entries. A period at the end of a sentence shows up as extra duration on the last word, not as its own entry.
  • The timestamps are in the audio's own timeline. A player that changes speed by adjusting the audio element's playbackRate keeps reporting currentTime on that same timeline, so the values work unchanged; if you re-encode the audio itself at a different speed, you must regenerate the timestamps from scratch.
  • The feature is currently English-only in the tool. Other languages still receive audio, but no timestamp metadata.

That is the raw material. The interesting work is using it.

Use case one: follow-along reading

The simplest application is also the most useful: a reading view that highlights each word as it is spoken. Listeners follow along visually, which is helpful for language learners, accessibility users, attention-sensitive readers, and anyone who prefers reading-while-listening to either alone.

Implementation is straightforward. Render the text as a sequence of word-level spans, attach the index to each span, and on each timeupdate event from the audio element, find the word whose timestamp range contains the current time and apply a highlight class to that span. Add a smooth transition; remove the class on the previous word.

A few practical refinements that turn this from a demo into something shippable (a code sketch follows the list):

  • Highlight the upcoming word slightly before it is spoken (lead by 50–100 ms) so the visual change feels in sync with the audio rather than slightly behind it.
  • Auto-scroll the page to keep the highlighted word in view, but only when the user is not actively scrolling. A user who has scrolled to read ahead should not be yanked back to the audio's position.
  • Allow the user to click on a word to seek the audio to that word's start. This turns the display from passive to interactive: the listener can rewind to a specific word by clicking it.
  • For long documents, render only the visible passage rather than the entire transcript at once. A page with thousands of word-level spans gets sluggish; virtualized rendering keeps it responsive.
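
To make the refinements concrete, here is a minimal sketch of the highlight wiring in TypeScript, reusing the WordTimestamp shape from the loader above and assuming the transcript is rendered as spans with data-i indices (as in the recipe at the end of this piece). The 80 ms lead, the two-second scroll guard, and the function names are illustrative choices, not the only way to do it:

const LEAD_SECONDS = 0.08; // highlight slightly ahead so the eye is not behind the ear

let currentIndex = -1;
let userIsScrolling = false;
let scrollTimer: number | undefined;

// Treat wheel/touch input as "the user is reading ahead" for a couple of seconds.
// (Listening for these instead of "scroll" avoids reacting to our own auto-scroll.)
["wheel", "touchmove"].forEach((evt) =>
  window.addEventListener(evt, () => {
    userIsScrolling = true;
    window.clearTimeout(scrollTimer);
    scrollTimer = window.setTimeout(() => (userIsScrolling = false), 2000);
  })
);

// Binary search over the sorted [start, end) ranges; -1 means "in a gap between words".
function findWordIndex(words: WordTimestamp[], t: number): number {
  let lo = 0, hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (t < words[mid].start) hi = mid - 1;
    else if (t >= words[mid].end) lo = mid + 1;
    else return mid;
  }
  return -1;
}

function attachHighlighting(audio: HTMLAudioElement, words: WordTimestamp[]) {
  // timeupdate fires only a few times per second; requestAnimationFrame gives tighter sync.
  audio.addEventListener("timeupdate", () => {
    const i = findWordIndex(words, audio.currentTime + LEAD_SECONDS);
    if (i === -1 || i === currentIndex) return;

    document.querySelector(`span[data-i="${currentIndex}"]`)?.classList.remove("current");
    const span = document.querySelector<HTMLElement>(`span[data-i="${i}"]`);
    span?.classList.add("current");
    currentIndex = i;

    // Auto-scroll only when the user is not reading ahead.
    if (!userIsScrolling) span?.scrollIntoView({ block: "center", behavior: "smooth" });
  });

  // Click any word to seek the audio to that word's start.
  document.querySelectorAll<HTMLElement>("span[data-i]").forEach((el) =>
    el.addEventListener("click", () => {
      audio.currentTime = words[Number(el.dataset.i)].start;
    })
  );
}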

This is the workflow that justifies the word-timestamp option for most use cases. It is the difference between an audio player on the page and a reading experience.

[Diagram: follow-along player layout. A reading panel on the left with one word highlighted, an audio control strip below, and a smaller side panel showing the next paragraph queued. Data flow: current time -> word lookup -> highlight class -> auto-scroll.]

Use case two: language learning interactions

For language-learning content, word timestamps enable a class of interactions that simply do not work without them.

Click any word to hear it pronounced individually. The user clicks a word in the transcript; the player seeks to that word's start time and pauses at its end time, so the listener hears just that word as the model actually spoke it, without replaying the surrounding sentence. Useful for vocabulary study where the learner wants to hear specific words repeatedly without rebuilding context every time.
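
A sketch of that interaction, assuming the same audio element and words array as above; the timeupdate-based stop is approximate, since the event fires only a few times per second:

// Illustrative: play exactly one word, then pause at its end boundary.
// A production version might poll with requestAnimationFrame for a tighter stop.
function playWord(audio: HTMLAudioElement, words: WordTimestamp[], index: number) {
  const w = words[index];
  const stopAtEnd = () => {
    if (audio.currentTime >= w.end) {
      audio.pause();
      audio.removeEventListener("timeupdate", stopAtEnd);
    }
  };
  audio.currentTime = w.start;
  audio.addEventListener("timeupdate", stopAtEnd);
  void audio.play();
}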

Slow-motion phrase repetition. The user selects a phrase (a sequence of words). The player loops the audio segment between the first word's start and the last word's end at half speed. Useful for listening practice where the learner wants to dissect a phrase's pronunciation and prosody at a slower pace. The half-speed playback may not produce native-quality audio, but it is good enough to study individual phonemes.
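
A sketch of the loop, again assuming the shared audio element and words array; the 0.5 rate and the cleanup pattern are illustrative:

// Illustrative: loop the segment from the first word's start to the last word's end
// at half speed, and return a cleanup function so the caller can stop the loop.
function loopPhrase(audio: HTMLAudioElement, words: WordTimestamp[], first: number, last: number) {
  const start = words[first].start;
  const end = words[last].end;

  const wrap = () => {
    if (audio.currentTime >= end) audio.currentTime = start; // jump back to the phrase start
  };

  audio.playbackRate = 0.5; // currentTime still advances on the audio's own timeline
  audio.currentTime = start;
  audio.addEventListener("timeupdate", wrap);
  void audio.play();

  return () => {
    audio.removeEventListener("timeupdate", wrap);
    audio.playbackRate = 1;
  };
}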

Word-level pronunciation comparison. The learner records themselves saying a word. The app compares the duration and energy contour of the learner's word to the TTS reference. The word-level boundaries from the timestamps make this comparison possible. With word-level alignment, you can flag specific words where the learner's timing or stress is far from the reference, rather than giving generic "your pronunciation needs work" feedback.

Spaced-repetition vocabulary cards. Auto-extract individual words from the audio with their timestamps. Pair each extracted clip with the word's text and a definition. Build a spaced-repetition deck where each card plays the actual model voice saying the actual word in the actual sentence. This is more useful than a deck where each card has a synthesized word read in isolation, because the in-context audio captures real prosody.

Use case three: jump-to-paragraph navigation

For long-form audio (audio articles, podcast-style content, audiobook chapters), word timestamps make granular navigation possible. The user sees a transcript with paragraph breaks; clicking on any paragraph seeks the audio to the start of that paragraph. This sounds trivial until you compare it to audio without timestamps, where the only navigation is the audio's own scrubber bar, accurate to the second, but useless for finding "the part where the host talks about X."

Implementation: on transcript render, store the timestamp of each paragraph's first word as a data attribute on the paragraph element. On click, seek the audio to that timestamp.
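
A sketch of that wiring, assuming each paragraph element carries the index of its first word in a data-first-word attribute written out at render time (attribute name illustrative):

// Illustrative: paragraphs are rendered as <p data-first-word="57">…</p>.
function wireParagraphNavigation(audio: HTMLAudioElement, words: WordTimestamp[]) {
  document.querySelectorAll<HTMLElement>("p[data-first-word]").forEach((p) =>
    p.addEventListener("click", () => {
      audio.currentTime = words[Number(p.dataset.firstWord)].start;
      void audio.play();
    })
  );
}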

A more sophisticated version: search across the transcript text. The user types "neural network." The app finds the matching paragraph, scrolls to it, and seeks the audio to the matching word. Word timestamps make text-search-to-audio navigation work end-to-end.
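
A sketch of the text-search half, assuming the word text may carry trailing punctuation that should be ignored when matching:

// Illustrative: find the first occurrence of a multi-word query in the transcript
// and return the start time of its first word, or null if it does not appear.
function findPhraseStart(words: WordTimestamp[], query: string): number | null {
  const target = query.toLowerCase().split(/\s+/).filter(Boolean);
  // Strip punctuation so "network." still matches "network".
  const plain = words.map((w) => w.word.toLowerCase().replace(/[^\p{L}\p{N}']/gu, ""));

  for (let i = 0; i + target.length <= plain.length; i++) {
    if (target.every((t, j) => plain[i + j] === t)) return words[i].start;
  }
  return null;
}

// Usage: const t = findPhraseStart(words, "neural network");
//        if (t !== null) audio.currentTime = t;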

Use case four: accessibility-aligned narration

Accessibility narration of long-form text content (news articles, blog posts, documentation pages) is a real use case in its own right. Synthetic narration with word timestamps gives the page a built-in audio mode that goes beyond "play the article" to "show me where you are reading."

The audio + highlighting combination addresses two accessibility profiles at once: low-vision users who benefit from audio playback, and reading-difficulty users who benefit from the visual reinforcement of seeing each word as it is spoken. The same feature serves both groups, where two separate features would be needed without the timestamps.

A note on what this does and does not deliver. Synthetic narration with word timestamps is a useful audio mode, but it is not a substitute for screen-reader compatibility. Screen-reader users have their own preferred voices, navigation patterns, and document-structure expectations. The audio mode on the page is for users who do not run a screen reader at all but prefer audio for other reasons. The two are complementary; the audio mode does not replace ARIA, semantic HTML, or proper heading structure.

Use case five: subtitles and captions

The word-timestamp output is the source data for caption files. Word-level timestamps can be aggregated into phrase-level or sentence-level cues, exported as SRT or VTT, and shipped alongside the audio for any platform that consumes captions.

The aggregation logic is simple. Group consecutive words into a cue until the cue duration exceeds a threshold (around 4-5 seconds is a common cap for readability), the cumulative character count exceeds a threshold (around 80 characters for two-line captions), or sentence-ending punctuation appears. Emit the cue with the start of the first word and the end of the last word.
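
A sketch of that aggregation plus a WebVTT serializer, assuming sentence-final punctuation shows up in the word text (if the tool strips it, split on your own source text instead); the thresholds are the ones mentioned above:

interface Cue { start: number; end: number; text: string }

// Group consecutive words into caption cues using the thresholds described above.
function buildCues(words: WordTimestamp[], maxSeconds = 5, maxChars = 80): Cue[] {
  const cues: Cue[] = [];
  let current: WordTimestamp[] = [];

  const flush = () => {
    if (current.length === 0) return;
    cues.push({
      start: current[0].start,
      end: current[current.length - 1].end,
      text: current.map((w) => w.word).join(" "),
    });
    current = [];
  };

  for (const w of words) {
    current.push(w);
    const duration = w.end - current[0].start;
    const chars = current.map((x) => x.word).join(" ").length;
    if (duration >= maxSeconds || chars >= maxChars || /[.!?]$/.test(w.word)) flush();
  }
  flush();
  return cues;
}

// Serialize as WebVTT; cue timestamps use HH:MM:SS.mmm.
function toVtt(cues: Cue[]): string {
  const pad = (n: number, len: number) => String(n).padStart(len, "0");
  const fmt = (t: number) => {
    const ms = Math.round(t * 1000);
    return `${pad(Math.floor(ms / 3600000), 2)}:${pad(Math.floor(ms / 60000) % 60, 2)}:` +
           `${pad(Math.floor(ms / 1000) % 60, 2)}.${pad(ms % 1000, 3)}`;
  };
  return "WEBVTT\n\n" +
    cues.map((c) => `${fmt(c.start)} --> ${fmt(c.end)}\n${c.text}`).join("\n\n") + "\n";
}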

For audio that will be embedded in video (a tutorial, a course module, a marketing video), this gives you captions for free as a byproduct of the audio production. Caption files improve accessibility, improve search-engine indexing of the video, and let users watch with sound off, all from the same word-timestamp data the audio production already generated.

[Diagram: word-timestamp data flowing into four downstream artifacts: follow-along player, language-learning cards, navigation index, and SRT/VTT caption files, each labeled with the specific data fields it uses. One timestamp generation produces all four.]

What the timestamps cannot do

A handful of things are tempting but require more than the timestamp data alone.

Speaker labels. The TTS output is single-speaker by construction. If you generated the audio with one voice, every word belongs to that one voice. For multi-speaker scenarios, you generate each speaker's lines separately, get a separate timestamp array per speaker, and merge them with explicit speaker labels in your application code. The timestamps do not detect speaker changes; your code supplies the labels.
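
A sketch of that merge, assuming you know where each speaker's clip starts in the final mix (the offset field); the structure is illustrative:

interface SpeakerWord extends WordTimestamp { speaker: string }

// Illustrative: each speaker's clip was generated separately, so its timestamps
// start at zero. Offset each clip by where it sits in the final mix, attach the
// speaker label, then merge everything into one sorted timeline.
function mergeSpeakers(
  clips: { speaker: string; offset: number; words: WordTimestamp[] }[]
): SpeakerWord[] {
  return clips
    .flatMap(({ speaker, offset, words }) =>
      words.map((w) => ({ speaker, word: w.word, start: w.start + offset, end: w.end + offset }))
    )
    .sort((a, b) => a.start - b.start);
}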

Emotion or emphasis tags. The timestamps tell you when each word starts and ends. They do not tell you which words the model emphasized. For applications that want to surface emphasis (highlight a key word more strongly than the rest of the sentence), you must derive emphasis from the script (italics, all-caps, an explicit emphasis tag) rather than from the audio itself.

Phoneme-level timing. The granularity is the word, not the phoneme. For phoneme-level applications (lip-sync animation, phoneme-by-phoneme highlighting, dyslexia-support reading patterns), word timestamps are too coarse. You would need a phoneme-aligned output, which is a different feature than what the option currently provides.

Cross-language alignment. The feature is English-only. For multilingual products, the audio mode works only on the English locale; other languages get audio but no timestamps, and you must build the experience differently for those locales.

A minimal end-to-end recipe

For a developer building the simplest useful version of a follow-along reader:

  1. Generate the audio with word timestamps enabled. Save the audio file and the timestamps as separate assets (timestamps as a JSON file alongside the audio).
  2. Render the source text as <span data-i="N">word</span> for each word, where N is the word's index in the timestamps array (a sketch of this step follows the list).
  3. Load both assets on page mount. Wire the audio element's timeupdate to a function that finds the word whose [start, end] contains the current time, applies a .current class to that span, and removes the class from the previous span.
  4. Add an onclick handler on the spans that seeks the audio to the word's start.
  5. Style .current with a background highlight and a smooth transition.
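
Step 2 has the one real subtlety: the spans must line up one-to-one with the timestamps array, so it is safer to render from the words array itself than to re-tokenize the source text. A minimal sketch, reusing the loader and highlight wiring sketched earlier (element IDs and JSON path illustrative):

// Render one span per timestamp entry so data-i always matches the array index.
// (Escape w.word if your source text can contain markup; TTS input usually cannot.)
function renderTranscript(container: HTMLElement, words: WordTimestamp[]) {
  container.innerHTML = words
    .map((w, i) => `<span data-i="${i}">${w.word}</span>`)
    .join(" ");
}

// Wire-up on page mount; loadTimestamps and attachHighlighting are the earlier sketches.
const audio = document.querySelector<HTMLAudioElement>("#narration")!;
loadTimestamps("/article-words.json").then((words) => {
  renderTranscript(document.querySelector<HTMLElement>("#transcript")!, words);
  attachHighlighting(audio, words);
});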

The whole thing is roughly fifty lines of code in any modern frontend stack. The result is a reading experience that feels custom-built and does not exist in any off-the-shelf audio player.

Why this matters more than it looks

Most audio players on the web are bad. They show a waveform and a play button and assume the user is happy just listening. The word-timestamp data shifts that assumption. With timestamps, the audio is no longer a sealed media file; it is a structured object the page can interact with: clicking, seeking, highlighting, captioning, segmenting.

The cost of generating with timestamps is small. The interaction quality you can build on top of them is high. For any project where the audio is the page (audiobooks, audio articles, language-learning content, accessibility-narrated documentation), turning the option on and using the data is the difference between a player that does the bare minimum and one that earns the page real time-on-site.
