The Whisper prompt parameter: what it actually does (and doesn't)
The prompt field in Whisper-style transcription APIs is the most under-documented setting in the panel. Here is exactly what it does, what it does not do, and how to write one that actually works.
If you have used a transcription tool with a Whisper-style API, you have seen the prompt field. It usually has placeholder text like "guide the transcript style or provide context" and a help-text snippet about acronyms. People treat it like a search filter, or a content moderation control, or a Magic Words box. It is none of those things. It is one specific feature with one specific purpose, and most of the disappointment people report with transcription quality could be fixed in 30 seconds by writing a better prompt.
This is what the prompt actually does. And what it does not.

The history that explains the design
Whisper was originally designed to handle audio in 30-second chunks. When the audio is longer, the system slides the window forward and runs the model again on the next chunk. The prompt parameter exists to bridge those chunks: by passing the previous chunk's transcript as the prompt for the next chunk, the model gets a sense of context (style, speaker conventions, topic) and produces consistent output across chunk boundaries.
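A minimal sketch of that bridging loop, with a hypothetical transcribe_chunk standing in for the single-window model call:

```python
def transcribe_long_audio(chunks):
    """Transcribe a sequence of 30-second chunks, feeding each
    chunk's output back in as the prompt for the next one."""
    parts = []
    previous_text = ""  # the first chunk gets an empty prompt
    for chunk in chunks:
        # transcribe_chunk is hypothetical: one 30-second window
        # through the model, conditioned on the prior text
        text = transcribe_chunk(chunk, prompt=previous_text)
        parts.append(text)
        previous_text = text
    return " ".join(parts)
```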
That is the design intent. Three implications follow.
First, the prompt is a contextual hint, not a directive. It nudges the model toward a vocabulary and style; it does not command the model to use specific words.
Second, the prompt is a string of natural language, not a list of keywords. The model reads it as if it were the prior conversation, not as if it were a search query.
Third, and this is where most user errors come from: the prompt only influences what the model outputs as text. It does not change what gets transcribed (every speech segment is still transcribed), it does not change the language detected (the model picks language from the audio), and it does not change the timestamps.
What the prompt does well
Three concrete uses, all of which produce visible improvements:
Spelling of proper names and uncommon terms. If your audio is a podcast about Cloudflare's Workers product and the model is rendering "Cloudflare" as "cloud-flair" or "Workers" as a generic noun, drop a sentence into the prompt: "This audio discusses Cloudflare's Workers product and related developer tools." The next transcription will have the right capitalization and spelling.
Domain jargon. A medical podcast on cardiology benefits from a prompt that mentions "pulmonary embolism, atrial fibrillation, transesophageal echocardiogram." The model is better able to land on the right spelling for terms it would otherwise approximate.
Output style. If you want full-formed sentences with capitalization and punctuation, model your prompt that way: "The following is a transcribed lecture on Roman history. The professor speaks in complete, well-punctuated sentences." If you want lower-case stream-of-consciousness for a podcast that is actually conversational, write the prompt that way too. The model picks up the register.
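In code, the prompt is one extra argument on the transcription call. Here is a sketch using OpenAI's Python SDK, though any Whisper-style API with a prompt field works the same way (the file name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("lecture.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # contextual hint: topic, register, and the exact spellings
        # you want to see back in the output
        prompt=(
            "The following is a transcribed lecture on Roman history. "
            "The professor speaks in complete, well-punctuated sentences."
        ),
    )

print(transcript.text)
```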
What the prompt does poorly or not at all
The failure modes are easier to show than to summarize. Here are the things people try that do not work, and why each one fails.
Word lists do not work as well as sentences. A prompt of "WebAssembly, gRPC, CORS, OAuth, JWT" is parseable by the model but performs worse than "This audio discusses web technologies including WebAssembly, gRPC, CORS, OAuth, and JWT authentication." The latter gives the model grammatical and conceptual context; the former gives it tokens with no relationships between them.
The prompt does not censor or omit content. A prompt that says "Do not transcribe profanity" or "Skip filler words like 'um' and 'uh'" will be ignored. The model transcribes what it hears. Filtering happens after, in a post-processing step you write yourself.
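That post-processing step can be as small as a regex pass over the finished transcript. A minimal sketch for English filler words:

```python
import re

FILLERS = re.compile(r"\b(um|uh|erm)\b[,.]?\s*", flags=re.IGNORECASE)

def strip_fillers(text: str) -> str:
    """Remove filler words after transcription; the model itself
    transcribes them faithfully, so cleanup happens here."""
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

print(strip_fillers("So, um, the deploy, uh, failed twice."))
# -> "So, the deploy, failed twice."
```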
The prompt does not change the language. If your audio is in Mandarin and you write the prompt in English, the audio still gets transcribed in Mandarin. Use the dedicated language hint setting if you need to override auto-detect.
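In OpenAI's API, that dedicated setting is the language parameter, which takes an ISO-639-1 code (continuing with the client from the sketch above):

```python
with open("podcast.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="zh",  # force Mandarin instead of auto-detect
        # a prompt written in the audio's language tends to land best
        prompt="这是一段关于软件架构的播客。",  # "a podcast about software architecture"
    )
```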
The prompt does not steer the model to specific topics or filter segments. "Only transcribe the parts about software architecture" is not how this works. Every utterance in the audio gets transcribed. Selecting from the transcript afterward is a different operation.
The prompt has a length cap. Whisper-style APIs use the last 224 tokens of the prompt. (Tokens are not characters; for English text, 224 tokens is roughly 150-180 words.) Anything before that is dropped silently. If you write a 1000-word prompt, the model only sees the last sixth of it. Keep it short.
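You can check where a prompt falls against that cap with tiktoken, using the GPT-2 encoding as a rough proxy for Whisper's own tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # approximation of Whisper's BPE

def visible_prompt(prompt: str, cap: int = 224) -> str:
    """Return the tail of the prompt the model will actually see."""
    tokens = enc.encode(prompt)
    if len(tokens) > cap:
        print(f"warning: {len(tokens) - cap} leading tokens dropped")
    return enc.decode(tokens[-cap:])
```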
Writing a good prompt
A practical recipe.
- State the topic. One sentence that tells the model what the audio is broadly about.
- List proper nouns and jargon you expect to appear, in a sentence. Not as a comma-separated list. Real grammatical sentences.
- Give the model a tone and register. "This is a casual conversation between two friends" produces different output than "This is a formal lecture in an academic setting."
- Keep it under ~150 words. Above that, you are writing tokens that get dropped.
- Use the exact spelling and capitalization you want in the output. Write "JavaScript" if you want JavaScript. Write "JS" if you want JS. The model imitates what it sees.
Example prompt for a developer podcast:
This is a podcast episode about WebAssembly performance, with hosts discussing Rust, V8, JavaScript engines, and tools like wasm-pack and emscripten. The hosts speak conversationally and reference companies including Cloudflare, Google, and Anthropic.
That prompt fixes a specific cluster of common mis-transcriptions (emscripten is rare enough that the model often spells it wrong without the hint) and pins capitalization for the brands.
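If you assemble prompts programmatically, a small hypothetical helper can enforce the recipe, length budget included:

```python
def build_prompt(topic: str, terms: list[str], register: str) -> str:
    """Fold topic, jargon, and register into one natural-language prompt."""
    term_sentence = (
        "Expect terms including "
        + ", ".join(terms[:-1]) + ", and " + terms[-1] + "."
    )
    prompt = f"{topic} {term_sentence} {register}"
    # rough guard for the ~150-word budget behind the 224-token cap
    assert len(prompt.split()) < 150, "prompt likely exceeds the token cap"
    return prompt

prompt = build_prompt(
    topic="This is a podcast episode about WebAssembly performance.",
    terms=["Rust", "V8", "wasm-pack", "emscripten", "Cloudflare"],
    register="The hosts speak conversationally.",
)
```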
What this means in practice
Three takeaways:
The prompt is a low-cost lever that fixes a specific class of error (spelling of names, jargon, brand capitalization, output style). When that class of error is what is bothering you, the prompt is the right intervention and the change is dramatic.
The prompt does not fix audio quality, does not change language detection, does not censor or filter, and does not steer the model toward specific segments. If your problem is in any of those buckets, you need a different lever.
Most users underuse the prompt because the placeholder text in the UI undersells it. Try it once on a real piece of audio with a 30-word topic sentence and watch the proper nouns clean up. The first time you see "Cloudflare" come out correctly capitalized after weeks of "cloud flare," you understand why this parameter exists.