The Whisper prompt parameter: what it actually does (and doesn't)
The prompt field in Whisper-style transcription APIs is the most under-documented setting in the panel. Here is exactly what it does, what it does not do, and how to write one that actually works.
If you have used a transcription tool with a Whisper-style API, you have seen the prompt field. It usually has placeholder text like "guide the transcript style or provide context" and a help-text snippet about acronyms. People treat it like a search filter, or a content moderation control, or a Magic Words box. It is none of those things. It is one specific feature with one specific purpose, and most of the disappointment people report with transcription quality could be fixed in 30 seconds by writing a better prompt.
This is what the prompt actually does. And what it does not.

The history that explains the design
Whisper was originally designed to handle audio in 30-second chunks. When the audio is longer, the system slides the window forward and runs the model again on the next chunk. The prompt parameter exists to bridge those chunks: by passing the previous chunk's transcript as the prompt for the next chunk, the model gets a sense of context (style, speaker conventions, topic) and produces consistent output across chunk boundaries.
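A minimal sketch of that bridging loop, with a hypothetical transcribe_chunk standing in for the single-window model call:

```python
def transcribe_long_audio(chunks):
    """Transcribe a sequence of 30-second chunks, feeding each
    chunk's output back in as the prompt for the next one."""
    parts = []
    previous_text = ""  # the first chunk gets an empty prompt
    for chunk in chunks:
        # transcribe_chunk is hypothetical: one 30-second window
        # through the model, conditioned on the prior text
        text = transcribe_chunk(chunk, prompt=previous_text)
        parts.append(text)
        previous_text = text
    return " ".join(parts)
```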
That is the design intent. Three implications follow.
First, the prompt is a contextual hint, not a directive. It nudges the model toward a vocabulary and style; it does not command the model to use specific words.
Second, the prompt is a string of natural language, not a list of keywords. The model reads it as if it were the prior conversation, not as if it were a search query.
Third, and this is where most user errors come from: the prompt only influences what the model outputs as text. It does not change what gets transcribed (every speech segment is still transcribed), it does not change the language detected (the model picks language from the audio), and it does not change the timestamps.
What the prompt does well
Three concrete uses, all of which produce visible improvements:
Spelling of proper names and uncommon terms. If your audio is a podcast about Cloudflare's Workers product and the model is rendering "Cloudflare" as "cloud-flair" or "Workers" as a generic noun, drop a sentence into the prompt: "This audio discusses Cloudflare's Workers product and related developer tools." The next transcription will have the right capitalization and spelling.
Domain jargon. A medical podcast on cardiology benefits from a prompt that mentions "pulmonary embolism, atrial fibrillation, transesophageal echocardiogram." The model is better able to land on the right spelling for terms it would otherwise approximate.
Output style. If you want full-formed sentences with capitalization and punctuation, model your prompt that way: "The following is a transcribed lecture on Roman history. The professor speaks in complete, well-punctuated sentences." If you want lower-case stream-of-consciousness for a podcast that is actually conversational, write the prompt that way too. The model picks up the register.
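In code, the prompt is one extra argument on the transcription call. Here is a sketch using OpenAI's Python SDK, though any Whisper-style API with a prompt field works the same way (the file name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("lecture.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # contextual hint: topic, register, and the exact spellings
        # you want to see back in the output
        prompt=(
            "The following is a transcribed lecture on Roman history. "
            "The professor speaks in complete, well-punctuated sentences."
        ),
    )

print(transcript.text)
```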
What the prompt does poorly or not at all
The failure modes are easier to show than to summarize. Here are the things people try that do not work, and why each one fails.
Word lists do not work as well as sentences. A prompt of "WebAssembly, gRPC, CORS, OAuth, JWT" is parseable by the model but performs worse than "This audio discusses web technologies including WebAssembly, gRPC, CORS, OAuth, and JWT authentication." The latter gives the model grammatical and conceptual context; the former gives it tokens with no relationships between them.
The prompt does not censor or omit content. A prompt that says "Do not transcribe profanity" or "Skip filler words like 'um' and 'uh'" will be ignored. The model transcribes what it hears. Filtering happens after, in a post-processing step you write yourself.
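That post-processing step can be as small as a regex pass over the finished transcript. A minimal sketch for English filler words:

```python
import re

FILLERS = re.compile(r"\b(um|uh|erm)\b[,.]?\s*", flags=re.IGNORECASE)

def strip_fillers(text: str) -> str:
    """Remove filler words after transcription; the model itself
    transcribes them faithfully, so cleanup happens here."""
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

print(strip_fillers("So, um, the deploy, uh, failed twice."))
# -> "So, the deploy, failed twice."
```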
The prompt does not change the language. If your audio is in Mandarin and you write the prompt in English, the audio still gets transcribed in Mandarin. Use the dedicated language hint setting if you need to override auto-detect.
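In OpenAI's API, that dedicated setting is the language parameter, which takes an ISO-639-1 code (continuing with the client from the sketch above):

```python
with open("podcast.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="zh",  # force Mandarin instead of auto-detect
        # a prompt written in the audio's language tends to land best
        prompt="这是一段关于软件架构的播客。",  # "a podcast about software architecture"
    )
```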
The prompt does not steer the model to specific topics or filter segments. "Only transcribe the parts about software architecture" is not how this works. Every utterance in the audio gets transcribed. Selecting from the transcript afterward is a different operation.
The prompt has a length cap. Whisper-style APIs use the last 224 tokens of the prompt. (Tokens are not characters; for English text, 224 tokens is roughly 150-180 words.) Anything before that is dropped silently. If you write a 1000-word prompt, the model only sees the last sixth of it. Keep it short.
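You can check where a prompt falls against that cap with tiktoken, using the GPT-2 encoding as a rough proxy for Whisper's own tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # approximation of Whisper's BPE

def visible_prompt(prompt: str, cap: int = 224) -> str:
    """Return the tail of the prompt the model will actually see."""
    tokens = enc.encode(prompt)
    if len(tokens) > cap:
        print(f"warning: {len(tokens) - cap} leading tokens dropped")
    return enc.decode(tokens[-cap:])
```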
Writing a good prompt
A practical recipe.
- State the topic. One sentence that tells the model what the audio is broadly about.
- List proper nouns and jargon you expect to appear, in a sentence. Not as a comma-separated list. Real grammatical sentences.
- Give the model a tone and register. "This is a casual conversation between two friends" produces different output than "This is a formal lecture in an academic setting."
- Keep it under ~150 words. Above that, you are writing tokens that get dropped.
- Use the exact spelling and capitalization you want in the output. Write "JavaScript" if you want JavaScript. Write "JS" if you want JS. The model imitates what it sees.
Example prompt for a developer podcast:
This is a podcast episode about WebAssembly performance, with hosts discussing Rust, V8, JavaScript engines, and tools like wasm-pack and emscripten. The hosts speak conversationally and reference companies including Cloudflare, Google, and Anthropic.
That prompt fixes a specific cluster of common mis-transcriptions (emscripten is rare enough that the model often spells it wrong without the hint) and pins capitalization for the brands.
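If you assemble prompts programmatically, a small hypothetical helper can enforce the recipe, length budget included:

```python
def build_prompt(topic: str, terms: list[str], register: str) -> str:
    """Fold topic, jargon, and register into one natural-language prompt."""
    term_sentence = (
        "Expect terms including "
        + ", ".join(terms[:-1]) + ", and " + terms[-1] + "."
    )
    prompt = f"{topic} {term_sentence} {register}"
    # rough guard for the ~150-word budget behind the 224-token cap
    assert len(prompt.split()) < 150, "prompt likely exceeds the token cap"
    return prompt

prompt = build_prompt(
    topic="This is a podcast episode about WebAssembly performance.",
    terms=["Rust", "V8", "wasm-pack", "emscripten", "Cloudflare"],
    register="The hosts speak conversationally.",
)
```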
What this means in practice
Three takeaways:
The prompt is a low-cost lever that fixes a specific class of error (spelling of names, jargon, brand capitalization, output style). When that class of error is what is bothering you, the prompt is the right intervention and the change is dramatic.
The prompt does not fix audio quality, does not change language detection, does not censor or filter, and does not steer the model toward specific segments. If your problem is in any of those buckets, you need a different lever.
Most users underuse the prompt because the placeholder text in the UI undersells it. Try it once on a real piece of audio with a 30-word topic sentence and watch the proper nouns clean up. The first time you see "Cloudflare" come out correctly capitalized after weeks of "cloud flare," you understand why this parameter exists.