Pick the right text-to-speech format for your output

Many modern TTS APIs let you request audio in up to seven output formats: MP3, AAC, OGG, Opus, FLAC, WAV, and PCM. Most people pick MP3 once, never change it, and ship the audio. That is fine for a lot of projects. It is not fine for all of them. The cost of picking the wrong format ranges from "the file is twice as big as it needs to be" through "the audio sounds noticeably worse than it has to" to "the player on the destination platform refuses to load it."

This is a practical guide to which format to ask for, organized by where the audio is going. The trade-offs are the focus; the audiophile theory and the marketing claims are out of scope.

The seven formats split into three groups by what they are doing to your audio.

Lossy compression: MP3, AAC, OGG, Opus. These throw away parts of the signal a human ear is unlikely to miss, then encode the rest. They produce small files and are designed for transmission and playback, not editing.

Lossless compression: FLAC. Keeps every sample bit-perfect but takes up roughly half the size of WAV at the same fidelity. Built for archiving and editing pipelines that may re-encode the audio later.

Uncompressed: WAV, PCM. Raw audio samples, no compression at all. Largest files, no decode cost, every editing tool understands them. PCM is the same data as WAV without the WAV header that tells a media player what it is looking at, which is why most players cannot open PCM directly.
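To make the "PCM is WAV without the header" point concrete, here is a minimal sketch that wraps raw PCM bytes in a WAV container using Python's standard `wave` module. The helper name `pcm_to_wav` is hypothetical, and the defaults (24 kHz, 16-bit, mono) are assumptions: the real values must match whatever your TTS provider documents for its PCM output, or the result will play at the wrong speed or as noise.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the WAV bytes.

    The defaults are assumptions, not a provider guarantee.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)      # the samples themselves are unchanged
    return buf.getvalue()


# One second of 16-bit mono silence as stand-in PCM data.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)
```

The audio data is copied through untouched; the only thing added is the short header that tells a player the sample rate, bit depth, and channel count.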

For pure spoken audio with one voice and no music bed, the perceptual gap between formats is smaller than for music. Speech is sparse compared to a busy mix, the codecs all do well at low bitrates, and the differences show up at the edges (sibilants, plosives, very quiet passages) rather than in the meat of the narration.

[Figure: the seven output formats by category. Lossy (MP3, AAC, OGG, Opus), lossless (FLAC), and uncompressed (WAV, PCM), with relative file sizes shown as bar widths and a one-word use-case tag under each format.]

Format-by-destination cheat sheet

The right format is almost always determined by where the audio is going next, not by some abstract notion of quality.

Web embed and small podcast feeds → MP3. Universal browser support, works in every podcast app ever made, every CDN serves it without a hiccup. Default for embedded HTML5 audio players, marketing pages, in-app announcement audio. The only place MP3 falls short is when bandwidth is genuinely tight, in which case Opus is better.

Modern podcast app, voice agent, real-time stream → Opus. For spoken word, Opus at 64 kbps holds up well against MP3 at 192 kbps, and the file is a third the size. WebRTC, Discord, modern podcast apps, and most voice-agent infrastructure use Opus internally. If your delivery target is "anywhere built in the last five years," Opus is the right call. The catch: very old devices and some legacy podcast hosts do not handle Opus, so confirm the destination supports it before committing.

iOS app, Apple ecosystem player → AAC. AAC at 128 kbps sounds about the same as MP3 at 160 kbps, takes 20 percent less storage, and is the native format for everything Apple ships. If your audio lives in an Apple-flavored pipeline (iOS app bundle, AirPlay stream, GarageBand project), AAC is what the platform was designed around.

Audio you will edit in a DAW, mix into a podcast, or ship to a freelance editor → WAV or FLAC. Lossless is the right call any time the audio will go through more processing. Re-compressing already-compressed audio is the path to artifacts. WAV is the universal raw format every tool supports. FLAC is half the size of WAV and every modern audio editor handles it; the only places FLAC fails are some legacy broadcast workflows and certain video editors that prefer WAV.

Long-term archive of your generated audio → FLAC. Lossless, much smaller than WAV, and readable by most modern players and audio editors, with some video editors as the main exception. If you generate a lot of TTS and want to keep masters around for re-mixing later, FLAC is the format that stops you from running out of disk space.

Embedded systems, telephony, raw signal processing → PCM. Raw samples without any container. Used when the consumer is going to wrap the audio in its own container (telephony systems, custom hardware, audio analysis pipelines). Almost no end-user player can open PCM directly, so do not pick this format unless you have a specific consumer that asks for it.

Older Android device, open-source-only stack, or a player that accepts only Vorbis → OGG. OGG (with Vorbis inside) was the open-source alternative to MP3 in the 2000s and still has support in some open ecosystems. For new projects, MP3 or Opus is almost always the better choice. OGG remains useful when your delivery platform specifically requires it.

What you actually save by picking Opus

The headline number is real. A spoken-word file at 64 kbps Opus is roughly:

33 percent the size of the same audio at 192 kbps MP3.
50 percent the size of the same audio at 128 kbps MP3.
25 percent the size of the same audio at 256 kbps AAC.

For a single-narrator podcast that runs 30 minutes, that is the difference between a file of roughly 43 MB (192 kbps MP3) and one of roughly 14 MB (64 kbps Opus). For a publisher serving thousands of downloads per episode, the bandwidth difference becomes real money. For a mobile app shipping bundled narration, it is the difference between a comfortable app size and an over-the-air download warning.
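The arithmetic behind these ratios is just bitrate times duration. A minimal sketch, assuming a constant bitrate and ignoring container overhead and metadata (the helper name `file_size_mb` is made up for illustration):

```python
def file_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size in MB (10^6 bytes) of a constant-bitrate stream."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000


thirty_minutes = 30 * 60  # seconds

for label, kbps in [("Opus 64", 64), ("MP3 128", 128),
                    ("MP3 192", 192), ("AAC 256", 256)]:
    print(f"{label} kbps: {file_size_mb(kbps, thirty_minutes):.1f} MB")
```

Because size scales linearly with bitrate, the ratio between two formats is simply the ratio of their bitrates, whatever the episode length.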

The trade-off comes in three parts. Devices and software older than around 2018 tend to have shaky Opus support. Some legacy podcast hosts auto-transcode Opus into MP3 anyway, which gives you the worst of both: an MP3-sized file with the losses of the earlier Opus encode baked in. And some listeners report that very old Bluetooth headphones do not handle Opus cleanly. For most modern delivery targets these are non-issues. For broad-compatibility distribution, MP3 is still the safer pick.

Bitrate is where most people get this wrong

Pick a format and then pick the bitrate. The format menu sets the encoding family; the API defaults at most providers (OpenAI, Azure, ElevenLabs, Google Cloud TTS) handle bitrate sensibly for spoken audio. The mistake people make is encoding their TTS at music-grade bitrates because they have heard "320 kbps is best." It is not best for synthetic speech. It is wasted bandwidth.

Sensible bitrates for synthesized speech, by use case:

Voice agent, IVR, real-time stream: 32–48 kbps Opus or 64 kbps AAC.
Podcast and audiobook delivery: 64–96 kbps Opus, 96–128 kbps MP3, or 96–128 kbps AAC.
Web page embed, in-app announcement: 96–128 kbps MP3 (max compatibility).
Editing master before mixing: FLAC or WAV at the native sample rate.
Long-term archive: FLAC.
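The lossy rows of this list fit in a small lookup table, which is handy as a guard in a generation script. This is a sketch only: the use-case keys and the function name `check_bitrate` are hypothetical, and the ranges are transcribed from the list above, not taken from any provider's documentation.

```python
# Recommended (low_kbps, high_kbps) per codec, transcribed from the list above.
SPEECH_BITRATES = {
    "realtime": {"opus": (32, 48), "aac": (64, 64)},
    "podcast": {"opus": (64, 96), "mp3": (96, 128), "aac": (96, 128)},
    "web_embed": {"mp3": (96, 128)},
}


def check_bitrate(use_case: str, codec: str, kbps: int) -> str:
    """Classify a requested bitrate against the recommended range."""
    low, high = SPEECH_BITRATES[use_case][codec]
    if kbps < low:
        return "below range"
    if kbps > high:
        return "above range"
    return "in range"


print(check_bitrate("podcast", "mp3", 256))  # above range
```

An unknown use case or codec raises `KeyError`, which is a reasonable failure mode for a pre-flight check; the lossless rows are left out because FLAC and WAV have no bitrate to choose.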

If you find yourself reaching for 256 kbps MP3 for spoken audio, stop. The extra bits are not making the audio sound more natural; they are spent on detail the model never produced.

[Figure: reference card listing six common TTS destinations (web embed, modern podcast, iOS app bundle, DAW edit, archive, telephony) with a recommended format and bitrate per row, plus a short rationale.]

One thing the format does not fix

The format converts the audio you have into bytes; it does not change what the model produced. A TTS voice that sounds slightly metallic in MP3 will still sound slightly metallic in WAV, because the metallic quality is in the synthesis itself, not in the encoding. Switching to a lossless format will not save a recording where the wrong voice was picked, the speed was set too high, or the script had a homograph the model mispronounced. Re-generate at the source instead of trying to clean up at the codec layer.

This is the ordinary case for synthetic audio. The exotic case is when the audio is going through additional processing (a voice agent stack, a podcast mix, an editor's pass) and you genuinely need the headroom that lossless gives you. Reach for FLAC or WAV in that case, then deliver in whatever lossy format the destination accepts.
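One way to sketch the "lossless master, lossy delivery" step is to build an ffmpeg command from the master file. The example below only constructs the argument list and does not run it. It assumes ffmpeg is installed and was built with the `libmp3lame` and `libopus` encoders; the function name `delivery_command` and the file names are hypothetical.

```python
def delivery_command(master_path: str, out_path: str,
                     codec: str, bitrate_kbps: int) -> list[str]:
    """Build an ffmpeg argument list that encodes a lossless master
    into a lossy delivery file. Nothing is executed here."""
    # ffmpeg encoder names; availability depends on how ffmpeg was built.
    encoders = {"mp3": "libmp3lame", "opus": "libopus", "aac": "aac"}
    return [
        "ffmpeg",
        "-i", master_path,            # lossless source (FLAC or WAV)
        "-c:a", encoders[codec],      # audio encoder
        "-b:a", f"{bitrate_kbps}k",   # target audio bitrate
        out_path,
    ]


cmd = delivery_command("episode_master.flac", "episode.opus", "opus", 64)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would perform the encode. Always encoding from the master, never from a previous lossy file, is what keeps the artifacts to a single generation.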

A simple decision rule

If you remember nothing else: pick MP3 when in doubt; pick Opus when bandwidth is the constraint and you control the player; pick FLAC or WAV when the audio will be edited; pick PCM only when something specifically demands it.
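The same rule written as a function, purely as an illustration. The parameter names are made up, and the order in which the checks are applied is an assumption, since the rule above does not say which condition wins when several are true.

```python
def pick_format(will_be_edited: bool = False,
                bandwidth_constrained: bool = False,
                controls_player: bool = False,
                consumer_requires_pcm: bool = False) -> str:
    """Return a suggested output format following the decision rule above.

    Precedence (PCM, then lossless, then Opus, then MP3) is an assumption.
    """
    if consumer_requires_pcm:
        return "pcm"    # only when something specifically demands it
    if will_be_edited:
        return "flac"   # WAV is the equally valid alternative
    if bandwidth_constrained and controls_player:
        return "opus"
    return "mp3"        # the "when in doubt" default


print(pick_format())                                  # mp3
print(pick_format(bandwidth_constrained=True,
                  controls_player=True))              # opus
```

Note that tight bandwidth alone does not select Opus here: without control of the player, the function falls back to MP3, which mirrors the compatibility caveat in the rule.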

Most one-shot TTS jobs are MP3 jobs. Most production pipelines that go through more than one step want FLAC at the source. Almost nobody needs PCM unless they already know they need it.

The format dropdown is a small choice that quietly compounds across thousands of files. Picking deliberately once saves work for everyone downstream.
