Source audio, duration, and export checklist for AI Audio to Audio

The audio-to-audio panel on Z.Tools accepts a small set of file shapes and rejects the rest before the upload finishes. These five rules cover almost every "why didn't this work" question.

The audio-to-audio panel on Z.Tools accepts a small set of file shapes and rejects the rest before the upload finishes. Most of the time this is silent and helpful. Occasionally it is silent and frustrating, because the rejection happens before the request reaches the model and the error message is not always specific enough to tell you what went wrong.

This is the missing pre-upload checklist: five rules that cover almost every "why didn't this work" question I have seen.

Source audio: which models need it

MiniMax Music Cover requires a source clip. There is no from-scratch path on this model; if you do not upload audio, the request is rejected at the panel level before generation starts.

ACE-Step v1.5 Base and v1.5 Turbo accept a source clip optionally. With a source, the model treats it as a remix seed and behaves like a cover model. Without a source, the model generates from a prompt alone.

The decision is upstream: what do you have, and what do you want?

  • A song you want to hear in a different style: source audio, MiniMax or ACE-Step
  • A from-scratch generation from a prompt: ACE-Step only, no source
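The two cases above can be sketched as a small routing function. This is an illustration only: the function name `candidate_models` is hypothetical, and the strings are the display names used in this article, not API identifiers.

```python
def candidate_models(has_source_audio: bool) -> list[str]:
    """Return the models that fit, following the two cases listed above.

    The labels are display names for illustration, not real API ids.
    """
    ace_step = ["ACE-Step v1.5 Base", "ACE-Step v1.5 Turbo"]
    if has_source_audio:
        # Restyling an existing song: both model families take a source clip.
        return ["MiniMax Music Cover"] + ace_step
    # Prompt-only generation: MiniMax has no from-scratch path.
    return ace_step


if __name__ == "__main__":
    print(candidate_models(has_source_audio=False))
```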

Duration constraints

MiniMax Music Cover accepts source audio between 6 seconds and 6 minutes. Anything outside that range is rejected before upload finishes. The tool reads duration from file metadata, so a clip that fails to decode never reaches the provider; you see a rejection at the upload step.

ACE-Step has a wider envelope. Source audio between 6 seconds and roughly 5 minutes is the safe range, though the registry does not pin a hard maximum the same way MiniMax does. If you upload something longer than 5 minutes, the model accepts it but the credit hold and the output length both get sized from the source duration, which can produce a more expensive generation than you expected.
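A local pre-check along these lines can catch duration problems before you upload. It is a minimal sketch, not the panel's actual validation: the model labels `"minimax"` and `"ace-step"` are hypothetical, the bounds are assumed to be inclusive, and treating clips under 6 seconds as a failure on ACE-Step is an assumption drawn from the "safe range" wording. The duration helper only handles PCM WAV files, using Python's standard `wave` module.

```python
import wave

MINIMAX_RANGE_S = (6.0, 360.0)  # 6 seconds to 6 minutes, from the text above
ACE_STEP_MIN_S = 6.0            # lower end of the stated safe range
ACE_STEP_SOFT_MAX_S = 300.0     # "roughly 5 minutes"; not a hard limit


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header (frames / rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_source_duration(model: str, seconds: float) -> tuple[bool, str]:
    """Return (ok, note) for a source clip of the given length."""
    if model == "minimax":
        low, high = MINIMAX_RANGE_S
        if low <= seconds <= high:
            return True, "ok"
        return False, "outside 6 s to 6 min; expect a rejection at upload"
    if model == "ace-step":
        if seconds < ACE_STEP_MIN_S:
            return False, "below the 6 s safe range"
        if seconds > ACE_STEP_SOFT_MAX_S:
            # Accepted, but cost and output length follow the source duration.
            return True, "accepted; credit hold and output length scale with the source"
        return True, "ok"
    raise ValueError(f"unknown model label: {model}")
```

For MP3 sources you would need a different way to read the duration; the range check itself stays the same.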

The ACE-Step duration slider only applies when no source audio is uploaded. The slider's range is 6 to 300 seconds, with a default of 60. Once a source clip is set, the slider is hidden because the output length follows the source. If you want a 4-minute output and you have only a 90-second source, you cannot get there from one ACE-Step generation; you would need to extend the source first or run multiple generations and stitch them.
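The slider rule and the stitching arithmetic can be written down as a short sketch. The function names are hypothetical, and the stitching count assumes each generation is as long as the source and that pieces are joined end to end with no crossfade overlap.

```python
import math
from typing import Optional

SLIDER_MIN_S, SLIDER_MAX_S, SLIDER_DEFAULT_S = 6, 300, 60  # from the text above


def output_length_seconds(source_seconds: Optional[float] = None,
                          slider_seconds: float = SLIDER_DEFAULT_S) -> float:
    """Output length for ACE-Step: follows the source if one is set, else the slider."""
    if source_seconds is not None:
        # With a source clip the slider is hidden and its value is ignored.
        return source_seconds
    if not SLIDER_MIN_S <= slider_seconds <= SLIDER_MAX_S:
        raise ValueError("slider value must be between 6 and 300 seconds")
    return slider_seconds


def generations_to_cover(target_seconds: float, source_seconds: float) -> int:
    """How many source-length generations are needed to reach the target length."""
    return math.ceil(target_seconds / source_seconds)
```

With the example from the paragraph above, `generations_to_cover(240, 90)` gives 3: three 90-second results cover a 4-minute target, with some material left to trim.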

File format

Both models accept MP3 and WAV at the upload step. M4A from voice memos has to be converted; FLAC and OGG are not currently accepted as source even though they are valid output formats. The simplest conversion path on macOS is Audacity or Quick Look's "Open with > GarageBand" trick; on Windows, Audacity or VLC's convert/save feature.

ACE-Step normalizes every upload to stereo 48 kHz internally, regardless of what you upload. MiniMax has its own internal handling. There is no benefit to uploading at a higher sample rate than your source recording.

A practical rule: if you exported from a voice memo app, run the file through a converter to 192 kbps MP3 before uploading. Higher-bitrate files work too, but 192 kbps is plenty for what the model can read and removes any container-format edge case.
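If you prefer a scripted conversion, the 192 kbps step can be sketched with ffmpeg. This swaps in a different tool from the ones named above (Audacity, VLC) and assumes ffmpeg is installed and on your PATH with the `libmp3lame` encoder available. The function names and the `_192k` output suffix are hypothetical.

```python
import subprocess
from pathlib import Path


def mp3_192k_command(src: str) -> list[str]:
    """Build an ffmpeg command that re-encodes `src` to a 192 kbps MP3.

    The output goes next to the input as <name>_192k.mp3 so the source
    file is never overwritten, even when it is already an MP3.
    """
    source = Path(src)
    dst = source.with_name(source.stem + "_192k.mp3")
    return [
        "ffmpeg",
        "-i", src,             # input file (for example, an M4A voice memo)
        "-vn",                 # drop any video or cover-art stream
        "-c:a", "libmp3lame",  # encode the audio as MP3
        "-b:a", "192k",        # target audio bitrate
        str(dst),
    ]


def convert_to_mp3(src: str) -> str:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    cmd = mp3_192k_command(src)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling `convert_to_mp3("memo.m4a")` would write `memo_192k.mp3` in the same folder.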

Output format

The format selector at output time supports MP3, WAV, FLAC, and OGG. The selection is per-generation; the tool respects what you picked when the result downloads, so a WAV-selected generation comes back as a 16-bit 48 kHz WAV file rather than a transcoded MP3.

The history panel keeps each result with its original format. Re-downloading from history will not silently change the extension, which matters when you have committed a result to a project and want to fetch the same file later.

A rule for picking format: pick MP3 for everything that is going into video editing or social media; pick WAV when you are committing the result to a DAW project; pick FLAC when you want lossless and are storing the file for archive. OGG is rarely the right choice on the audio-to-audio output side.
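The rule of thumb above fits in a lookup table. This is a sketch; the destination labels and the function name are hypothetical, and falling back to MP3 for anything unlisted is an assumption.

```python
# Destination labels are made up for this example; the formats come from the rule above.
OUTPUT_FORMAT_BY_DESTINATION = {
    "video": "mp3",     # video editing
    "social": "mp3",    # social media
    "daw": "wav",       # committing the result to a DAW project
    "archive": "flac",  # lossless storage
}


def pick_output_format(destination: str) -> str:
    """Return the suggested output format, falling back to MP3."""
    return OUTPUT_FORMAT_BY_DESTINATION.get(destination.lower(), "mp3")
```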

When to convert before upload

Pre-upload conversion saves frustration in three cases. Voice memo files in M4A or AAC need to be converted to MP3 or WAV (Audacity handles both). Video files with embedded audio need the audio extracted first; the tool does not pull audio from video containers. And 96 kHz files from professional DAW sessions should be downsampled to 48 kHz before uploading, since the model normalizes there anyway.
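The three cases can be turned into a quick triage helper. It is a sketch under stated assumptions: the function name is hypothetical, the list of video extensions is illustrative rather than exhaustive, and flagging every sample rate above 48 kHz (not just 96 kHz) is a generalization of the advice above.

```python
from pathlib import Path
from typing import Optional

ACCEPTED_SOURCE_EXTS = {".mp3", ".wav"}          # accepted at the upload step
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}   # illustrative, not exhaustive


def preupload_steps(filename: str, sample_rate_hz: Optional[int] = None) -> list[str]:
    """List the conversions worth doing before uploading this file."""
    ext = Path(filename).suffix.lower()
    steps = []
    if ext in VIDEO_EXTS:
        steps.append("extract the audio track to MP3 or WAV")
    elif ext not in ACCEPTED_SOURCE_EXTS:
        # Covers M4A and AAC voice memos, and FLAC or OGG sources.
        steps.append("convert to MP3 or WAV")
    if sample_rate_hz is not None and sample_rate_hz > 48_000:
        steps.append("downsample to 48 kHz")
    return steps
```

An empty list means the file can go up as it is, at least as far as format and sample rate are concerned.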

A 30-second checklist

Before clicking generate:

  • The source file is MP3 or WAV.
  • The duration is between 6 seconds and 6 minutes for MiniMax, or under 5 minutes if you want a sane credit hold on ACE-Step.
  • The output format selector is set to what you actually need; the default is MP3 unless you change it.
  • If you are on ACE-Step and uploaded a source, the duration slider is hidden, which is correct.
  • If you are on MiniMax and the lyrics field is empty, the model will use the source vocal's words, which is what most cover users want.

What rejection messages mean

Three error patterns and what they actually point at:

"Source audio is required." You are on MiniMax with no upload, or the upload did not complete. Re-upload and try again.

"Duration outside the supported range." Your source is shorter than 6 seconds or longer than 6 minutes. Trim or extend before uploading.

"Failed to decode source audio." The file is corrupted, in an unsupported codec, or has metadata the parser cannot read. Re-export from your source app to MP3 or WAV. The decoder is reasonably permissive about MP3 variants but strict about codec headers; a clean re-encode usually fixes it.
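If you keep notes on these messages, a small lookup saves re-reading the list. It is a sketch only: the function name is hypothetical, and matching on a lowercase substring assumes the panel's wording stays close to the three messages quoted above.

```python
# Keys are lowercase fragments of the messages quoted above; values summarize the fix.
KNOWN_REJECTIONS = {
    "source audio is required":
        "Upload a source clip (MiniMax needs one) or re-upload if it did not complete.",
    "duration outside the supported range":
        "Trim or extend the source so it falls between 6 seconds and 6 minutes.",
    "failed to decode source audio":
        "Re-export from your source app to a clean MP3 or WAV.",
}


def suggest_fix(message: str) -> str:
    """Map a rejection message to the suggested next step."""
    lowered = message.lower()
    for fragment, fix in KNOWN_REJECTIONS.items():
        if fragment in lowered:
            return fix
    # Unknown message: fall back to trimming the file and retrying.
    return "Trim 10 seconds off either end of the file and try again."
```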

If you see a different error, the most reliable next step is to trim the file (cut 10 seconds off either end) and try again. Most decode failures are at the file boundaries.
