MiniMax Music Cover: a $0.15 cover that keeps the melody

MiniMax Music Cover does one thing well: it takes a song you already have and reimagines it in a different style while preserving the melody. Upload a clip between six seconds and six minutes, write a 250-character style prompt, get a finished cover back. The price is flat at $0.15 per generation. No per-second meter. No subscription required.

That pricing model is doing more work than it appears to. Most AI music systems charge per second of output, per generation, or per monthly credit bundle. Flat-per-cover quietly tells you what kind of model this is: one shot, one bill, one output. It is the right shape for an actual cover, and the wrong one if you want to iterate. I think that is the most useful thing to know before you decide whether to reach for it.

What MiniMax Music Cover actually does

The model takes the melodic and structural skeleton of a source recording and re-renders the rest. Voice timbre changes. Instrumentation changes. Genre changes. Arrangement changes. Tempo can change if you ask. The melody contour, the chord progression, and the song form stay close to the source.

This is a different operation from a remix and a different operation from a from-scratch generation. A remix typically keeps the recording and rearranges it. A from-scratch generation gives you a brand-new track tied to a prompt. Cover, in MiniMax's framing, sits between them: the source defines what gets sung; the prompt defines how it gets sung and played.

The model that powers it is part of MiniMax's Music line. MiniMax Music 2.6 launched on April 10, 2026, and Cover mode is the headline addition in that release. The 2.6 release also brought first-packet latency under 20 seconds and end-to-end chunk latency under 25 seconds, both meaningful for a tool that needs to feel responsive when you are testing several style prompts in a row.

A few specific behaviors are worth knowing; a short pre-flight sketch follows the list:

  • Source audio is required. There is no from-scratch path on this model. If you do not upload a clip, the request is rejected.
  • Source audio must be between 6 seconds and 6 minutes. The tool reads duration from file metadata before upload, so anything that fails to decode never reaches the provider.
  • Prompts are 10 to 300 characters. That is shorter than most people expect, and it is the most important constraint to internalize.
  • There is no negative prompt. If you want to keep something out of the result, you encode that in the positive prompt.
  • There is no strength dial, no cover-conditioning slider, no CFG knob. The model is engineered for one good result, not many tunable ones.
  • Lyrics are optional but accept structured section tags like [Intro], [Verse], [Chorus], [Bridge], and [Outro].
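
If you are scripting uploads, the first three constraints are cheap to check locally before spending the $0.15. Here is a minimal pre-flight sketch in Python; it assumes the mutagen library for reading duration from file metadata, and the limits are simply the ones listed above, so treat it as illustration rather than official client code.

    import sys
    from mutagen import File  # pip install mutagen; reads duration from file metadata

    MIN_SECONDS, MAX_SECONDS = 6, 6 * 60   # source clip must be 6 seconds to 6 minutes
    MIN_PROMPT, MAX_PROMPT = 10, 300       # style prompt must be 10 to 300 characters

    def preflight(path: str, style_prompt: str) -> list[str]:
        """Return a list of problems; an empty list means the request should go through."""
        problems = []
        audio = File(path)
        if audio is None or audio.info is None:
            problems.append("file did not decode, so it would never reach the provider")
        elif not MIN_SECONDS <= audio.info.length <= MAX_SECONDS:
            problems.append(f"duration {audio.info.length:.1f}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
        if not MIN_PROMPT <= len(style_prompt) <= MAX_PROMPT:
            problems.append(f"prompt is {len(style_prompt)} characters, needs {MIN_PROMPT}-{MAX_PROMPT}")
        return problems

    if __name__ == "__main__":
        issues = preflight(sys.argv[1], sys.argv[2])
        print("\n".join(issues) or "ok to upload")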

A worked example: acoustic ballad to brass-led 80s synthpop

I tested this with an old voice memo I had on my phone, a fingerpicked acoustic guitar piece I wrote two years ago and never produced. The recording is rough; the melody is fine. About 90 seconds, two verses and a chorus.

The style prompt I used:

Mid-80s synthpop cover with bright male lead, warm DX7 electric piano, gated reverb snare, syncopated bass synth, brass stabs on the chorus, big anthemic chorus drum fills, glittering chime accents, end on a half-time outro.

That is 251 characters. The 300-character limit is real and pinching; I had to cut a clause about the bridge to fit.

Anatomy of a 250-character MiniMax Music Cover prompt

The lyrics field stayed empty for the first run. MiniMax has a documented behavior where the model will use the source vocal for the melody contour and produce vocals in the cover style without you having to retype anything. For a cover meant to keep the original song's words, that is what you want.

For the second run I added a lyric skeleton with the retain-source pattern:

[Intro]
[Verse]
Keep the original lyrics and phrasing from the source vocal.
[Chorus]
Keep the original lyrics and phrasing from the source vocal.
[Bridge]
[Outro]

This is the safest way to tell the model to keep the words the source clip is already singing. The bracket tags do real work; the model does not infer structure from prose paragraphs. The lyric skeleton with the explicit retain hint produced the most faithful word-for-word cover.
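
If you assemble the lyrics field in a script, the skeleton is just a string of section tags with the retain hint repeated where you want it. A small helper like the one below keeps that consistent; the function and its names are mine for illustration, not part of any MiniMax SDK.

    RETAIN_HINT = "Keep the original lyrics and phrasing from the source vocal."

    def lyric_skeleton(sections: list[str], retain: set[str]) -> str:
        """Build a tag-structured lyrics field; sections named in `retain` get the hint."""
        lines = []
        for section in sections:
            lines.append(f"[{section}]")
            if section in retain:
                lines.append(RETAIN_HINT)
        return "\n".join(lines)

    # Reproduces the skeleton used in the second run above.
    print(lyric_skeleton(["Intro", "Verse", "Chorus", "Bridge", "Outro"],
                         retain={"Verse", "Chorus"}))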

The result on the second run was the kind of thing that feels surprisingly produced for a single-shot generation. Vocals had a different timbre, a fraction more compression, an obvious 80s lead-vocal character. The brass stabs landed on the right chord changes. The gated snare arrived in the chorus. The drum fill at the end of the second chorus did not quite have the dramatic arc I had in mind, but it was close enough that I would mix it rather than regenerate.

Source vs cover spectrogram comparison

Total cost: $0.30 for two generations. Total time: about 90 seconds for the first generation, 80 seconds for the second.

Why flat pricing is the right pricing for a one-shot model

Per-second pricing rewards iteration. You generate, you tweak a parameter, you generate again, and the bill scales with how much you experiment.

Flat pricing rewards conviction. You write the best prompt you can, you run it, and if you wanted to iterate you should have used a different model. That is a feature, not a bug, when the model has been engineered specifically for the one-shot use case. It also means the math is simple: ten covers a month is $1.50. Fifty covers a month is $7.50. Two hundred covers a month is $30. That is below most subscription tiers, with the obvious advantage that you only pay for the ones you actually generated.
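
To make that comparison concrete, here is the arithmetic as a short sketch. The flat price is the real one; the per-second rate, output length, and retry count are assumptions for illustration, not quotes from any specific competitor.

    FLAT_PER_COVER = 0.15      # MiniMax Music Cover: one flat charge per generation
    PER_SECOND_RATE = 0.002    # assumed rate for a hypothetical per-second model
    AVG_SECONDS = 180          # assumed average output length
    RETRIES_PER_COVER = 4      # assumed iteration count on the per-second model

    def monthly_cost_flat(covers: int) -> float:
        return covers * FLAT_PER_COVER

    def monthly_cost_per_second(covers: int) -> float:
        # Per-second pricing scales with output length and with how much you iterate.
        return covers * RETRIES_PER_COVER * AVG_SECONDS * PER_SECOND_RATE

    for n in (10, 50, 200):
        print(n, f"flat ${monthly_cost_flat(n):.2f}", f"per-second ${monthly_cost_per_second(n):.2f}")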

The flat price also discourages a failure mode I have seen with per-second models: producers run sweep after sweep at low strength values trying to find a magical result, and the bill quietly runs into double digits. MiniMax Music Cover does not let you do that, which is occasionally frustrating and frequently a relief.

Where MiniMax Music Cover falls short

A few things to set expectations on.

You cannot iterate on a single source clip with subtle parameter changes. The model has no exposed strength knob, no cover-conditioning scale, no guidance type selector. If you want to push the cover closer to or further from the source, the only lever is the prompt. That is enough surface area for most cases but not enough if you are trying to A/B subtle stylistic moves.

You cannot ask for "no autotune" or "no harshness" through a negative prompt. The Music Cover endpoint rejects negative prompts at the server. The workaround is to encode the avoidance in the positive prompt with phrasing like "natural untreated lead vocal, no autotune, dry mix on the lead". This works, but less reliably than a real negative prompt would.

You cannot reliably get an instrumental cover from a source that has vocals. You can ask for one, but the model has been trained to preserve the vocal track's melodic information, so an instrumental request from a vocal source tends to come back with hummed melody lines or stray syllables. If you want a true instrumental cover, ACE-Step is the better model for that path, with empty lyrics and the vocal language set to instrumental.

You cannot extend or generate beyond the source's length. The output runs as long as the source does. If your source clip is three minutes, the cover is three minutes. If you want a four-minute cover from a three-minute source, you need to extend the source first.

When to reach for MiniMax Music Cover and when to reach for ACE-Step

The decision matrix is short.

MiniMax Music Cover wins when you have a finished or near-finished song you want to hear in a different style, when you want one good result instead of fifty variations, and when you want to spend $0.15 instead of thinking about the bill. It is also the more reliable choice for vocal preservation; it has been trained specifically to keep the source vocal's melody and phrasing.

ACE-Step wins when you want to iterate, when you want to control how close the result hugs the source, when you want to experiment with negative prompts, when you need an output longer than the source, when you want a true instrumental version of a vocal track, or when you are working in a non-English language where ACE-Step's lyric alignment is stronger.
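
If you want that matrix in code form, it collapses to one predicate. The criteria below are just the ones from the two paragraphs above, nothing the models document themselves.

    def pick_cover_model(needs_iteration: bool,
                         needs_negative_prompt: bool,
                         needs_longer_than_source: bool,
                         wants_instrumental_from_vocals: bool,
                         non_english_lyrics: bool) -> str:
        """Default to the flat-priced one-shot model unless an ACE-Step-only need shows up."""
        if any([needs_iteration, needs_negative_prompt, needs_longer_than_source,
                wants_instrumental_from_vocals, non_english_lyrics]):
            return "ACE-Step"
        return "MiniMax Music Cover"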

The two models live in the same panel on Z.Tools, and most of my workflow ends up using both. MiniMax for the first faithful cover, ACE-Step for variations once I have a direction I like.

Who should reach for this first

If you have an existing song and you want to hear it in a different genre, MiniMax Music Cover should be the first model you try. The flat price means an unsuccessful experiment costs the same as a successful one, the quality is high, and the prompt-craft skill ceiling is lower than on a model with five extra dials.

If you are still figuring out what direction you want, write three or four 300-character prompts before you upload the source. The prompt is the only steering mechanism, so the work happens before the upload, not during.
