How to write better AI music style prompts

Most prompts I see for AI music generation read like poetry captions. "A dreamy, ethereal soundscape that whispers of midnight rain and distant hopes." That kind of writing is good for an Instagram post. It is bad for a music model, because the model has no idea what instruments to reach for, what tempo to play at, or what the chorus should do differently from the verse.

What works better is the kind of language a music director uses to brief a session player. Specific instruments. A specific tempo. A specific moment in the arrangement. Three or four concrete details that the model can build a track around.

I have written hundreds of prompts for MiniMax Music Cover and ACE-Step v1.5 over the past few months, and the difference between the prompts that produced something useful and the ones that produced ambient mush is consistent enough to write down.

Treat the prompt like a music director's note

A useful music prompt has four axes. You do not need to hit every axis on every prompt, but a prompt that hits two or three almost always beats a prompt that hits zero, regardless of how nice the prose reads.

The first axis is genre and era. Lo-fi hip-hop. Late-70s funk-pop. Mid-80s synthpop. 2000s pop-punk. The narrower you can pin the era, the better. "Pop" produces wildly different results across runs. "Mid-80s synthpop" produces similar results across runs.

The second axis is instrumentation. List the instruments you want to hear. DX7 electric piano. Gated reverb snare. Slap bass. Brass stabs. Hammered dulcimer. The model reaches for the instruments you name explicitly; if you name none, it falls back on the genre's defaults, and the result sounds like a generic version of that genre.

The third axis is vocal direction. Bright male lead. Smoky female mezzo. Gang chorus on the hook. Auto-tuned spoken-word verse. Without a vocal direction, the model defaults to a generic timbre that often sits awkwardly on the source's melody.

The fourth axis is section moments. The moments that make a song feel like a song rather than a loop. Drum fill before the second chorus. Half-time outro. Brass stab on the chorus downbeat. Builds and releases. These do not always survive the model's interpretation, but naming them increases the odds.
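One way to keep the four axes in view while drafting is to hold them in a small structure and join them at the end. The sketch below is only an illustration: `StylePrompt` and its field names are hypothetical and are not part of any model's API. It counts how many axes are filled and renders them as one comma-separated prompt.

```python
from dataclasses import dataclass, field


@dataclass
class StylePrompt:
    """Hypothetical container for the four axes; not part of any model's API."""
    genre_era: str = ""
    instruments: list[str] = field(default_factory=list)
    vocal: str = ""
    moments: list[str] = field(default_factory=list)

    def axes_filled(self) -> int:
        """Count how many of the four axes carry at least one detail."""
        axes = (self.genre_era, self.instruments, self.vocal, self.moments)
        return sum(bool(axis) for axis in axes)

    def render(self) -> str:
        """Join the filled axes into one comma-separated prompt string."""
        parts = [self.genre_era, *self.instruments, self.vocal, *self.moments]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = StylePrompt(
    genre_era="Mid-80s synthpop",
    instruments=["DX7 electric piano", "gated reverb snare"],
    vocal="bright male lead",
    moments=["drum fill before the second chorus"],
)
print(prompt.axes_filled())
print(prompt.render())
```

A draft that reports zero or one filled axis is the kind worth rewriting before spending a generation on it.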

[Figure: four-axis prompt anatomy diagram]

Four worked examples

The cleanest way to internalize this is to see the rewrite process in action.

Example one: a bad prompt for a folk-to-jazz cover

Bad: "A jazzy, soulful version of a quiet folk song with vintage character."

This is the most common failure mode. Every word is fuzzy. "Jazzy" could mean cool jazz, smooth jazz, bebop, or jazz-pop. "Soulful" is not an instrument or an era. "Vintage character" is decoration that gives the model nothing to act on.

Good: "Late-50s cool jazz cover with brushed snare, upright bass walking lines, muted trumpet melody on the verse, soft piano comping on the chorus, no vocals after the first verse, fade out on a sustained piano chord."

This hits all four axes. Era and genre are pinned to a specific decade and subgenre. Four instruments are named explicitly. The vocal direction is "no vocals after the first verse," which the model handles cleanly. Two section moments are called: the trumpet/piano alternation between verse and chorus, and the fade-out ending.

Example two: a bad prompt for an instrumental remix

Bad: "An energetic and uplifting electronic remix that gets you moving."

"Energetic and uplifting" describe how a listener feels. The model needs to know what notes are playing.

Good: "High-energy synthwave remix at 128 BPM, four-on-the-floor kick, sidechained pad, retro analog arpeggio on the verse, big supersaw stab on the chorus, breakdown at 2:30 with filtered drums and a one-bar build, no vocals."

Tempo is pinned. The drum pattern is named. The synth voicings are specific. A structural moment (the breakdown) has a timestamp. The model does not always honor "breakdown at 2:30" exactly, but it consistently produces a breakdown somewhere in the second half when you call for one.

Example three: a bad prompt for a vocal-forward cover

Bad: "A heartfelt singer-songwriter cover with emotional vocals and a stripped-back arrangement."

"Heartfelt" and "emotional" are not directable. Singer-songwriter is closer to useful but still ambiguous.

Good: "Acoustic singer-songwriter cover, single fingerpicked nylon-string guitar, intimate male tenor lead with breath audible, soft brushes on snare on the chorus only, light bowed cello on the bridge, no other instrumentation, dry mix, room reverb on the vocal."

Eight specific things. The mix character ("dry mix, room reverb on the vocal") is included because mix character matters as much as instrumentation for an intimate sound. This kind of prompt produces remarkably consistent results across multiple seeds.

Example four: a bad prompt for genre fusion

Bad: "A unique fusion of jazz and electronic with modern production."

Genre fusion is hard to prompt because the two genres pull in different directions. "Unique" is the laziest possible word here; it tells the model nothing.

Good: "Lo-fi hip-hop with jazz violin, dusty boom-bap drums at 88 BPM, jazzy electric piano comping in seventh chords, occasional muted trumpet stabs, vinyl crackle throughout, melancholy late-night focus mood, no vocals."

The fusion is anchored in lo-fi hip-hop's production conventions, with jazz elements layered on top as specific instruments. "Modern production" disappears because the production language is named directly through "boom-bap drums" and "vinyl crackle."

BPM in the prompt and the BPM slider

ACE-Step exposes a BPM slider in the advanced panel. MiniMax Music Cover does not. On both models, putting the BPM in the prompt itself improves adherence; on ACE-Step, putting it in both the prompt and the slider improves it further.

The model's documented behavior treats BPM as an anchor rather than a strict command. The output usually lands within a few BPM of the requested value. If you ask for 88 BPM, expect 85 to 91. If you ask for 120, expect 117 to 123. This is normally close enough.

Where it matters: BPM values that fall on genre boundaries. 80 BPM is on the slow end of hip-hop and the fast end of trip-hop; the model interprets those neighboring genres differently enough that landing even a few BPM off target can shift the genre. If you are prompting near a boundary, name the genre explicitly in the prompt and pin the BPM in both places.
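The boundary check is simple arithmetic, so here is a rough sketch of it. It assumes a symmetric drift of 3 BPM, which matches the 88 to 85-through-91 example above but is my reading rather than a documented figure. The function names are hypothetical, and the boundary list holds only the 80 BPM case mentioned here.

```python
def expected_bpm_window(requested: int, tolerance: int = 3) -> tuple[int, int]:
    """Return the (low, high) range the output is likely to land in.

    The tolerance of 3 is an assumption taken from the examples in the text.
    """
    return requested - tolerance, requested + tolerance


def boundaries_in_window(requested: int, boundaries: list[int],
                         tolerance: int = 3) -> list[int]:
    """List the genre-boundary tempos that fall inside the expected window."""
    low, high = expected_bpm_window(requested, tolerance)
    return [b for b in boundaries if low <= b <= high]


# Only the hip-hop / trip-hop boundary from the text; add your own as needed.
GENRE_BOUNDARIES = [80]

print(expected_bpm_window(88))                     # (85, 91)
print(boundaries_in_window(82, GENRE_BOUNDARIES))  # [80]
print(boundaries_in_window(88, GENRE_BOUNDARIES))  # []
```

A non-empty result is the signal to name the genre explicitly and pin the BPM in both the prompt and the slider.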

When lyrics carry meaning and when caption text does

The lyrics field and the prompt field have different jobs.

The prompt steers the production: instruments, mix, mood, genre, era, structural moments. The lyrics field steers the words and the section structure.

For a from-scratch ACE-Step generation with vocals, the lyrics field is mandatory. Empty lyrics produce instrumental output. Bracket-tag-only lyrics with no actual words also produce instrumental output. If you want the model to sing, you have to give it words.

For a MiniMax Music Cover, the lyrics field is optional and behaves differently. Empty lyrics tell the model to use the source vocal's melody and produce vocals in the new style. A lyric skeleton with the retain-source pattern (the explicit "Keep the original lyrics and phrasing" hint inside section tags) reinforces this behavior. Custom lyrics in the lyrics field replace the source vocal's words entirely.
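The rules for the lyrics field can be written down as a small decision function. This is a sketch of the behavior described above, not code from either model. The model labels `"ace-step"` and `"minimax-cover"` are hypothetical, and the bracket-tag syntax such as `[verse]` is an assumption about how section tags are written.

```python
import re

# Assumed tag syntax: anything in square brackets, e.g. [verse] or [chorus].
TAG_PATTERN = re.compile(r"\[[^\]]*\]")
RETAIN_HINT = "keep the original lyrics and phrasing"


def has_sung_words(lyrics: str) -> bool:
    """True if any text remains once bracket tags are stripped out."""
    return bool(TAG_PATTERN.sub("", lyrics).strip())


def expected_vocal_behavior(model: str, lyrics: str) -> str:
    """Describe what the lyrics field is expected to do for each model."""
    if model == "ace-step":
        # Empty or tag-only lyrics give an instrumental.
        return "sung vocals" if has_sung_words(lyrics) else "instrumental"
    if model == "minimax-cover":
        if not lyrics.strip():
            return "source melody, vocals in the new style"
        if RETAIN_HINT in lyrics.lower():
            return "source lyrics retained"
        return "custom lyrics replace the source words"
    raise ValueError(f"unknown model label: {model!r}")


print(expected_vocal_behavior("ace-step", "[verse]\n[chorus]"))
print(expected_vocal_behavior("minimax-cover", ""))
```

The first call reports an instrumental because the lyrics contain tags but no words, which is the case that most often surprises people.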

[Figure: bad-to-good prompt rewrite examples]

Common failure modes

A short list of mistakes I keep seeing in other people's prompts:

The first is adjective stacking. "Bright, vibrant, energetic, uplifting, modern, fresh, dynamic." None of these tell the model what to play. Cut the adjectives and add nouns.

The second is describing how the listener feels instead of what the song does. "A song that makes you cry" is not a prompt. "A slow piano ballad with a single sustained string pad and a vocal mixed slightly quieter than the piano" is.

The third is mood-only prompts. "Melancholy" is fine as one piece of a prompt. "Melancholy" as the entire prompt produces ambient piano slop.

The fourth is fighting the genre. If you prompt for "country with dubstep wobble bass" you usually get one of the two genres dominating and the other reading as decoration. If you want a genuine fusion, name the production conventions of one genre and the instrumental signatures of the other.

The fifth is ignoring the character limit. MiniMax's 300-character ceiling is real. Prompts that go over are truncated, often mid-clause, which produces stranger results than a tight prompt would.
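Three of these mistakes are mechanical enough to catch before you hit generate. The sketch below is a rough heuristic, not a real linter. The word list is drawn from the bad examples in this article, the thresholds (three vague words, two words minimum) are my own assumptions, and the 300-character default is the MiniMax ceiling mentioned above.

```python
import re

# Vague words taken from the bad prompts in this article; extend as needed.
VAGUE_WORDS = {
    "bright", "vibrant", "energetic", "uplifting", "modern", "fresh",
    "dynamic", "unique", "heartfelt", "emotional", "soulful",
}


def lint_prompt(prompt: str, char_limit: int = 300) -> list[str]:
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    words = re.findall(r"[a-z']+", prompt.lower())

    # Adjective stacking: three or more vague words (assumed threshold).
    vague = [w for w in words if w in VAGUE_WORDS]
    if len(vague) >= 3:
        warnings.append("adjective stacking: " + ", ".join(vague))

    # Mood-only prompts tend to be one or two words long (assumed threshold).
    if len(words) <= 2:
        warnings.append(
            "too short to direct: add instruments, tempo, or a section moment"
        )

    # Over-limit prompts get truncated, often mid-clause.
    if len(prompt) > char_limit:
        over = len(prompt) - char_limit
        warnings.append(f"{over} characters over the {char_limit}-character limit")

    return warnings


print(lint_prompt("Bright, vibrant, energetic, uplifting, modern, fresh, dynamic."))
print(lint_prompt("Melancholy"))
```

It cannot tell you whether a prompt is good, only whether it trips one of the obvious wires. A single "bright" in "bright male lead" passes, which is the intent.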

A prompt-writing exercise

If you read this far and want to test it, write three prompts for the same song idea: one in poetry style, one in director-note style, and one with all four axes filled in. Run each on ACE-Step Turbo (about $0.03 per generation) and listen back-to-back.

The director-note version usually beats the poetry version by enough that you stop writing the poetry version. The four-axis version usually beats the director-note version by a smaller margin, and that margin matters most when you are trying to commit to a final result.

AI Audio to Audio · Z.Tools

Reimagine an existing audio track in a new style — covers, remixes, and music transformations. Powered by MiniMax Music Cover and ACE-Step v1.5.