Bracket tags and lyrics format for AI music models



The most common reason an AI music model produces a confused vocal track is that the lyrics field was filled with prose paragraphs instead of structured sections. Models like MiniMax Music Cover and ACE-Step v1.5 do not parse English-language song structure from how the words look on a page. They look for explicit bracket tags, and without them the model picks an interpretation that is rarely the one you wanted.

This is one of those rules that sounds obvious once you know it and is invisible until you do.

What the standard tag set looks like

Both MiniMax Music Cover and ACE-Step v1.5 accept the same core set of section tags, in square brackets, on their own lines:

[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]

ACE-Step's tutorial documents a wider set that the model has been trained on:

[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Build]
[Drop]
[Instrumental]

The minimum useful structure is [Verse] and [Chorus]. Everything else is optional and depends on the song shape you want.

Why prose paragraphs fail

A common failure mode looks like this:

This is the first verse where I'm telling the story
about a friend I knew who lived down the road
in the kind of town where the streetlights glow
and the wind comes off the lake at night

This is the chorus where everything lifts up
and the harmonies come in around me
and we sing about the feeling of going home

This looks like a song to a human reader. To the model, it looks like a single block of text with no section information. The model decides where the verse ends and the chorus begins by guessing, and the guess is rarely what you intended. The melody contour of the chorus does not lift, the dynamics do not shift, and the instrumentation does not change between the two sections.

The fix is mechanical: add the bracket tags.

[Verse]
This is the first verse where I'm telling the story
about a friend I knew who lived down the road
in the kind of town where the streetlights glow
and the wind comes off the lake at night

[Chorus]
This is the chorus where everything lifts up
and the harmonies come in around me
and we sing about the feeling of going home

The model now knows where to apply chorus-level dynamics, where to bring in chorus harmonies if the prompt asks for them, and where to layer extra production touches.
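A quick pre-flight check can catch an untagged lyrics field before it is submitted. The sketch below is a minimal illustration, assuming a tag is any line made up only of text in square brackets; the name has_section_tags is hypothetical and not part of either model's tooling.

```python
import re

# Assumption: a section tag is a line that contains nothing but [Something].
TAG_LINE = re.compile(r"^\s*\[[^\[\]]+\]\s*$")


def has_section_tags(lyrics: str) -> bool:
    """Return True if at least one line is a standalone bracket tag."""
    return any(TAG_LINE.match(line) for line in lyrics.splitlines())


prose = "This is the first verse\nabout a friend I knew"
tagged = "[Verse]\nThis is the first verse\nabout a friend I knew"

print(has_section_tags(prose))   # False
print(has_section_tags(tagged))  # True
```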

Figure: Prose vs bracket-tag lyric formatting

Section length and syllable count

A practical guideline from ACE-Step's musicians guide: aim for 6 to 10 syllables per line within a section, with reasonably consistent line lengths inside each section. The model handles uneven line lengths but tends to compress or stretch syllables to fit when the variance is high. If a verse line is 8 syllables and the next line is 22, the long line probably gets truncated or rushed.

Within a song, sections do not need to match each other in length. A verse with 8-syllable lines can sit next to a chorus with 6-syllable lines without confusing the model.
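The 6 to 10 syllable guideline can be roughly checked in code. The sketch below uses a crude vowel-group heuristic for English, so its counts are estimates and will be off for some words; the function names and the default thresholds are illustrative assumptions, not something either model provides.

```python
import re


def estimate_syllables(line: str) -> int:
    """Rough English syllable estimate: count vowel groups per word.

    This is a heuristic, not a phonetic count. Silent vowels and
    some diphthongs will be miscounted.
    """
    total = 0
    for word in re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower()):
        count = len(re.findall(r"[aeiouy]+", word))
        # Drop a trailing silent "e" (as in "lake"), but keep "-le" endings.
        if word.endswith("e") and not word.endswith("le") and count > 1:
            count -= 1
        total += max(count, 1)  # every word counts as at least one syllable
    return total


def flag_lines(section_lines, low=6, high=10):
    """Return (line, estimate) pairs whose estimate falls outside [low, high]."""
    flagged = []
    for line in section_lines:
        n = estimate_syllables(line)
        if n < low or n > high:
            flagged.append((line, n))
    return flagged
```

Because the estimate is approximate, it is better used to spot large outliers, such as a 22-syllable line next to 8-syllable lines, than to enforce the range exactly.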

Stacking modifiers

ACE-Step accepts modifier syntax like [Chorus - anthemic] or [Verse - intimate]. These are not part of the original training tag set, but the model has learned to read them as additional context.

The team's own guidance is to avoid stacking too many modifiers on a single tag. [Chorus - anthemic - layered - distorted] produces less consistent results than [Chorus - anthemic]. If you need to convey several attributes about a section, the better path is to put the descriptors in the prompt rather than in the lyric tag.

MiniMax Music Cover does not document modifier syntax explicitly, and in practice plain [Chorus] is the safer choice on that model.
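To keep modifier stacking in check, a tag can be split into its section name and modifiers before submission. This is a sketch under two assumptions: modifiers are separated by " - " (a hyphen with spaces on both sides), as in the examples above, and one modifier per tag is the limit. The source only says to avoid "too many", so the limit is a judgment call, and the function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"^\[([^\[\]]+)\]$")


def split_tag(tag_line: str):
    """Split '[Chorus - anthemic]' into ('Chorus', ['anthemic'])."""
    match = TAG_RE.match(tag_line.strip())
    if match is None:
        raise ValueError(f"not a bracket tag: {tag_line!r}")
    # Split on " - " only, so a hyphenated name like "Pre-chorus" stays whole.
    section, *modifiers = [part.strip() for part in match.group(1).split(" - ")]
    return section, modifiers


def too_many_modifiers(tag_line: str, limit: int = 1) -> bool:
    """Return True if the tag carries more than `limit` modifiers (assumed limit)."""
    return len(split_tag(tag_line)[1]) > limit


print(split_tag("[Chorus - anthemic]"))
print(too_many_modifiers("[Chorus - anthemic - layered - distorted]"))
```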

The retain-source pattern for MiniMax covers

MiniMax Music Cover has a pattern unique to its cover use case. When you want the cover to keep the source vocal's words and phrasing rather than replace them, the convention is to write a section skeleton with a short hint inside each section:

[Intro]
[Verse]
Keep the original lyrics and phrasing from the source vocal.
[Chorus]
Keep the original lyrics and phrasing from the source vocal.
[Bridge]
[Outro]

This is more reliable than leaving the lyrics field empty when you want a faithful cover. Empty lyrics work too, but the explicit retain hint produces more consistent vocal preservation across multiple generations.
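If the skeleton is generated often, it can be built by a small helper. The sketch below reproduces the example above; the function name, its parameters, and the choice to put the hint only in the verse and chorus are assumptions taken from that example, not a documented MiniMax interface.

```python
RETAIN_HINT = "Keep the original lyrics and phrasing from the source vocal."


def retain_source_skeleton(
    sections=("Intro", "Verse", "Chorus", "Bridge", "Outro"),
    vocal_sections=("Verse", "Chorus"),
) -> str:
    """Build a lyrics-field skeleton with a retain hint in the vocal sections."""
    lines = []
    for name in sections:
        lines.append(f"[{name}]")  # each tag goes on its own line
        if name in vocal_sections:
            lines.append(RETAIN_HINT)
    return "\n".join(lines)


print(retain_source_skeleton())
```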

Common mistakes that look like the model failing

I repeatedly see three patterns that get blamed on the model when the lyrics format is the real cause.

The first is mixed prose and tags: a lyric field that starts with a [Verse] tag, then has a paragraph break, then has another paragraph without a tag. The model treats the second paragraph as a continuation of the verse, regardless of what you intended.

The second is brackets without line breaks. Writing "[Verse] First line of the verse here [Chorus] First line of the chorus here" on a single line confuses the parser. Bracket tags need to be on their own lines.

The third is misspelled or non-standard tags. [verse 1] works less reliably than [Verse]. [Pre-chorus] works on ACE-Step (which has been trained on it) but inconsistently on MiniMax. Stick to the standard set when possible.
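All three mistakes can be detected mechanically. The linter below is a minimal sketch, assuming the five core tags are the "standard set" and that a paragraph break is a blank line; lint_lyrics and its warning strings are hypothetical names chosen for illustration.

```python
import re

# Assumption: the core set accepted by both models is the "standard" set.
STANDARD_TAGS = {"Intro", "Verse", "Chorus", "Bridge", "Outro"}
TAG_LINE = re.compile(r"^\[([^\[\]]+)\]$")  # a tag alone on its line
ANY_TAG = re.compile(r"\[[^\[\]]+\]")       # a tag anywhere in a line


def lint_lyrics(lyrics: str):
    """Return (line_number, warning) pairs for the three common mistakes."""
    warnings = []
    prev_blank = False    # was the previous line blank?
    seen_tag = False      # has any standalone tag appeared so far?
    last_was_tag = False  # was the last non-blank line a standalone tag?
    for number, raw in enumerate(lyrics.splitlines(), start=1):
        line = raw.strip()
        if not line:
            prev_blank = True
            continue
        match = TAG_LINE.match(line)
        if match:
            seen_tag = True
            if match.group(1) not in STANDARD_TAGS:
                warnings.append((number, "non-standard tag"))
        elif ANY_TAG.search(line):
            warnings.append((number, "tag shares a line with lyrics"))
        elif prev_blank and seen_tag and not last_was_tag:
            # New paragraph after a tagged section, with no tag of its own.
            warnings.append((number, "untagged paragraph"))
        prev_blank = False
        last_was_tag = match is not None
    return warnings
```

The tag comparison is case-sensitive on purpose, so [verse 1] and [verse] are both reported. A tag such as [Pre-chorus] is also reported, which matches the advice to stick to the standard set; drop that check if you target ACE-Step only.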

A 30-second mental model

The lyrics field is structured input, not free-form text. Bracket tags are mandatory section dividers. Lines within a section are the lyrical content. Everything else (melody, dynamics, harmony, production) comes from the prompt.

Treat the lyrics field the way you would treat a JSON file with a strict schema. The structure is what the model reads first; the words inside are what it sings second.
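To make the schema analogy concrete, the lyrics field can be parsed into an explicit structure and serialized as JSON. This is an illustrative sketch only: neither model is stated to accept JSON, and the "tag" and "lines" keys and the function names are assumptions made for the example.

```python
import json
import re

TAG_LINE = re.compile(r"^\[([^\[\]]+)\]$")


def parse_sections(lyrics: str):
    """Parse tagged lyrics into a list of {"tag": ..., "lines": [...]} dicts."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no structure of their own
        match = TAG_LINE.match(line)
        if match:
            sections.append({"tag": match.group(1), "lines": []})
        elif sections:
            sections[-1]["lines"].append(line)
        else:
            # Strict schema: lyrics must not appear before the first tag.
            raise ValueError("lyric line appears before any section tag")
    return sections


def to_json(lyrics: str) -> str:
    """Serialize the parsed structure, mainly for inspection."""
    return json.dumps(parse_sections(lyrics), indent=2)


print(to_json("[Verse]\nabout a friend I knew\n\n[Chorus]\ngoing home"))
```

Rejecting lyrics that start without a tag is a deliberately strict choice here; it turns the "guessing" failure described earlier into an error you see before generation.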
