Instrumental prompts for ACE-Step: keeping the vocals out
A common failure mode I see when producers reach for ACE-Step v1.5 to generate instrumental music: the output comes back with humming, syllables, or a buried vocal line that does not belong there. The prompt asked for an instrumental track. The model heard "track with vocals not foregrounded."
ACE-Step is a music model trained primarily on tracks with vocals, so its default assumption is that vocals belong somewhere. Getting a clean instrumental from it requires a few specific moves. None of them are obvious from the UI.
The two dials that actually keep vocals out
Two settings need to line up.
The first is the lyrics field. Empty lyrics tell ACE-Step to generate without singing. A lyrics field with bracket tags only and no text produces inconsistent results; sometimes you get an instrumental, sometimes you get a vocalise. The safest pattern is a fully empty lyrics field, no tags, no whitespace.
The second is the vocal language picker. ACE-Step exposes 18 explicit vocal language options plus an instrumental option labeled "Instrumental / Auto." That option is the one you want for a clean instrumental request. It tells the model to skip the vocal synthesis pipeline entirely rather than treating "no lyrics" as "lyrics in some default language I should still attempt."
When both are set, ACE-Step produces instrumental tracks reliably. When only one is set, the result drifts.
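The two dials can be folded into a pre-generation check. This is a sketch against a hypothetical settings object, not the actual ACE-Step API; the field names `lyrics` and `vocal_language` are my own stand-ins for the UI fields:

```python
def is_clean_instrumental_setup(lyrics: str, vocal_language: str) -> bool:
    """Return True only when both dials line up for a vocal-free result.

    Hypothetical helper: `lyrics` stands in for the UI's lyrics field and
    `vocal_language` for the language picker; neither name comes from the
    real ACE-Step interface.
    """
    # Safest pattern: a fully empty lyrics field, no tags, no whitespace.
    lyrics_empty = lyrics == ""
    # The picker must be on the instrumental option, not a specific language.
    language_ok = vocal_language == "Instrumental / Auto"
    return lyrics_empty and language_ok
```

Whitespace-only or tag-only lyrics deliberately fail the check, matching the inconsistent results they produce in practice.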
Prompt language that signals instrumental
The prompt itself can either help or hurt the cause. Three patterns I have found genuinely move the needle:
Naming the instruments and not the singers. A prompt that lists piano, drums, bass, and synth pad and never mentions a vocal element gives the model less reason to add one. A prompt that says "no vocals" while also describing "the song" or "the track" leaves a small opening, because both nouns suggest a piece built around a vocal.
Including the literal phrase "no vocals." Stated inside the positive prompt, it is a stronger signal than the empty lyrics field alone. Belt and suspenders.
Avoiding genre prompts that imply vocals. A "pop" or "ballad" or "folk song" prompt makes the model expect vocals because the training data for those genres almost always has vocals. "Underscoring," "BGM," "instrumental score," "scoring music," and "video game music" are genres where instrumental versions are common, and the model has internalized that.
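The three patterns can be rolled into a small prompt lint. The term list below is an illustrative guess, not anything exported by ACE-Step, and the substring matching is deliberately naive:

```python
# Illustrative word list; extend it for your own genre vocabulary.
VOCAL_IMPLYING = ("pop", "ballad", "folk song", "singer", "the song")

def lint_instrumental_prompt(prompt: str) -> list[str]:
    """Flag prompt patterns that give the model an opening to add vocals.

    Naive substring matching; good enough for a pre-flight sanity check.
    """
    p = prompt.lower()
    warnings = []
    if "no vocals" not in p:
        warnings.append('missing the literal phrase "no vocals"')
    for term in VOCAL_IMPLYING:
        if term in p:
            warnings.append(f'vocal-implying term: "{term}"')
    return warnings
```

An empty return value means the prompt passes all three checks; each warning names the pattern that needs fixing.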
Three worked prompts
I tested each of these with an empty lyrics field and the vocal language set to "Instrumental / Auto." All three produced clean instrumental output across multiple seeds.
Prompt one: lo-fi study music
Lo-fi hip-hop instrumental, dusty boom-bap drums at 88 BPM, jazzy electric piano comping in seventh chords, mellow upright bass walking lines, vinyl crackle, late-night focus mood, no vocals.
What worked: the genre is named in a way that fits instrumentals. Two harmonic instruments (electric piano, upright bass) plus drums plus a textural element (vinyl crackle). The mood is described without reference to song or vocal character. "No vocals" at the end as belt-and-suspenders.
Prompt two: synthwave underscoring for a video
Mid-80s synthwave underscoring, 110 BPM, retro analog arpeggio, gated reverb snare, sidechained pad, four-on-the-floor kick, cinematic synth melody on top, instrumental score, no vocals.
What worked: "underscoring" frames it as a music-for-video purpose, where instrumentals are common. "Cinematic synth melody on top" describes the lead element so the model does not feel the need to add a vocal lead.
Prompt three: orchestral video game scoring
Orchestral video game scoring, slow strings rising, soft horns, timpani swells on the climax, harp arpeggios in the verse, no vocals, no choir.
What worked: "video game scoring" carries instrumental connotations. The instruments are named and arranged across sections so the model has a full palette without needing a vocal to fill space. The double negative "no vocals, no choir" cuts off two adjacent failure modes; without "no choir," orchestral scoring prompts sometimes come back with wordless choral pads.
What does not work
Three patterns that look like they should help but do not.
Saying "instrumental" once in a long prompt. If the prompt has 200 characters of vocal-implying language and the word "instrumental" buried somewhere in the middle, the model treats it as one feature among many rather than a global instruction. "Instrumental" needs to land near the front of the prompt or near the end, not in the middle.
Leaving lyrics filled with bracket tags only. [Verse] and [Chorus] on their own lines with no actual lyrics seems like it should produce an instrumental with section markers. In practice, the model often interprets those tags as cues to add vocalise or hummed lines that follow the section structure. A fully empty lyrics field is more reliable.
Setting the vocal language to a specific language while requesting an instrumental. "English" plus an instrumental request gives the model contradictory signals. The vocal-synthesis pipeline runs, then has to be silenced, and what often comes through is a quiet but audible vocal layer.
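All three anti-patterns are mechanical enough to catch automatically. A sketch, with the "buried in the middle" test as my own rough heuristic for what counts as front or end:

```python
import re

def preflight_issues(prompt: str, lyrics: str, vocal_language: str) -> list[str]:
    """Flag the three patterns that look helpful but are not.

    Hypothetical helper; the middle-third threshold is a rough guess,
    not a documented ACE-Step behavior.
    """
    issues = []
    # Bracket-tag-only lyrics ([Verse], [Chorus]) often trigger vocalise.
    without_tags = re.sub(r"\[[^\]]*\]", "", lyrics).strip()
    if lyrics.strip() and not without_tags:
        issues.append("lyrics contain only bracket tags; leave the field fully empty")
    # A specific language contradicts an instrumental request.
    if vocal_language != "Instrumental / Auto":
        issues.append("vocal language is a specific language, not the instrumental option")
    # "instrumental" buried mid-prompt reads as one feature among many.
    idx = prompt.lower().find("instrumental")
    if idx != -1:
        pos = idx / len(prompt)
        if 0.33 < pos < 0.66:
            issues.append('"instrumental" is buried mid-prompt; move it to the front or end')
    return issues
```

A clean setup returns an empty list; each issue maps to one of the three failure patterns above.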
A note on MiniMax Music Cover
MiniMax Music Cover is the wrong tool for this job. The model is trained to preserve the source vocal's melodic line, so an instrumental request from a vocal source typically comes back with hummed or sung-syllable melody lines that follow the original singer's contour. Use a stem splitter, or ACE-Step from scratch.
A 30-second checklist
Before you generate an instrumental on ACE-Step:
1. The lyrics field is empty: no text, no tags, no whitespace.
2. The vocal language is set to "Instrumental / Auto."
3. The prompt names instruments rather than song elements.
4. The phrase "no vocals" appears in the prompt at least once.
5. The genre or purpose framing (underscoring, BGM, scoring, video game music) implies an instrumental context.
If all five are true, the output is reliably vocal-free. Skip any of them and the model gets a small opening to add a singer. The model will often take it.
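The five checks fold into one function. Everything here is a hypothetical helper; the instrument and framing lists are illustrative, not drawn from ACE-Step itself:

```python
# Illustrative lists; extend them for your own palette and genres.
INSTRUMENTS = ("piano", "drums", "bass", "synth", "strings", "horns", "harp")
FRAMINGS = ("instrumental", "underscoring", "bgm", "scoring", "video game music")

def instrumental_checklist(prompt: str, lyrics: str, vocal_language: str) -> dict:
    """Run the five-point pre-generation checklist; all values must be True."""
    p = prompt.lower()
    return {
        "lyrics_empty": lyrics == "",
        "language_is_instrumental": vocal_language == "Instrumental / Auto",
        "names_instruments": any(i in p for i in INSTRUMENTS),
        "says_no_vocals": "no vocals" in p,
        "instrumental_framing": any(f in p for f in FRAMINGS),
    }
```

Run it before generating; any False value is the small opening the model will often take.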