Multilingual AI covers: what the vocal language picker actually does

ACE-Step exposes 18 explicit vocal languages and claims 50+ in the underlying model. The two numbers do not contradict each other; the realistic picture is more layered.


ACE-Step v1.5's vocal language picker exposes 18 languages on the audio-to-audio panel: English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian, Hindi, Arabic, Turkish, Polish, Dutch, Indonesian, Thai, Vietnamese, plus an "Instrumental / Auto" option. The model team's marketing claims 50+ languages in the underlying model. The two numbers do not contradict each other; the 18 are the languages the team has pinned to specific phonetic configurations, and the 50+ are languages the model can attempt with mixed results.

What this means in practice is that "ACE-Step supports 50+ languages" is true and misleading. The realistic picture is closer to: a small set of languages where the lyric phonetics land cleanly, a larger set where they land in a recognizable but rougher form, and a long tail where the model produces something that sounds linguistically plausible but is not actually the language you asked for.

This article maps the picture I have built up after testing ACE-Step in 12 of those languages.

What the vocal language setting does technically

ACE-Step's diffusion synthesizes audio conditioned on the text prompt and, when present, the lyrics. The vocal language setting tells the vocal-synthesis pipeline which phonetic system to use when interpreting the lyrics. English with Latin-script lyrics goes to the English phonetic model; Chinese with Chinese-script lyrics goes to the Mandarin phonetic model. The mapping is per-language, and each phonetic model is trained on a different slice of the training corpus.
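The routing described above can be sketched in a few lines. This is an illustrative model only, assuming a per-language lookup with a best-guess fallback; the config names and the function are hypothetical, not ACE-Step's actual internals:

```python
# Illustrative sketch only: a per-language lookup with a fallback for the
# long tail. All names here are hypothetical, not ACE-Step's real code.
PHONETIC_CONFIGS = {
    "English": "en_phonemes",
    "Chinese": "mandarin_phonemes",
    "Japanese": "ja_phonemes",
    # ... one entry per explicitly supported language (hypothetical names)
}

def route_lyrics(vocal_language):
    if vocal_language == "Instrumental / Auto":
        return "skip_vocal_pipeline"  # no phonetic interpretation at all
    # Languages without a pinned config fall through to a best guess.
    return PHONETIC_CONFIGS.get(vocal_language, "best_guess")
```

The key structural point is the fallback: the 18 explicit languages hit a pinned configuration, and everything else lands in the `best_guess` branch, which is where the "50+ languages" claim lives.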

The "Instrumental / Auto" option skips this pipeline entirely. It is the right choice when you want a track with no vocals; it is the wrong choice when you want a vocal track in a specific language and you have left the field at default.

Per-language quality, my take

After roughly 200 generations split across 12 languages, here is what I would tell a producer about each tier.

Strong tier (use without reservation): English, Mandarin Chinese, Japanese, Spanish.

These are the four languages where ACE-Step's lyric alignment is reliably good. Phonetics land. The model handles Mandarin tones and Japanese pitch accent without smearing. Spanish vocal phrasing matches the natural cadence of the language. English is the model's strongest language overall, but the gap to Mandarin and Japanese is small, and noticeably smaller than the gap on Suno v5.

Solid tier (use with one or two prompt tweaks): Korean, French, German, Italian, Portuguese, Hindi.

These languages produce competent results, and the phonetics are recognizable. Korean shows occasional smearing on dense consonant clusters. French nasal vowels can sound slightly off when the BPM is high. German consonants can feel softer than native speech. Hindi handling is among the best for South Asian languages on any AI music model I have tested.

The fix for the small-but-noticeable issues in this tier is usually to lower the BPM slightly or raise the steps count on ACE-Step Base. The model handles these languages well at slower tempos and with more refinement passes.

Acceptable tier (use for prototypes, expect rework): Russian, Arabic, Turkish, Polish, Dutch, Indonesian, Thai, Vietnamese.

The model produces output in these languages but the quality is noticeably below the strong tier. Russian vowel handling is decent but consonants smear at speed. Arabic emphatic consonants are flattened. Tonal handling in Vietnamese and Thai is unreliable. Indonesian and Dutch are fine for slower ballads and rough on fast genres.

Use these for prototype-level work where you want to test a direction before committing. For final production in any of these languages, expect to fix the vocal phonetics in post-production or to re-record the vocal yourself.

Long tail (the 50+ claim): every other language the model can attempt.

These do not have a dedicated phonetic configuration. The model takes its best guess based on what it learned from training data. Output ranges from "linguistically plausible but unintelligible" to "actually decent on simple prompts." For any language outside the 18 explicit options, plan to test before committing.

Per-language quality tiers for ACE-Step vocal output

When language and lyric script disagree

A common failure mode: the lyrics field has Mandarin text in Chinese script, but the vocal language picker is set to English. The model receives contradictory signals. Output usually comes back with English-phonetic interpretation of the Chinese characters, which is typically gibberish.

The fix is mechanical: match the picker to the script. Mandarin lyrics in Chinese script → vocal language Chinese. Japanese lyrics in Hiragana/Katakana → vocal language Japanese. The picker does not auto-detect; you have to set it explicitly.
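The mechanical fix can be approximated in code. This heuristic helper is my own, not part of ACE-Step; it inspects the script of the lyrics and suggests a picker value. Kana is checked before Han characters, since Japanese text mixes kanji with kana:

```python
import unicodedata

# Heuristic helper (not part of ACE-Step): suggest a picker value from the
# dominant script of the lyrics. Kana wins over Han because Japanese text
# mixes kanji with kana, while pure-Han text is most likely Chinese.
def suggest_vocal_language(lyrics):
    counts = {"kana": 0, "hangul": 0, "han": 0, "latin": 0}
    for ch in lyrics:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name:
            counts["kana"] += 1
        elif "HANGUL" in name:
            counts["hangul"] += 1
        elif "CJK UNIFIED" in name:
            counts["han"] += 1
        elif "LATIN" in name:
            counts["latin"] += 1
    if counts["kana"]:
        return "Japanese"
    if counts["hangul"]:
        return "Korean"
    if counts["han"]:
        return "Chinese"
    return "ambiguous: Latin script, set the picker explicitly"
```

Latin script is deliberately left ambiguous: a character-level heuristic cannot distinguish English from Spanish or Indonesian, which is exactly why the picker has to be set by hand for those languages.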

The reverse case is more interesting. If you have Spanish lyrics in English transliteration ("ola, koh-mo es-tahs" instead of "hola, ¿cómo estás?"), the picker setting determines what the model does. Set to Spanish, the model takes the lyrics phonetically and produces something that sounds Spanish-adjacent. Set to English, the model treats them as English nonsense words. Neither produces clean Spanish output; the cleanest path is always to write the lyrics in the target language's native script.

A worked example: Mandarin cover

A specific case I tested recently: a cover of an English folk song reimagined with Mandarin lyrics. Source audio was a 90-second acoustic guitar demo with English vocals.

I used ACE-Step Base with a source clip. Strength 0.5 to allow creative interpretation. Vocal language set to Chinese. Lyrics field with [Verse] and [Chorus] tags wrapping Mandarin lyric content in Chinese script. Style prompt:

Mandarin folk-pop cover, female lead with bright Cantonese-influenced phrasing, light fingerpicked acoustic guitar, soft strings on the chorus, melancholy mood, traditional Chinese flute on the bridge, mid-tempo.

The result on the first generation was a faithful Mandarin cover with clean vowel handling and recognizable consonants. The flute on the bridge appeared. The female lead's cadence matched typical Mandarin pop phrasing rather than English transposed-into-Mandarin phrasing, which is the failure mode I was watching for.

This is what the multilingual story looks like when it works: the picker, the script, and the prompt all line up.
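The run above, condensed into a settings dict for reference. The field names and the source filename are hypothetical placeholders, not ACE-Step's actual API; the values are the ones from the run:

```python
# Hypothetical settings object mirroring the Mandarin-cover run described
# above. Field names and the filename are placeholders, not ACE-Step's API.
settings = {
    "model": "ACE-Step Base",
    "mode": "audio-to-audio",
    "source_clip": "acoustic_demo_90s.wav",   # placeholder filename
    "strength": 0.5,               # room for creative interpretation
    "vocal_language": "Chinese",   # matches the Chinese-script lyrics
    "lyrics": "[Verse]\n...\n[Chorus]\n...",  # Mandarin, native script
    "style_prompt": (
        "Mandarin folk-pop cover, female lead with bright "
        "Cantonese-influenced phrasing, light fingerpicked acoustic "
        "guitar, soft strings on the chorus, melancholy mood, "
        "traditional Chinese flute on the bridge, mid-tempo"
    ),
}
```

The alignment to check before generating is the one the article keeps returning to: `vocal_language`, the lyric script, and the style prompt all pointing at the same language.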

What MiniMax does differently

MiniMax Music Cover handles language differently. There is no exposed vocal language picker; the model auto-detects from the lyrics field content. Lyrics in Chinese script → Mandarin output. Lyrics in Latin script → English output by default, with some heuristic for Spanish, French, and German script characteristics.

This works cleanly for the major languages and falls apart for less common ones. MiniMax's coverage of Hindi, Arabic, and Vietnamese is noticeably weaker than ACE-Step's. The auto-detect approach is more user-friendly; the explicit-picker approach is more controllable. Different design choices, different tradeoffs.

The practical takeaway: for non-English work in a language where ACE-Step has a strong-tier configuration (Mandarin, Japanese, Spanish), ACE-Step is the more reliable choice over MiniMax. For Korean, French, German, Italian, Portuguese, and Hindi, the two models are roughly comparable. For everything else, ACE-Step's wider language exposure makes it the safer first try, even though the quality floor is uneven.

A small note on regional variants

ACE-Step's "Chinese" picker handles Mandarin Chinese. Cantonese, Wu, and other Sinitic languages are not separately configured; they fall into the long tail. The same applies for Spanish (Iberian Spanish is the default; Mexican, Colombian, Argentine variants come out close but not always natural) and Arabic (Modern Standard Arabic is the strongest variant; regional dialects vary).

For any project where the regional variant matters, expect some rework on the vocal phonetics. The model has not been trained to distinguish regional accent variants the way a human session vocalist would.

A closing rule

Two-line rule for picking the vocal language setting:

If your lyrics are in one of the 18 explicit languages, set the picker to that language and write the lyrics in that language's native script. If your lyrics are in any other language, set the picker to the closest one in the 18-list, expect rougher phonetics, and plan to test before committing.
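The two-line rule is simple enough to write down as a function. A minimal sketch, assuming the caller supplies the closest explicit language for long-tail cases (the function and its return shape are mine, not part of any tool):

```python
# The 18 explicit options ACE-Step's picker exposes.
EXPLICIT_LANGUAGES = {
    "English", "Chinese", "Japanese", "Korean", "Spanish", "French",
    "German", "Italian", "Portuguese", "Russian", "Hindi", "Arabic",
    "Turkish", "Polish", "Dutch", "Indonesian", "Thai", "Vietnamese",
}

def pick_vocal_language(lyric_language, closest_explicit=None):
    """Apply the two-line rule: return (picker_value, advice)."""
    if lyric_language in EXPLICIT_LANGUAGES:
        return lyric_language, "write lyrics in the native script"
    if closest_explicit in EXPLICIT_LANGUAGES:
        return closest_explicit, "expect rougher phonetics; test first"
    raise ValueError("choose the closest of the 18 explicit languages")
```

Note that the long-tail branch cannot be automated away: deciding which of the 18 is "closest" to, say, Cantonese or Ukrainian is a judgment call the producer has to make.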

The picker is the most-overlooked control on ACE-Step's panel. Setting it correctly takes one click and determines whether the vocal lands or smears.
