ACE-Step v1.5 review: open-source AI music after Suno v5
ACE-Step 1.5 went live on January 28, 2026. The headline number, a SongEval score of 8.09, beats Suno v5. The headline architecture, a 3.5 billion parameter open-source model that runs on a $400 graphics card, sounds even more disruptive. Both claims are real. Neither tells the whole story.
I have spent about a month with ACE-Step routed through the audio-to-audio panel on Z.Tools, and I think it is the most interesting thing to happen to text-to-music in 2026. The benchmark headline is a little misleading. The practical experience depends a lot on which output you care about most.
What ACE-Step is, in plain language
A joint project from ACE Studio and StepFun. Open source. Released as v1 in May 2025, v1.5 in January 2026. The 1.5 release is the one worth paying attention to.
Architecturally, it splits the work between two specialists. A Language Model takes a short user query and turns it into what the team calls a song blueprint, which is a structured plan for tempo, key, instrument list, mood, and section layout. A Diffusion Transformer then synthesizes the audio from that plan. The two pieces communicate through chain-of-thought rather than through a single fused embedding, which is why the team can swap in a smaller or larger Language Model independently of the audio decoder. The XL variant scales the diffusion stage to 4 billion parameters; the standard one stays at 3.5 billion.
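To make the hand-off concrete, here is a hypothetical sketch of what a song blueprint might look like as structured data. The field names and values are illustrative only; the team has not published the exact schema, and the real plan is expressed through chain-of-thought text rather than a literal dict.

```python
# Hypothetical shape of the "song blueprint" the Language Model hands to
# the Diffusion Transformer. Field names are illustrative, not the real schema.
blueprint = {
    "tempo_bpm": 96,
    "key": "A minor",
    "mood": "melancholic, late-night",
    "instruments": ["piano", "upright bass", "brushed drums"],
    "sections": [
        {"label": "intro",  "bars": 4},
        {"label": "verse",  "bars": 16},
        {"label": "chorus", "bars": 8},
        {"label": "outro",  "bars": 4},
    ],
}

# A decoder reading this plan can derive total duration from tempo and
# bar count (assuming 4 beats per bar, a simplification).
total_bars = sum(s["bars"] for s in blueprint["sections"])
seconds = total_bars * 4 * 60 / blueprint["tempo_bpm"]
print(f"{total_bars} bars = {seconds:.0f} s at {blueprint['tempo_bpm']} BPM")
```

The point of the split is visible even in this toy: the plan is cheap to regenerate or edit independently of the expensive audio synthesis stage.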
Three supporting pieces matter for understanding why the model behaves the way it does:
- Sana DCAE compresses the audio into a latent that the diffusion stage can handle quickly, which is the main reason inference is so fast on consumer hardware.
- MERT and mHuBERT handle semantic alignment during training, and they are part of why lyric phonetics actually land most of the time, including in Asian languages.
- Intrinsic reinforcement learning replaces the external reward model that most music systems use, which simplifies the training pipeline and means the team can update the alignment behavior without retraining a separate scorer.
The full stack is open. Weights, code, and an interactive demo are on GitHub, Hugging Face, and ModelScope. That matters more than it sounds, and I will come back to it.
The benchmark numbers, honestly
Here is the comparison the launch coverage leaned on, with all five published metrics rather than the one that flatters ACE-Step the most.
The wins:
- SongEval 8.09. The first time an open model has held this line against a top-tier commercial system.
- Generation speed under 2 seconds per song on an A100. Under 10 seconds on an RTX 3090. The team's claim of 10 to 120x faster than alternatives is roughly correct, depending on the alternative.
- VRAM under 4 GB for the standard model. A laptop GPU runs it.
The losses:
- AudioBox: Suno v5 7.87 versus ACE-Step 1.5 7.42. Audio quality polish still favors Suno.
- Style alignment: Suno scores 46.8 against ACE-Step's 6.47. The scales are different across systems, but the qualitative read is consistent: Suno does a more reliable job of matching the requested genre.
- Lyric alignment: Suno 34.2 vs ACE-Step 8.35. Similar caveat. Suno's vocal phonetics line up with the lyrics more dependably across languages.
Subjective listening matches the numbers. The ACE team's own paper says human evaluators rank ACE-Step 1.5 between Suno v4.5 and v5, and that lines up with how it sounds to my ears. Suno has a polish that ACE-Step has not closed yet. Voices feel a fraction more produced. Instruments sit more obviously in their slots in the mix. The reverb tails are smoother.
The gap is small though. Smaller than the gap between Suno v4 and any open-source model that existed twelve months ago. The trajectory matters as much as the snapshot.
Where ACE-Step is genuinely the best option
A few areas where I would reach for ACE-Step before any of the commercial alternatives, regardless of openness or pricing:
Cover generation with a real dial. Feed ACE-Step a source clip and a style prompt and it preserves the melodic shape, rhythm, and song form while changing the instrumentation, production, and vocal timbre. The strength parameter from 0 to 1 gives you a continuous dial. The community recommends 0.3 to 0.5 for dramatic genre change, 0.5 to 0.7 for moderate restyling, 0.7 and up for subtle. MiniMax Music Cover does the same job differently, with no exposed strength control and a flat per-cover price; that is the right tool for one-shot covers but the wrong one if you want to iterate.
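The community-reported ranges above can be written down as a small lookup. This helper is hypothetical, not part of any ACE-Step API; it just encodes the guidance from the paragraph.

```python
def suggested_strength(goal: str) -> tuple[float, float]:
    """Community-reported strength ranges for cover generation.
    The parameter itself runs 0 to 1; lower values let the restyle
    depart further from the source."""
    ranges = {
        "dramatic": (0.3, 0.5),   # full genre change
        "moderate": (0.5, 0.7),   # restyling within a neighborhood
        "subtle":   (0.7, 1.0),   # light touch, source dominates
    }
    return ranges[goal]

print(suggested_strength("moderate"))
```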
The internal handling of the source clip is worth understanding because it explains some of the model's stranger behaviors. ACE-Step normalizes any uploaded audio to stereo 48 kHz, detects and ignores silence, and repeats anything shorter than 30 seconds until it reaches that minimum length. It then samples three 10-second segments from the front, middle, and back of the clip and concatenates them. A VAE encoder converts that into a latent representation that captures acoustic features while shedding specific melody and rhythm details, which the diffusion stage then reads as conditioning. This is why a very short reference can still steer a longer output, and why a clip with significant silence at the head or tail sometimes produces a result that misses the vibe; the model has been encouraged to reach for the middle of the source.
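The preprocessing steps above can be sketched in a few lines. This is a toy mono version, not the actual implementation: silence detection, stereo handling, and the VAE encoding step are omitted, and the real segment sampling may differ in detail.

```python
SR = 48_000          # uploads are normalized to 48 kHz; mono here for brevity
MIN_SECONDS = 30     # clips shorter than this are looped up to the minimum
SEG_SECONDS = 10     # three 10 s windows: front, middle, back

def prepare_reference(samples: list[float]) -> list[float]:
    """Toy version of the reference pipeline described above: loop short
    clips to 30 s, then concatenate 10 s windows from the front, middle,
    and back of the clip."""
    min_len = MIN_SECONDS * SR
    while len(samples) < min_len:        # repeat short clips to reach 30 s
        samples = samples + samples
    seg = SEG_SECONDS * SR
    n = len(samples)
    starts = (0, (n - seg) // 2, n - seg)
    return [x for s in starts for x in samples[s:s + seg]]
```

Even this toy shows why the model behaves as it does: whatever the source length, the conditioning always sees exactly 30 seconds, and a long clip's head and tail get only a third of the budget each.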
Vocal-to-BGM. Upload a track with vocals and lyrics, and the model can strip the singing and produce a backing track of the same arrangement. Useful for karaoke prototypes, for re-scoring video, for any case where you need an instrumental version of a finished track that does not just mute the vocal channel.
Multilingual coverage. Fifty languages is the headline. The practical good-enough-for-production list is closer to fifteen, but that fifteen includes Mandarin, Japanese, Spanish, Hindi, Arabic, and the major European languages. Suno's multilingual support is decent now, but ACE-Step's lyric alignment in Asian languages is noticeably better in side-by-side listening.
Long-form arrangements. ACE-Step generates from ten seconds to ten minutes. Most commercial systems cap around four. If you are scoring a five-minute video or producing an extended remix, that ceiling is rare in the market.
LoRA fine-tuning. You can train a LoRA from a small set of your own songs and capture your style. Typical vocal timbre, harmonic preferences, drum patterns. Suno does not let you do this. Udio does not let you do this. ACE-Step does, and the fine-tuning runs on consumer hardware in a few hours.
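A back-of-envelope count shows why LoRA fine-tuning fits on consumer hardware. A rank-r adapter on a d_out by d_in projection trains r*(d_in + d_out) numbers instead of d_in*d_out. The dimensions below are placeholders for illustration, not ACE-Step's actual layer sizes.

```python
# Placeholder dimensions for one attention projection; ACE-Step's real
# layer sizes are not published in this article.
d_in, d_out, r = 2048, 2048, 16

full_params = d_in * d_out            # full fine-tune of one layer
lora_params = r * (d_in + d_out)      # LoRA factors A (r x d_in), B (d_out x r)
ratio = full_params / lora_params
print(full_params, lora_params, ratio)
```

With these toy numbers the adapter trains 64x fewer weights per layer, which is the structural reason a few hours on a consumer GPU is enough.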
What open source actually buys you
This is the part most reviews skip past with a single sentence. The license is the headline, but the practical benefits are downstream of it.
Private inference is one. You can run it on a $400 graphics card and produce a four-minute song in ten seconds. Good enough for an artist who wants to keep their generation private, an indie studio that does not want a monthly subscription on top of every other monthly subscription, or a researcher who needs to inspect the latents.
LoRA fine-tuning is the more interesting one, and worth saying again because it is the structural advantage no commercial system currently offers.
Modifications to the lyric alignment system, custom samplers, alternative diffusion schedules. All of that is available to anyone who wants to ship it. The community has already produced ComfyUI integrations, a continuous-generation RADIO mode, and several alternative front-ends.
The cost of open source is ergonomics. Setting up the local install is non-trivial. The interactive Gradio demo does not always behave well with batch jobs. The team's tutorial is thorough but assumes some familiarity with diffusion model parameters. A lot of producers who would benefit from ACE-Step never get past the install.
Using it without setting up a local install
The audio-to-audio tool on Z.Tools exposes both ACE-Step v1.5 Base, which allows up to 100 denoising steps by default, and ACE-Step v1.5 Turbo, which is capped at 20 steps and is faster and cheaper. Both are pay-per-use, billed per second of generated audio.
A four-minute song on Turbo runs about two and a half cents. On Base, around three and a half cents. No subscription, no monthly minimum, no separate Runware account, no Hugging Face login. It sits side-by-side with MiniMax Music Cover for the song-to-song use case.
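The per-second billing is easy to sanity-check. The rates below are derived from the article's approximate figures, not from a published price sheet, so treat them as ballpark.

```python
# Approximate rates backed out from the article's figures (assumptions):
# ~$0.025 per 4-minute song on Turbo, ~$0.035 on Base.
song_seconds = 4 * 60
turbo_per_sec = 0.025 / song_seconds   # ~ $0.0001 per generated second
base_per_sec = 0.035 / song_seconds

# Cost of the fifty-variation iteration loop on Turbo:
batch_cost = 50 * turbo_per_sec * song_seconds
print(f"${batch_cost:.2f} for 50 four-minute drafts on Turbo")
```

That lands around a dollar and a quarter for fifty drafts, which is where the "fifty variations for under two dollars" claim later in this review comes from.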
What you give up by not running locally: the ability to inspect intermediate latents, to fine-tune your own LoRA, to modify the diffusion schedule. If you need any of those, the local install is still the right path. The two routes are not mutually exclusive; I use both, and so do most people I know who work with the model.
What you gain: the time and cognitive overhead of not having to babysit a Python environment, and the option to compare ACE-Step against MiniMax Music Cover from the same panel without juggling a second account.
What ACE-Step still cannot do
A few things to set expectations on, because the launch coverage was glowing enough to make people skip the limitations section.
Vocal cloning of a specific named artist is not on the menu, and it should not be. The model does not ship with the kind of artist-specific embedding that some commercial systems quietly trained on. If you want a track that sounds like Drake, ACE-Step is not the model that will get you there, and the legal landscape would refuse to let you ship the result anyway.
Genre-specific weaknesses are real. The team's own paper acknowledges this. The model is strongest in pop, rock, electronic, and the major Asian-language genre families. It is weaker in classical (orchestral arrangements feel synthetic), in jazz (improvised solos lack the rhythmic looseness of a human player), and in metal (the high-gain guitar tones tend to sound a half-step compressed). These are not deal-breakers for most use cases, but they are worth knowing.
Vocal synthesis is the rough edge that most reviewers point at first. Sustained notes occasionally smear; consonants at high BPMs can blur. The intelligibility is good in nine out of ten generations and rough in the tenth. If you are producing a vocal-forward track for a finished release, plan for a manual pass to fix the syllables that did not land.
Editing operations like in-place repaints can produce unnatural transitions at the boundary. The team has a tutorial for handling this with overlapping windows, but it is more fiddly than the headline "instant repaint" suggests.
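The usual fix for a hard seam is a crossfade across the boundary. This is not the team's overlapping-window recipe verbatim, just a minimal linear crossfade sketch over mono sample lists to show the idea.

```python
def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Linear crossfade: the tail of `a` fades out while the head of `b`
    fades in, smoothing the seam an in-place repaint can leave."""
    head, tail = a[:-overlap], a[-overlap:]
    lead, rest = b[:overlap], b[overlap:]
    mixed = [
        t * (1 - i / overlap) + l * (i / overlap)
        for i, (t, l) in enumerate(zip(tail, lead))
    ]
    return head + mixed + rest

# At 48 kHz, a half-second overlap would be overlap = 24_000 samples.
```

The overlap length is the fiddly part in practice: too short and the transition is still audible, too long and the two generations smear into each other.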
Random seed sensitivity is unusually high. The same prompt with two different seeds can produce noticeably different results, more so than on a typical commercial system. This is a feature when you want variety and a bug when you have hit a result you like and want to reproduce it. Use the seed value the panel reports if you want to come back to a specific generation.
What the launch actually changed for everyone else
A clear answer: it raised the floor.
Before January 28 of this year, the open-source option for text-to-music was a half-step behind commercial systems on every meaningful axis. After, it is competitive on quality and ahead on speed and cost. Suno and Udio still have the polish lead, the bigger training catalog, and the consumer brand. They are also under existential legal pressure from the major labels. UMG settled with Udio in October 2025. WMG settled with both Udio and Suno in November of the same year. Suno dropped its fair-use defense and now requires opt-in for major-label artists. The open model with a transparent training story has a structural advantage that did not exist a year ago.
If you were waiting to test an open-source music model before committing your workflow, this is the one to test.
Who should test ACE-Step first
A short list:
- Producers who already use Suno and want a cheaper iteration loop. Generating fifty test variations on Suno burns through credits fast. ACE-Step Turbo lets you test fifty variations for under two dollars.
- Indie game and video creators who need long-form scoring over four minutes. ACE-Step's ten-minute ceiling is unusual in this market.
- Multilingual content creators, especially in Mandarin, Japanese, or Hindi. Lyric alignment is meaningfully better than the commercial alternatives in those languages.
- Anyone covering an existing track. The cover-generation mode has the cleanest dial-based control I have used.
A shorter list of who should not bother yet:
- Mainstream pop producers chasing radio polish. Suno v5 still has the cleaner mix. Wait for v1.6 or pair ACE-Step with a mastering pass.
- Anyone who needs vocal style transfer to a specific named artist. The model does not do that, and the legal landscape would not let you ship it anyway.
ACE-Step v1.5 is not the model that puts Suno out of business. It is the model that proves an open-source option can sit at the same table without obviously losing. That is a different thing, and arguably the more important one.