An Independence-promoting Loss for Music Generation with Language Models

  • Our ICML paper is already available on arXiv
  • Code is available on GitHub
  • We aim to release the weights of our 32 kHz EnCodec-MMD soon; please bear with us.

Abstract

Music generation schemes using language modeling rely on a vocabulary of audio tokens, mostly provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens; the decoding strategy used for token prediction must therefore be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream language models. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
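
To make the idea of an MMD-based independence criterion more concrete, here is a minimal NumPy sketch. It is an illustration under our own assumptions, not the implementation from the paper or the released code: the function names, the Gaussian kernel, the fixed bandwidth, and the trick of shuffling each codebook along the batch axis to approximate samples from the product of the marginals are all choices made for this example.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )

def independence_penalty(codebook_latents, rng, bandwidth=1.0):
    """Squared MMD between the joint samples and an approximation of the
    product of the marginals (hypothetical helper for this sketch).

    codebook_latents: list of K arrays of shape (batch, dim), one per codebook.
    """
    # Joint distribution: keep the codebooks aligned sample by sample.
    joint = np.concatenate(codebook_latents, axis=1)
    # Product of marginals: permute each codebook independently over the batch,
    # which breaks the dependence across codebooks but keeps each marginal.
    shuffled = [z[rng.permutation(z.shape[0])] for z in codebook_latents]
    product = np.concatenate(shuffled, axis=1)
    return mmd2(joint, product, bandwidth)

# Usage: fully dependent codebooks should receive a larger penalty than
# independent ones.
rng = np.random.default_rng(0)
z1 = rng.standard_normal((512, 2))
z2 = rng.standard_normal((512, 2))
penalty_dependent = independence_penalty([z1, z1.copy()], rng)
penalty_independent = independence_penalty([z1, z2], rng)
```

In a training setup, a term of this kind would be added to the auto-encoder objective with some weight, using a differentiable framework rather than NumPy. The penalty is close to zero when the codebooks are independent, which is the regime in which the product of the marginals matches the joint distribution.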

Comparison with baselines

We present here samples from our internal dataset. For each method except Mustango, 20 seconds of audio are generated and 10 seconds are retained, starting from an offset chosen at random. For Mustango, we were only able to generate 10 seconds of audio, so its generation task is slightly easier than that of the other methods.
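
The cropping step described above could look like the following sketch. The helper name `random_crop`, the uniform choice of the offset, and the 32 kHz sample rate used in the example are assumptions made for illustration; the exact procedure used to prepare the samples is not specified here.

```python
import numpy as np

def random_crop(audio, sample_rate, keep_seconds, rng):
    """Keep `keep_seconds` of audio starting at a uniformly random offset.

    audio: array whose last axis is time, e.g. (channels, num_samples).
    Returns the cropped audio and the chosen offset in samples.
    """
    keep = int(round(keep_seconds * sample_rate))
    total = audio.shape[-1]
    if total < keep:
        raise ValueError("audio is shorter than the requested crop")
    # The upper bound is exclusive, so the last valid offset is total - keep.
    offset = int(rng.integers(0, total - keep + 1))
    return audio[..., offset:offset + keep], offset

# Usage: retain 10 seconds out of a 20-second generation (assumed 32 kHz mono).
rng = np.random.default_rng(0)
generated = np.zeros((1, 20 * 32000))
clip, offset = random_crop(generated, 32000, 10.0, rng)
```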

  • The conditioning is a textual description obtained from the MusicCaps annotation
  • MusicGen-MMD is our proposed method, using the EnCodec-MMD codec with promoted independence between codebooks
  • MusicGen is retrained on the same internal dataset, following instructions similar to those in [1]
  • AudioLDM [2] is a latent diffusion model using CLAP audio-textual embeddings
  • AudioLDM2-Music is the music-specialized checkpoint of AudioLDM2 [3], which, compared with its predecessor [2], additionally leverages a self-supervised general audio representation
  • Mustango [4] is another latent diffusion model providing fine-grained control over music components such as rhythm, chords, etc.

Reference | MusicGen-MMD | MusicGen [1] | AudioLDM [2] | AudioLDM2-Music [3] | Mustango [4]
"Driving and powerful with Hardcore rock elements, featuring gritty electric guitar, bass, drums, and synthesizer, creating a bold and edgy mood."
"Western Classical piece with orchestra."
"Beautiful piano ballad with full string section."
"Pulsing and heroic, featuring lush strings, flowing woodwinds, warm brass, and bright choir that create a noble, triumphant mood."
"Dark and pulsing, with production / film scores epic elements featuring intense electric guitar, synthesizer, and percussion to create a powerful and anticipatory mood."
"Upbeat and energetic, featuring a bright bell melody, electric guitar and acoustic guitar that creates an enthusiastic mood."
"Groovy and trippy, with future bass elements featuring vibrant Rhodes keyboard, synthesizer, bass, and ethnic strings to create a satisfied and feel-good mood."

References