Well, this is gonna get weird.
By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz. Each of the stages is trained with multiple passes over the training data. We use 30- and 10-second random crops of the target audio for the semantic stage and the acoustic stage, respectively. The AudioLM fine acoustic modeling stage is trained on 3-second crops.
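For anyone curious what that per-stage cropping boils down to, here's a rough sketch, assuming mono 24 kHz waveforms as NumPy arrays. The crop lengths and sample rate come straight from the quoted excerpt; the function name, stage labels, and padding behavior are just made up for illustration, not how the paper actually implements it.

```python
import numpy as np

SAMPLE_RATE = 24_000  # Hz, as stated in the excerpt

# Crop lengths per training stage (seconds), per the quoted text.
CROP_SECONDS = {
    "semantic": 30,
    "coarse_acoustic": 10,
    "fine_acoustic": 3,
}

def random_crop(waveform: np.ndarray, stage: str, rng: np.random.Generator) -> np.ndarray:
    """Return a random fixed-length crop of `waveform` for the given stage."""
    crop_len = CROP_SECONDS[stage] * SAMPLE_RATE
    if waveform.shape[-1] <= crop_len:
        # Pad short clips with trailing silence so every example has the target length
        # (an assumption; the paper doesn't say how short clips are handled).
        return np.pad(waveform, (0, crop_len - waveform.shape[-1]))
    start = rng.integers(0, waveform.shape[-1] - crop_len + 1)
    return waveform[start:start + crop_len]

# Example: crop a fake 60-second clip for the semantic stage.
rng = np.random.default_rng(0)
clip = rng.standard_normal(60 * SAMPLE_RATE).astype(np.float32)
print(random_crop(clip, "semantic", rng).shape)  # (720000,) == 30 s at 24 kHz
```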
Not sure what the five million audio clips part is referring to. But they probably actually did this legally?
If this thing is still in the "academic research with no way of generating revenue and no way for the public to get involved" stage, wouldn't it qualify as fair use?
:cat-vibing:
the hip hop one is like what English sounds like to non-English speakers
New Sims expansion pack soundtrack music just dropped
I used to like hearing AIs try to recreate complete songs from snippets. Now it's just eerie.
OH NO THEY AUTOMATED HATSUNE MIKU :commiku:
The vocals are the most obvious weak point. Synthesis of human speech and singing is still far off, and this program doesn't write lyrics; it just smashes phonemes together in a manner far less eloquent than Simlish...