By relying on a pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz. Each of the stages is trained with multiple passes over the training data. We use 30- and 10-second random crops of the target audio for the semantic stage and the acoustic stage, respectively. The AudioLM fine acoustic modeling stage is trained on 3-second crops.
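For anyone wondering what those per-stage crop lengths look like in practice, here's a minimal sketch (not from the paper) of taking stage-specific random crops from a 24 kHz waveform. The crop lengths (30 s semantic, 10 s coarse acoustic, 3 s fine acoustic) come from the quoted text; the function and variable names are just illustrative assumptions.

```python
import numpy as np

SAMPLE_RATE = 24_000  # training audio is 24 kHz per the quoted passage

# Crop lengths per training stage, taken from the quoted text.
CROP_SECONDS = {
    "semantic": 30,         # semantic modeling stage
    "coarse_acoustic": 10,  # acoustic modeling stage
    "fine_acoustic": 3,     # AudioLM fine acoustic stage
}

def random_crop(waveform: np.ndarray, stage: str, rng: np.random.Generator) -> np.ndarray:
    """Return a random crop of the waveform sized for the given training stage."""
    crop_len = CROP_SECONDS[stage] * SAMPLE_RATE
    if len(waveform) <= crop_len:
        # Pad short clips with silence so every example has the target length
        # (an assumption; the paper doesn't say how short clips are handled).
        return np.pad(waveform, (0, crop_len - len(waveform)))
    start = rng.integers(0, len(waveform) - crop_len)
    return waveform[start:start + crop_len]

# Example: crop a 60-second clip for the semantic stage.
rng = np.random.default_rng(0)
clip = rng.standard_normal(60 * SAMPLE_RATE).astype(np.float32)
crop = random_crop(clip, "semantic", rng)
print(crop.shape)  # (720000,) -> 30 s at 24 kHz
```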
Not sure what the five million audio clips part is referring to. But they probably actually did this legally?
If this thing is still in the "academic research with no way of generating revenue and no way for the public to get involved" stage, wouldn't it qualify as fair use?
Uhh IDK but you're right that they aren't releasing it either.