metadata

title: README
emoji: 👀
colorFrom: green
colorTo: yellow
sdk: static
pinned: false

We are a group of people training music LLMs~ 🔥

Introduction to our series of work

The development log of our Music Audio Pre-training (m-a-p) model family:

17/03/2023: we released two advanced music understanding models, MERT-v1-95M and MERT-v1-330M, trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
14/03/2023: we retrained the MERT-v0 model on an open-source-only music dataset and released it as MERT-v0-public.
29/12/2022: we released MERT-v0, a music understanding model trained with the MLM paradigm, which performs better on downstream tasks.
29/10/2022: we released music2vec, a pre-trained MIR model trained with the BYOL paradigm.

Here is a table for quick model selection:

Name	Pre-train Paradigm	Training Data (hours)	Pre-train Context (seconds)	Model Size	Transformer Layer-Dimension	Feature Rate	Sample Rate	Release Date
MERT-v1-330M	MLM	160K	5	330M	24-1024	75 Hz	24K Hz	17/03/2023
MERT-v1-95M	MLM	20K	5	95M	12-768	75 Hz	24K Hz	17/03/2023
MERT-v0-public	MLM	900	5	95M	12-768	50 Hz	16K Hz	14/03/2023
MERT-v0	MLM	1000	5	95M	12-768	50 Hz	16K Hz	29/12/2022
music2vec-v1	BYOL	1000	30	95M	12-768	50 Hz	16K Hz	30/10/2022
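
As a rough illustration of how the table can drive model selection, the sketch below copies a few columns into a Python list and filters by a parameter budget and by the sample rate of your audio. The `MODELS` list and the `pick_models` helper are hypothetical names introduced here for illustration; they are not part of any m-a-p package. It also assumes that "24K Hz" and "16K Hz" mean 24000 Hz and 16000 Hz.

```python
# Rows copied from the table above (sizes in millions of parameters).
# Assumption: "24K Hz" / "16K Hz" are read as 24000 / 16000 Hz.
MODELS = [
    {"name": "MERT-v1-330M", "params_m": 330, "sample_rate_hz": 24000},
    {"name": "MERT-v1-95M", "params_m": 95, "sample_rate_hz": 24000},
    {"name": "MERT-v0-public", "params_m": 95, "sample_rate_hz": 16000},
    {"name": "MERT-v0", "params_m": 95, "sample_rate_hz": 16000},
    {"name": "music2vec-v1", "params_m": 95, "sample_rate_hz": 16000},
]


def pick_models(max_params_m, sample_rate_hz=None):
    """Return the names of models within a parameter budget.

    Hypothetical helper: if sample_rate_hz is given, only models trained
    at that sample rate are kept.
    """
    return [
        m["name"]
        for m in MODELS
        if m["params_m"] <= max_params_m
        and (sample_rate_hz is None or m["sample_rate_hz"] == sample_rate_hz)
    ]


if __name__ == "__main__":
    # Models of at most 100M parameters trained on 24 kHz audio.
    print(pick_models(100, sample_rate_hz=24000))
```

Filtering on sample rate first is usually the practical choice, since audio at a different rate would need to be resampled before being fed to the model.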

Explanation

The m-a-p models share a similar architecture; the main difference between them is the pre-training paradigm. Beyond that, there are several technical configuration details to know before using them:

Model Size: the number of parameters that will be loaded into memory. Please select a size that fits your hardware.
Transformer Layer-Dimension: the number of transformer layers and the corresponding feature dimension that the model can output. This is listed because features extracted from different layers can perform differently depending on the task.
Feature Rate: the number of features the model outputs for 1 second of audio input.
Sample Rate: the sampling frequency of the audio the model was trained on.
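
To make these definitions concrete, here is a small back-of-the-envelope sketch of the arithmetic they imply. The function names are hypothetical and introduced only for illustration. The memory estimate assumes 32-bit floats (4 bytes per parameter) and counts weights only, not activations. The frame count is approximate, since the real model may produce slightly fewer frames because of edge handling.

```python
def num_input_samples(duration_s, sample_rate_hz):
    """Number of raw audio samples in a clip of the given duration."""
    return int(duration_s * sample_rate_hz)


def expected_feature_shape(duration_s, feature_rate_hz, hidden_dim):
    """Approximate (frames, dimension) of the features from one layer."""
    return (int(duration_s * feature_rate_hz), hidden_dim)


def approx_weight_memory_mb(params_millions, bytes_per_param=4):
    """Approximate weight memory in MB (1 MB = 10**6 bytes).

    Assumption: fp32 weights by default; activations are not counted.
    """
    return params_millions * 1e6 * bytes_per_param / 1e6


if __name__ == "__main__":
    # MERT-v1-95M row: 24000 Hz sample rate, 75 Hz feature rate, 12-768, 5 s context.
    print(num_input_samples(5, 24000))          # 120000
    print(expected_feature_shape(5, 75, 768))   # (375, 768)
    print(approx_weight_memory_mb(95))          # 380.0
```

The same arithmetic applies to the 50 Hz / 16K Hz models: a 5-second clip is 80000 samples and yields roughly 250 frames per layer.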