---
license: mit
datasets:
- M2UGen/MUCaps
- M2UGen/MUEdit
- M2UGen/MUImage
- M2UGen/MUVideo
---
|
# M<sup>2</sup>UGen Model with MusicGen-medium |
|
|
|
The M<sup>2</sup>UGen model is a Music Understanding and Generation model capable of Music Question Answering, Music Generation from text, images, videos and audio, and Music Editing. The model uses encoders for multi-modal understanding (MERT for music, ViT for images and ViViT for video) and the MusicGen/AudioLDM2 model as the music generation model (music decoder), coupled through adapters with the LLaMA 2 model to enable these combined abilities.
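Conceptually, each encoder's features are projected by an adapter into the LLaMA 2 embedding space, where they can be consumed alongside text tokens; the LLM's output then conditions the music decoder. The PyTorch sketch below illustrates only this adapter/data-flow idea with dummy dimensions; the class and variable names are hypothetical, not the repository's actual API.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the M2UGen adapter idea (illustrative names,
# not the repository's actual API): each modality encoder's features
# are projected into the LLaMA 2 embedding space.

class ModalityAdapter(nn.Module):
    """Projects encoder features into the LLM's embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# Dummy feature sizes, for illustration only.
MERT_DIM, VIT_DIM, VIVIT_DIM, LLM_DIM = 1024, 768, 768, 4096

music_adapter = ModalityAdapter(MERT_DIM, LLM_DIM)   # MERT  -> LLaMA 2
image_adapter = ModalityAdapter(VIT_DIM, LLM_DIM)    # ViT   -> LLaMA 2
video_adapter = ModalityAdapter(VIVIT_DIM, LLM_DIM)  # ViViT -> LLaMA 2

# Fake MERT features for one music clip: (batch, seq_len, enc_dim).
music_feats = torch.randn(1, 256, MERT_DIM)
llm_tokens = music_adapter(music_feats)  # ready to join the text token stream
print(llm_tokens.shape)  # torch.Size([1, 256, 4096])
```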
|
|
|
M<sup>2</sup>UGen was published in [M<sup>2</sup>UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models](https://arxiv.org/abs/2311.11255) by *Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun and Ying Shan*. |
|
|
|
The code repository for the model is available at [crypto-code/M2UGen](https://github.com/crypto-code/M2UGen). Clone the repository, download the checkpoint, and run the following for a model demo:
|
```bash
python gradio_app.py --model ./ckpts/M2UGen-MusicGen-medium/checkpoint.pth --llama_dir ./ckpts/LLaMA-2 --music_decoder musicgen --music_decoder_path facebook/musicgen-medium
```
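
If the checkpoint is hosted on the Hugging Face Hub, it can also be fetched programmatically with `huggingface_hub` before launching the demo. In the sketch below the `repo_id` and `filename` are assumptions inferred from this model card and the demo command above, so adjust them to the checkpoint's actual Hub location.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and filename (inferred from the demo command above);
# adjust to the actual Hub location of the checkpoint.
ckpt_path = hf_hub_download(
    repo_id="M2UGen/M2UGen-MusicGen-medium",
    filename="checkpoint.pth",
    local_dir="./ckpts/M2UGen-MusicGen-medium",
)
print(ckpt_path)
```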
|
|
|
## Citation |
|
|
|
If you find this model useful, please consider citing: |
|
|
|
```bibtex
@article{hussain2023m,
  title={{M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models}},
  author={Hussain, Atin Sakkeer and Liu, Shansong and Sun, Chenshuo and Shan, Ying},
  journal={arXiv preprint arXiv:2311.11255},
  year={2023}
}
```