release to pypi
Files changed:
- .gitignore +1 -0
- README.md +74 -60
- setup.cfg +1 -0
.gitignore
CHANGED
@@ -5,6 +5,7 @@ data
 models
 flagged
 build
+dist
 audiodiffusion.egg-info
 lightning_logs
 taming
README.md
CHANGED
@@ -11,7 +11,7 @@ license: gpl-3.0
---
# audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)

## Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package

---

@@ -41,7 +41,6 @@ A DDPM is trained on a set of mel spectrograms that have been generated from a d

You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).

| Model | Dataset | Description |
|-------|---------|-------------|
| [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
@@ -54,117 +53,132 @@ You can play around with some pre-trained models on [Google Colab](https://colab

---

## Generate Mel spectrogram dataset from directory of audio files

#### Install

```bash
pip install .
```

#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results

```bash
python scripts/audio_to_images.py \
  --resolution 64,64 \
  --hop_length 1024 \
  --input_dir path-to-audio-files \
  --output_dir path-to-output-data
```

#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`)

```bash
python scripts/audio_to_images.py \
  --resolution 256 \
  --input_dir path-to-audio-files \
  --output_dir data/audio-diffusion-256 \
  --push_to_hub teticio/audio-diffusion-256
```

Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
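As a quick illustration of how these parameters interact (an editorial sketch, not repository code; the `hop_length` of 512 used below is an assumed default):

```python
# Each spectrogram column corresponds to hop_length audio samples, so one x_res-wide
# image covers roughly x_res * hop_length / sample_rate seconds of audio.
def seconds_per_image(x_res: int, hop_length: int, sample_rate: int = 22050) -> float:
    return x_res * hop_length / sample_rate

print(seconds_per_image(64, 1024))  # ~3.0 s per 64x64 image with the settings above
print(seconds_per_image(256, 512))  # ~5.9 s per 256x256 image if hop_length stays at 512
```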

## Train model

#### Run training on local machine

```bash
accelerate launch --config_file config/accelerate_local.yaml \
  scripts/train_unconditional.py \
  --dataset_name data/audio-diffusion-64 \
  --hop_length 1024 \
  --output_dir models/ddpm-ema-audio-64 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no
```

#### Run training on local machine with a `batch_size` of 2 and `gradient_accumulation_steps` of 8 to compensate, so that a 256x256 resolution model fits on a commercial grade GPU, and push to hub

```bash
accelerate launch --config_file config/accelerate_local.yaml \
  scripts/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --output_dir models/audio-diffusion-256 \
  --num_epochs 100 \
  --train_batch_size 2 \
  --eval_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no \
  --push_to_hub True \
  --hub_model_id audio-diffusion-256 \
  --hub_token $(cat $HOME/.huggingface/token)
```
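For reference, gradient accumulation keeps the effective batch size the same as in the 64x64 run above (illustrative arithmetic, not repository code):

```python
# Weights are updated once every gradient_accumulation_steps micro-batches,
# so the effective batch size is the product of the two settings.
train_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16, matching the train_batch_size of the 64x64 run
```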

#### Run training on SageMaker

```bash
accelerate launch --config_file config/accelerate_sagemaker.yaml \
  scripts/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --output_dir models/ddpm-ema-audio-256 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no
```

## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))

#### A DDIM can be trained by adding the parameter

```bash
--scheduler ddim
```

Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
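For intuition, spherical linear interpolation between two noise tensors can be sketched with the standard formula below; this is a generic re-implementation for illustration and not necessarily the exact signature of the repo's `slerp`.

```python
import torch

def slerp_sketch(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherically interpolate between two noise tensors of the same shape."""
    theta = torch.acos(torch.dot(x0.flatten(), x1.flatten()) / (x0.norm() * x1.norm()))
    return (torch.sin((1 - alpha) * theta) * x0 + torch.sin(alpha * theta) * x1) / torch.sin(theta)

# Halfway between two Gaussian noise samples (e.g. the kind recovered by `encode`):
a, b = torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256)
halfway = slerp_sketch(a, b, 0.5)
```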

## Latent Audio Diffusion

Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
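To make the compression concrete, here is a minimal sketch of round-tripping a batch of spectrogram images through a `diffusers`-format VAE; the checkpoint path is a placeholder and the latent shape depends on the autoencoder used.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("path-to-a-converted-vae")  # placeholder path

images = torch.randn(1, 3, 256, 256)  # stand-in for a batch of Mel spectrogram images
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()  # much lower-dimensional than the input
    reconstruction = vae.decode(latents).sample
```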

At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.

#### Train latent diffusion model using pre-trained VAE

```bash
accelerate launch ...
  ...
  --vae teticio/latent-audio-diffusion-256
```

#### Install dependencies to train with Stable Diffusion

```bash
pip install omegaconf pytorch_lightning
pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
```

#### Train an autoencoder

```bash
python scripts/train_vae.py \
  --dataset_name teticio/audio-diffusion-256 \
  --batch_size 2 \
  --gradient_accumulation_steps 12
```

#### Train latent diffusion model

```bash
accelerate launch ...
  ...
  --vae models/autoencoder-kl
```
setup.cfg
CHANGED
@@ -3,6 +3,7 @@ name = audiodiffusion
 version = attr: audiodiffusion.VERSION
 description = Generate Mel spectrogram dataset from directory of audio files.
 long_description = file: README.md
+long_description_content_type = text/markdown
 license = GPL3
 classifiers =
     Programming Language :: Python :: 3