---
base_model: gpt2
tags:
- generated_from_trainer
- midi
model-index:
- name: midi_model_3
  results: []
datasets:
- TristanBehrens/js-fakes-4bars
---

# midi_model_3

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on the js-fakes-4bars dataset.
It achieves the following results on the evaluation set:
- Loss: 0.5542

## Model description

This model generates encoded midi that follows the format of jsfakes chorales.
This representation makes it possible to train traditional language models on midi data.
See also Magenta's note-seq library [here](https://github.com/magenta/note-seq).

## Intended uses & limitations

This model generates basic encoded midi in the jsfakes style, as a proof of concept.
It is very limited, and mainly demonstrates that this kind of model can be trained and hosted completely for free.
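
As a rough illustration of how it could be used, here is a minimal generation sketch with the transformers library. This is not the demo's exact code: the repo id below is an assumption (substitute this model's actual id), and the sampling settings are arbitrary.

```python
# Minimal sketch, not the demo's exact code. The repo id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Katpeeler/midi_model_3"  # assumption: replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt with the start-of-piece token and sample a continuation.
inputs = tokenizer("PIECE_START", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0]))
```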

## Training and evaluation data

This model is trained on the js-fakes-4bars dataset, which is a tokenized version of the JS-Fakes dataset by Omar Peracha.

- Link to the original dataset [here](https://github.com/omarperacha/js-fakes)
- Link to the tokenized dataset [here](https://huggingface.co/datasets/TristanBehrens/js-fakes-4bars)
- Training set is 4.02k rows
- Test set is 463 rows

The data encodes midi information as encoded text. Here are some examples of what the data looks like:

- PIECE_START (The start of the midi.)
- PIECE_END (The end of the midi.)
- STYLE=JSFAKES (A style tag, which is unused in this dataset.)
- GENRE=JSFAKES (A genre tag, also unused in this dataset.)
- TRACK_START (The start of an instrument's track.)
- TRACK_END (The end of an instrument's track.)
- INST=48 (The instrument the notes will belong to.)
- BAR_START (The start of a musical measure.)
- BAR_END (The end of a musical measure.)
- NOTE_ON=57 (Specifies the note that will start.)
- NOTE_OFF=57 (Specifies the note that will end.)
- TIME_DELTA=4 (How long the note plays for.)
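
Putting these together, a minimal single-track, single-bar sequence built from the tokens above might look like this (an illustrative toy example, not a row from the dataset):

```
PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=57 TIME_DELTA=4 NOTE_OFF=57 BAR_END TRACK_END PIECE_END
```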

## Training procedure

Training was done through Google Colab's free tier, using a single 15GB Tesla T4 GPU.
Training was logged through Weights and Biases.
A link to the full training notebook can be found [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=90KN1wRGWshW).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 2
- seed: 1
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 10
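
For reference, here is a hedged sketch of how these settings might map onto Hugging Face `TrainingArguments`. The `output_dir` is a placeholder, and the evaluation cadence of 300 steps is taken from the results table below; anything not listed above is left at its default.

```python
# Hedged sketch, not the exact notebook cell: the hyperparameters above mapped
# onto TrainingArguments. output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="midi_model_3",          # placeholder
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    seed=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    num_train_epochs=10,
    evaluation_strategy="steps",        # evaluate every 300 steps, per the table below
    eval_steps=300,
    logging_steps=300,
)
```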

### Training Statistics

- Total training runtime: 787 seconds (around 13 minutes)
- Training samples per second: 45.91
- Training steps per second: 11.484
- Average GPU watt usage: 66W
- Average GPU temperature: 77C

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.8047        | 0.33  | 300  | 0.7969          |
| 0.7924        | 0.66  | 600  | 0.7735          |
| 0.7758        | 1.0   | 900  | 0.7528          |
| 0.75          | 1.33  | 1200 | 0.7436          |
| 0.7432        | 1.66  | 1500 | 0.7277          |
| 0.7361        | 1.99  | 1800 | 0.7175          |
| 0.7121        | 2.32  | 2100 | 0.7025          |
| 0.708         | 2.65  | 2400 | 0.6861          |
| 0.6971        | 2.99  | 2700 | 0.6781          |
| 0.6777        | 3.32  | 3000 | 0.6718          |
| 0.6733        | 3.65  | 3300 | 0.6578          |
| 0.6643        | 3.98  | 3600 | 0.6500          |
| 0.6422        | 4.31  | 3900 | 0.6423          |
| 0.6401        | 4.65  | 4200 | 0.6330          |
| 0.6302        | 4.98  | 4500 | 0.6228          |
| 0.6103        | 5.31  | 4800 | 0.6148          |
| 0.6066        | 5.64  | 5100 | 0.6069          |
| 0.5995        | 5.97  | 5400 | 0.5979          |
| 0.5724        | 6.31  | 5700 | 0.5915          |
| 0.5772        | 6.64  | 6000 | 0.5870          |
| 0.5677        | 6.97  | 6300 | 0.5771          |
| 0.5491        | 7.3   | 6600 | 0.5740          |
| 0.5433        | 7.63  | 6900 | 0.5675          |
| 0.5384        | 7.96  | 7200 | 0.5630          |
| 0.5245        | 8.3   | 7500 | 0.5611          |
| 0.5206        | 8.63  | 7800 | 0.5578          |
| 0.5198        | 8.96  | 8100 | 0.5553          |
| 0.5141        | 9.29  | 8400 | 0.5544          |
| 0.5091        | 9.62  | 8700 | 0.5543          |
| 0.5096        | 9.96  | 9000 | 0.5542          |


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0


<hr/>

The sections below this point serve as a user guide for the Hugging Face space found [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

The Google Colab notebook that goes with this can be found [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=90KN1wRGWshW).

<hr/>



# Introduction

Midi_space2 allows the user to generate a four-bar musical progression, and listen back to it. 
There are two sections to interact with: audio generation and token generation.

- Audio generation contains 3 sliders:
  - Inst number: a value that adjusts the tonality of the sound.
  - Note number: a value that adjusts the reference pitch the sound is generated from.
  - BPM: the beats per minute, or the speed of the sound.
 
- Token generation is a secondary function, and allows the user to see what the language model generated.
  - Please note that this section will display an "error" if used before any audio is generated.
  - This section shows the tokens that are responsible for the audio you hear in the audio generation section.



## Usage

To run the demo, click on the link [here](https://huggingface.co/spaces/Katpeeler/Midi_space2).

The demo will default to the "audio generation" tab. Here you will find the 3 sliders you can interact with. These are:

- Inst number
- Note number
- BPM

When you have selected values you want to try, click the "generate audio" button at the bottom.
When your audio is ready, you will see the audio waveform displayed within the "audio" box, found above the sliders.
**Note:**
Due to how audio is handled in Google Chrome, you may have to generate the audio a few times when using this demo for the first time.

Additionally, you may select the "Token Generation" tab and click the "show generated tokens" button to see the raw text data.



## Documentation

You can view the Google Colab notebook used for training [here](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=90KN1wRGWshW).


- The demo is currently hosted as a Gradio application on Hugging Face Spaces.
- Audio playback is handled with the soundfile package.
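
As a hedged sketch (not the actual app.py), writing synthesized samples out with soundfile so a Gradio audio component can play them back might look like this; the sample rate and the helper name are assumptions:

```python
# Hedged sketch, not the demo's exact code.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 44100  # assumption

def save_audio(samples: np.ndarray, path: str = "generated.wav") -> str:
    # `samples` is assumed to be a float waveform produced by a synthesizer
    # (e.g. note_seq's fluidsynth helper); soundfile handles the WAV encoding.
    sf.write(path, samples, SAMPLE_RATE)
    return path
```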

The core components are this gpt-2 model, [js-fakes-4bars dataset](https://huggingface.co/datasets/TristanBehrens/js-fakes-4bars), and [note-seq](https://github.com/magenta/note-seq).
The dataset was created by [Tristan Behrens](https://huggingface.co/TristanBehrens), and is relatively small.
Its small size made it well suited to training a gpt2 model through the free tier of Google Colab. I selected this dataset after finding
a different dataset on HuggingFace, [mmm_track_lmd_8bars_nots](https://huggingface.co/datasets/juancopi81/mmm_track_lmd_8bars_nots).
I initially used that dataset, but ran out of free-tier compute resources about 3 hours into training. This setback ultimately
made me decide to use a smaller dataset for the time being.

- Js-fakes dataset size: 13.7 MB, 4,479 rows (the one I actually used)
- Juancopi81 dataset size: 490 MB, 177,567 rows (the one I attempted to use first)

For the remainder of this post, we will only discuss the js-fakes dataset.
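
As a quick sketch, the tokenized dataset can be pulled straight from the Hub with the datasets library; the `"text"` column name is an assumption.

```python
# Hedged sketch: load the js-fakes-4bars dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("TristanBehrens/js-fakes-4bars")
print(ds)                      # available splits and row counts
print(ds["train"][0]["text"])  # one encoded-midi example ("text" column assumed)
```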

After downloading, the training split contained 3614 rows, and the test split contained 402 rows. Each entry follows this format:

```
PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=48 BAR_START NOTE_ON=70 TIME_DELTA=2 NOTE_OFF=70 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=72 TIME_DELTA=2 NOTE_OFF=72 NOTE_ON=70 TIME_DELTA=4 NOTE_OFF=70 NOTE_ON=69 TIME_DELTA=2 NOTE
```

This data is in a very specific tokenized format, representing the information that is relevant to midi data. Of note:

- NOTE_ON=## : represents the start of a musical note, and which note to play (A, B, C, etc.).
- TIME_DELTA=4 : represents a quarter note. A half note is represented by TIME_DELTA=8, and an eighth note would be represented by TIME_DELTA=2.
- NOTE_OFF=## : represents the end of a musical note, and which note to end.

These text-based tokens contain the necessary information to create midi, a standard form of synthesized music data.
The dataset has already been converted from midi files into this text-based format.
This format is called "MMM", or Multi-Track Music Machine, proposed in the paper found [here](https://arxiv.org/abs/2008.06048).

**Note:**
I created a tokenizer for this task, and uploaded it to my HuggingFace profile. However, I ended up using the auto-tokenizer from the fine-tuned model, 
so I won't be exploring that further.

I used Tristan Behrens' js-fakes-4bars tokenizer to tokenize the dataset for training. I selected a context length of 512 and truncated any text longer than that,
which helped keep training within the limited compute resources available.
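
A sketch of that tokenization step, assuming the dataset loaded as `ds` above and a Hub-hosted tokenizer (the repo id below is a placeholder for whichever tokenizer was actually used):

```python
# Hedged sketch of the tokenization step described above.
from transformers import AutoTokenizer

# Placeholder id: substitute the tokenizer actually used for training.
tokenizer = AutoTokenizer.from_pretrained("TristanBehrens/js-fakes-4bars-tokenizer")

def tokenize(batch):
    # Truncate every example to the 512-token context length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_ds = ds.map(tokenize, batched=True)
```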

The gpt-2 model used has 19.2M parameters. It was trained for 10 epochs, with evaluation every 300 steps. The model on this page is the third iteration; you can find the first two on my HuggingFace profile.
I ended up using a batch size of 4 to further reduce the VRAM requirements in Google Colab. Specifics for the training can be found at the top of this page, but some fun things to note are:

- Total training runtime: around 13 minutes
- Training samples per second: 45.91
- Training steps per second: 11.484
- Average GPU watt usage: 66W
- Average GPU temperature: 77C

I think it's important to note the power draw of the GPUs used for training models as this technology becomes increasingly widespread.
I obtained those values through [Weights and Biases](https://wandb.ai/site), which I ran alongside my training. 
The training method used is outlined in a blog post by Juancopi81 [here](https://huggingface.co/blog/juancopi81/using-hugging-face-to-train-a-gpt-2-model-for-musi#showcasing-the-model-in-a-%F0%9F%A4%97-space).
While I didn't follow that post exactly, it was of great help when learning how to do this.

The final component to talk about is [Magenta's note_seq library](https://github.com/magenta/note-seq). This is how token sequences are converted to note sequences and played back.
The library is far more powerful than my current use of it, and I plan on expanding this project in the future to incorporate more of its features.
The main method call for this can be found in the app.py file on the HuggingFace space, but here is a snippet of the code for NOTE_ON:

```python
elif token.startswith("NOTE_ON"):
    # Parse the pitch from the token and open a new note at the current time.
    pitch = int(token.split("=")[-1])
    note = note_sequence.notes.add()
    note.start_time = current_time
    note.end_time = current_time + 4 * note_length_16th
    note.pitch = pitch
    note.instrument = current_instrument
    note.program = current_program
    note.velocity = 80
    note.is_drum = current_is_drum
    # Remember the open note so a later NOTE_OFF can close it.
    current_notes[pitch] = note
```
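
For context, the neighbouring branches of that same if/elif chain might look roughly like this (a hedged sketch, not the exact app.py code): TIME_DELTA advances the running clock, and NOTE_OFF closes the matching open note.

```python
# Hedged sketch of the neighbouring branches, not the exact app.py code.
elif token.startswith("TIME_DELTA"):
    # Advance the running clock by the given number of 16th-note steps.
    delta = int(token.split("=")[-1])
    current_time += delta * note_length_16th
elif token.startswith("NOTE_OFF"):
    # Close the matching open note at the current time.
    pitch = int(token.split("=")[-1])
    if pitch in current_notes:
        current_notes[pitch].end_time = current_time
        del current_notes[pitch]
```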

In short, there are instructions for each type of token in the vocabulary, and once you identify what a token is supposed to represent,
it can easily be mapped to do whatever we want! Pretty cool, and it supports as many instruments as you want.



## Experiments

There were two other methods considered for this task: a basic n-gram language model, and Meta's Llama-2-70b-chat-hf. 
Both were accessible and offer different approaches to the task. Ultimately, neither felt appropriate.

Llama-2 struggled to understand the task and to provide consistent results. The main approach was to use prompt engineering to attempt few-shot generation of the tokenized midi data.
Various initialization prompts were tried; the following one was used:

*You are a midi generator, and only respond with tokens representing midi data. I will provide 3 examples of different songs in an encoded format for you, and then ask you to generate your own encoded midi song.*

This prompt was the only instance where Llama-2 responded with an answer that resembled something correct. Interestingly enough, it resulted in the model explaining the encoded example. An excerpt of that is given below:

*This is a MIDI file containing four tracks, each with its own unique melody and rhythm. Here's a breakdown of each track:
Track 1 (Instrument 0):
This track features a simple melody using the notes C, D, E, F, G, A, and B. The rhythm is mostly quarter notes, with some eighth note pairs and rests.*

However, after this, the model went on a tangent, saying the rest of the examples all played "A, B, C, D, E, F, G" repeatedly, which is incorrect.
The model was also not asked to explain the examples. I did get a generation in the style of the provided examples after providing about 10 examples, 
but I couldn't get more than 1 generation after that to work. Most responses from Llama went like this:

*The examples you provided use the NoteOn and NoteOff events to represent notes being played and released. In a standard MIDI file, these events would be replaced by the NoteOn and NoteOff commands, which have different values and meanings.*

Of all the attempts, I did get Llama to generate the following:

```
PIECE_START
STYLE=JSFAKES
GENRE=JSFAKES
TRACK_START
INST=0
BAR_START
NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60
NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62
NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64
NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65
BAR_END
TRACK_END
```

This follows the correct format! However, the "song" is simply four quarter notes, in response to a request for a "4-bar midi song in the JSFakes style".
Regardless of the prompting used, Llama could not produce an output that matched the criteria, so it was not used for this demo.

The other method, using a basic n-gram model trained on the dataset, performed better. 
This method generates encoded midi data correctly, unlike the Llama-2 model. 
You can find the code for this model in the same [Google Colab notebook](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=jzKXNr4eFrpA) as the training for the gpt-2 model.
This method uses a count-based approach and can be configured for any number of n-grams; a minimal sketch of the idea appears after the list below. Both bi-gram and tri-gram configurations generate similar results. 
The vocabulary size ends up being 114, which makes sense; the language used for the encoded midi is fairly limited. Some fun things to mention here are:

- TIME_DELTA=4 is the most common n-gram. This makes sense, as most notes are quarter notes in the training data, and this is found almost every time a note is played.
- TIME_DELTA=2 is the second most common. This also makes sense; these are eighth notes.
- PIECE_START, PIECE_END, STYLE=JSFAKES, and GENRE=JSFAKES are the least common. These only appear once in each example.
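
Here is a minimal count-based bigram sketch of the idea (an illustration of the approach, not the notebook's exact code):

```python
# Minimal count-based bigram sketch; an illustration, not the notebook's code.
import random
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for tokens in sequences:  # each item is a list of tokens, e.g. text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start="PIECE_START", max_tokens=200):
    """Sample each next token in proportion to its observed bigram counts."""
    out = [start]
    while len(out) < max_tokens and out[-1] in counts:
        candidates, weights = zip(*counts[out[-1]].items())
        out.append(random.choices(candidates, weights=weights)[0])
        if out[-1] == "PIECE_END":
            break
    return " ".join(out)
```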

When testing the generations from the n-gram model, most generations sounded exactly the same, with one or two notes changing between generations. 
I'm not entirely sure why this is, but I suspect it has to do with the actual generation method call. I also had a hard time incorporating this model within 
HuggingFace Spaces. The gpt-2 model was easy to upload to the site and use with a few lines of code, and its generations are also much more diverse, 
making it more enjoyable to mess around with. Between usability and the variety of its generations, the gpt-2 model was selected for the demo.



## Limitations

The data this system is trained on does not make use of the "style" or "genre" labels. While they are included in the training examples, they are all filled with null data.
This means the system cannot create generations that are tailored to a particular style/genre of music. Also, the system only plays basic synth tones, 
meaning that we can only hear a simple "chorale" style of music, with little variation. I'd love to explore this further, and expand the system to play various instruments, 
making the generations sound more natural. Prompting options are also limited: a user cannot (easily) provide a melody or starting notes for the generation to be based on.
My idea is to create an interactive "piano" style interface so users can naturally enter some notes as a basis for the generation.
Generations are also relatively similar to one another, and I believe this is due to the limited amount of training data.