bertin-project
/

bertin-roberta-base-spanish

@@ -118,13 +118,24 @@ for split in ("random", "stepwise", "gaussian"):
 <caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
 </figure>
-We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. Then, we continued training the most promising model for 25k more on sequence length 512.
-**MENTION TWO WAYS TO CONTINUE TRAINING ON 512 AND SHOW DIFFERENCE IN PERFORMANCE, DO WE HAVE A GRAPH FOR THIS?** .
 ## Results
-Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` on 128 sequence lengths but a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation. The BSC team evaluated our early release of the model `beta` and the results can be seen in Table 1.
 Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.

 <caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
 </figure>
+We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).
+Then, we continued training the most promising model for a few more steps (~25k) with sequence length 512. We tried two strategies for this, since it is hard to find clear details about this change in the literature.
+For `Random` sampling we trained with sequence length 512 during the last 20k of the 250k training steps, keeping the optimizer state intact. The results are underwhelming, as seen in Figure 7:
+<figure>
+![](./images/random_512.jpg)
+<caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from sequence length 128 to 512.</caption>
+</figure>
+For `Gaussian` sampling we started a new optimizer after 230k steps at sequence length 128, using a short warmup interval. The results are much better (we do not have a graph since training needed to be restarted several times).
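The difference between the two strategies can be illustrated by looking only at the learning rate each one would see after the switch to sequence length 512. The following is a minimal sketch, not the training code used in these experiments: the helper names, the peak learning rates, and all step counts are hypothetical values chosen for illustration, and the reset of the optimizer's internal statistics in the second strategy is not modelled.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# All values below are illustrative assumptions, not the real hyperparameters.
PEAK_LR = 6e-4           # peak learning rate of the 128-length run
WARMUP_128 = 24_000      # warmup steps of the 128-length run
TOTAL_STEPS = 250_000    # planned length of the whole schedule
SWITCH_STEP = 230_000    # step at which sequence length changes to 512
RESTART_PEAK_LR = 1e-4   # peak learning rate after restarting the optimizer
RESTART_WARMUP = 500     # the "short warmup interval" of the restart
RESTART_STEPS = 25_000   # steps trained at sequence length 512


def lr_keep_optimizer(step):
    """Strategy 1: keep the optimizer state and continue the original schedule."""
    return linear_warmup_decay(step, PEAK_LR, WARMUP_128, TOTAL_STEPS)


def lr_new_optimizer(step):
    """Strategy 2: start a new optimizer at SWITCH_STEP with its own short warmup."""
    if step < SWITCH_STEP:
        return linear_warmup_decay(step, PEAK_LR, WARMUP_128, TOTAL_STEPS)
    local_step = step - SWITCH_STEP  # step counter restarts with the new optimizer
    return linear_warmup_decay(local_step, RESTART_PEAK_LR, RESTART_WARMUP, RESTART_STEPS)


if __name__ == "__main__":
    for offset in (0, 250, 500, 10_000):
        step = SWITCH_STEP + offset
        print(step, lr_keep_optimizer(step), lr_new_optimizer(step))
```

Under these assumed values, the first strategy enters the 512-length phase near the tail of a decaying schedule, while the second begins from a learning rate of zero and ramps up again over a few hundred steps before decaying.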
 ## Results
+Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` sampling with sequence length 128 and a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 over 96 hours. At the end of that process they were left with 2TB of clean data at the document level, which was further cleaned down to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation. The BSC team evaluated our early release of the model, `beta`, and the results can be seen in Table 1.
 Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.