Update README.md
README.md
<caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
</figure>

### Training details

We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).

Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
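
To make the schedule above concrete, here is a minimal sketch assuming an Optax-style optimizer (the project trained on the Flax/JAX stack); the peak learning rate, warmup length and weight decay are illustrative placeholders, not values reported in this README.

```python
# Sketch of the first training phase: ~250k MLM steps at sequence length 128 with
# a linear warmup followed by linear decay, as in the Liu et al. (2019) setup.
# NOTE: PEAK_LR, WARMUP_STEPS and the weight decay are illustrative assumptions.
import optax

TOTAL_STEPS = 250_000    # sequence length 128
WARMUP_STEPS = 10_000    # assumption, for illustration only
PEAK_LR = 6e-4           # assumption, for illustration only

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)
# How the optimizer was handled for the extra ~25k steps at sequence length 512
# is discussed in the Analysis section below.
```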

<figure>

<caption>
Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512).
</caption>

| Model        | POS (F1/Acc)        | NER (F1/Acc)        | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
|--------------|---------------------|---------------------|--------------|----------------|----------------|
| BERT-m       | 0.9629 / 0.9687     | 0.8539 / 0.9779     | 0.5765       | 0.7852         |                |
| BERT-wwm     | 0.9642 / 0.9700     | 0.8579 / 0.9783     | 0.8720       | **0.8186**     |                |
| BSC-BNE      | 0.9659 / 0.9707     | 0.8700 / 0.9807     | 0.5765       | 0.8178         |                |
| Beta         | 0.9638 / 0.9690     | 0.8725 / 0.9812     | 0.5765       |                | 0.3333         |
| Random       | 0.9656 / 0.9704     | 0.8704 / 0.9807     | 0.8800       | 0.7745         | 0.7795         |
| Stepwise     | 0.9656 / 0.9707     | 0.8705 / 0.9809     | 0.8825       | 0.7820         | 0.7799         |
| Gaussian     | 0.9662 / 0.9709     | **0.8792 / 0.9816** | 0.8875       | 0.7942         | 0.7843         |
| Random-512   | 0.9660 / 0.9707     | 0.8616 / 0.9803     | 0.6735       | 0.7723         | 0.7799         |
| Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965**   | 0.7878         | 0.7843         |

</figure>

In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.

Note that we did not do any intensive tuning of these fine-tuning runs (number of epochs, etc.), yet the results are already good. The PAWS-X column looks suspicious: the differences between models are unusually large and several models repeat the same baseline value, and a few other scores are repeated or nearly identical across models, which may indicate that training was too short in some cases. For reference, each XNLI-512 fine-tuning run took roughly 19 hours per model.
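
For reference, the sketch below shows the kind of fine-tuning run behind Table 3, here on Spanish XNLI with max length 256 and batch size 32 as stated in the caption, using one of the released checkpoints and standard Hugging Face tooling; the number of epochs and learning rate are illustrative, since (as noted above) we did not tune them aggressively.

```python
# Minimal XNLI (Spanish) fine-tuning sketch. Settings not given in the Table 3
# caption (epochs, learning rate) are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "bertin-project/bertin-base-gaussian"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

xnli = load_dataset("xnli", "es")  # premise / hypothesis / label (3 classes)

def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

encoded = xnli.map(preprocess, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="xnli-es-bertin",
    per_device_train_batch_size=32,   # as in the Table 3 caption for XNLI-256
    per_device_eval_batch_size=32,
    num_train_epochs=3,               # illustrative
    learning_rate=2e-5,               # illustrative
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate())
```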

## Bias and ethics

WIP

## Analysis

The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which bodes well for the downstream tasks.

Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metric for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models, ours or otherwise, could be carefully tuned to achieve better results at a given task, and that such tuning might produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian-512 is clearly superior, taking the lead in three of the four tasks analysed.

The differences in performance between models trained with the different data-sampling techniques are consistent: Gaussian sampling always comes first, while Stepwise is only marginally better than Random. This strongly suggests that the sampling technique is indeed relevant.
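
To make the comparison concrete, the sketch below gives a rough, self-contained illustration of the three strategies, assuming each `mc4-es` document comes with a precomputed perplexity; the constants are purely indicative and do not necessarily match the ones used to build the actual pre-training corpora.

```python
# Illustrative subsampling weights for the three strategies compared above,
# given a document perplexity `ppl` and corpus-level statistics.
import numpy as np

rng = np.random.default_rng(0)

def random_weight(ppl: float) -> float:
    # `Random`: perplexity is ignored; every document is equally likely to be kept.
    return 1.0

def gaussian_weight(ppl: float, mean: float, std: float, width: float = 2.0) -> float:
    # `Gaussian`: favour documents whose perplexity lies close to the corpus mean
    # (the spread factor `width` is an illustrative choice).
    return float(np.exp(-((ppl - mean) ** 2) / (2 * (width * std) ** 2)))

def stepwise_weight(ppl: float, quartiles, weights=(0.1, 0.4, 0.4, 0.1)) -> float:
    # `Stepwise`: a fixed weight per perplexity quartile (weights are illustrative).
    return weights[int(np.searchsorted(quartiles, ppl))]

def keep(ppl: float, mean: float, std: float, rate: float = 0.5) -> bool:
    # Subsample by keeping each document with probability proportional to its weight.
    return rng.random() < rate * gaussian_weight(ppl, mean, std)
```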

As already mentioned in the Training details section, the methodology used to extend the sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128 on both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact, while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap, given that close to the end of training the optimizer keeps learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic for further research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or when computational resources are scarce.
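
The sketch below, assuming a Flax/Optax training loop (the stack this project used) and with toy parameters standing in for the model weights, shows the two ways the optimizer was handled at the sequence-length switch.

```python
# Toy illustration of the two strategies for the 128 -> 512 sequence-length switch.
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros((768, 768))}        # stand-in for the real model weights
optimizer = optax.adamw(learning_rate=1e-4)  # learning rate is illustrative

opt_state = optimizer.init(params)
# ... imagine the ~250k steps at sequence length 128 updating params/opt_state ...

# Strategy A (as for Random-512): carry the accumulated optimizer state over.
opt_state_512 = opt_state

# Strategy B (as for Gaussian-512, which ended up with the better metrics):
# re-initialise the optimizer, discarding Adam's moment estimates (and, depending
# on the setup, restarting the learning-rate schedule).
opt_state_512 = optimizer.init(params)
```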

## Conclusions

With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art on a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.

The sheer size of the datasets available looked enticing while formulating the project; however, it soon proved to be an important challenge given the time constraints. This led to a debate within the team and ended up reshaping our project and goals, now focusing on analysing this problem and how we could improve the situation for smaller teams like ours in the future. The subsampling techniques analysed in this report have shown great promise in this regard, and we hope to see other groups use and improve them in the future.

On a personal level, we agree that the experience has been incredible, and we feel this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models, certainly stirring up the research community in the process. The trade-off of learning and experimenting while being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.

Given our good results, on par with those of large corporations, we hope our work will inspire and lay the groundwork for more small teams to play and experiment with language models on smaller subsets of huge datasets.

## Team members