Pablogps committed on
Commit 73f4dbe (parent: 374bf8f)

Update README.md

Files changed (1): README.md (+31 -14)
README.md CHANGED
@@ -143,7 +143,7 @@ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://a

Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.

- For `Random` sampling we trained with seq len 512 during the last 20k steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
+ For `Random` sampling we trained with seq len 512 during the last 20 steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:

<figure>

@@ -154,7 +154,7 @@ For `Random` sampling we trained with seq len 512 during the last 20k steps of t

For `Gaussian` sampling we started a new optimizer after 230 steps at sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).

- Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
+ Batch size was 256 for training with 128 sequence length, and 48 for 512 sequence length, with no change in learning rate. Warmup for 512 was 500 steps.

## Results
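
For reference, the `Gaussian` continuation described above amounts to reinitializing the optimizer with a fresh, short warmup at an unchanged peak learning rate, whereas `Random` reused the optimizer state from the 128-length run. Below is a minimal sketch of that schedule switch using optax (assuming a Flax/optax training loop like the Hugging Face Flax MLM examples); the peak learning rate, weight decay and variable names are illustrative assumptions, and only the 500 warmup steps and the ~25k continuation length come from the text above.

```python
import jax.numpy as jnp
import optax

PEAK_LR = 6e-4            # assumption: peak learning rate, unchanged from the 128 phase
WARMUP_STEPS_512 = 500    # warmup used when switching to sequence length 512
TOTAL_STEPS_512 = 25_000  # roughly the ~25k continuation steps at length 512

# Linear warmup followed by linear decay for the 512 phase.
warmup_fn = optax.linear_schedule(
    init_value=0.0, end_value=PEAK_LR, transition_steps=WARMUP_STEPS_512
)
decay_fn = optax.linear_schedule(
    init_value=PEAK_LR,
    end_value=0.0,
    transition_steps=TOTAL_STEPS_512 - WARMUP_STEPS_512,
)
schedule_512 = optax.join_schedules(
    schedules=[warmup_fn, decay_fn], boundaries=[WARMUP_STEPS_512]
)

# Fresh AdamW optimizer for the 512 phase (the weight decay value is a placeholder).
optimizer_512 = optax.adamw(learning_rate=schedule_512, weight_decay=0.01)

# `Gaussian` strategy: reinitialize the optimizer state for the new phase.
params = {"dummy": jnp.zeros((2, 2))}   # stand-in for the model parameters
opt_state_512 = optimizer_512.init(params)

# `Random` strategy instead kept the optimizer state from the 128-length run
# and simply continued training on 512-token batches.
```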
 
@@ -213,20 +213,37 @@ For simplicity, we will abbreviate the different models as follows:
<figure>

<caption>
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Results marked with * indicate a repetition.
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) is 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs.
</caption>

- | Model        | POS (F1/Acc)         | NER (F1/Acc)        | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
- |--------------|----------------------|---------------------|--------------|----------------|----------------|
- | BERT-m       | 0.9629 / 0.9687      | 0.8539 / 0.9779     | 0.5765*      | 0.7852         | 0.7606         |
- | BERT-wwm     | 0.9642 / 0.9700      | 0.8579 / 0.9783     | 0.8720*      | **0.8186**     | **0.8012**     |
- | BSC-BNE      | 0.9659 / 0.9707      | 0.8700 / 0.9807     | 0.5765*      | 0.8178         | 0.3333*        |
- | Beta         | 0.9638 / 0.9690      | 0.8725 / 0.9812     | 0.5765*      | —              | 0.7751*        |
- | Random       | 0.9656 / 0.9704      | 0.8704 / 0.9807     | 0.8800*      | 0.7745         | 0.7795         |
- | Stepwise     | 0.9656 / 0.9707      | 0.8705 / 0.9809     | 0.8825*      | 0.7820         | 0.7799         |
- | Gaussian     | 0.9662 / 0.9709      | **0.8792 / 0.9816** | 0.8875*      | 0.7942         | 0.7843         |
- | Random-512   | 0.9660 / 0.9707      | 0.8616 / 0.9803     | 0.6735*      | 0.7723         | 0.7799         |
- | Gaussian-512 | **0.9662 / 0.9714**  | **0.8764 / 0.9819** | **0.8965** * | 0.7878         | 0.7843         |
+ | Model        | POS (F1/Acc)         | NER (F1/Acc)        | XNLI-256 (Acc) |
+ |--------------|----------------------|---------------------|----------------|
+ | BERT-m       | 0.9629 / 0.9687      | 0.8539 / 0.9779     | 0.7852         |
+ | BERT-wwm     | 0.9642 / 0.9700      | 0.8579 / 0.9783     | **0.8186**     |
+ | BSC-BNE      | 0.9659 / 0.9707      | 0.8700 / 0.9807     | 0.8178         |
+ | Beta         | 0.9638 / 0.9690      | 0.8725 / 0.9812     | —              |
+ | Random       | 0.9656 / 0.9704      | 0.8704 / 0.9807     | 0.7745         |
+ | Stepwise     | 0.9656 / 0.9707      | 0.8705 / 0.9809     | 0.7820         |
+ | Gaussian     | 0.9662 / 0.9709      | **0.8792 / 0.9816** | 0.7942         |
+ | Random-512   | 0.9660 / 0.9707      | 0.8616 / 0.9803     | 0.7723         |
+ | Gaussian-512 | **0.9662 / 0.9714**  | **0.8764 / 0.9819** | 0.7878         |
+
+ </figure>
+
+ <figure>
+
+ <caption>
+ Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (length 512) is 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repetition.
+ </caption>
+
+ | Model        | POS (F1/Acc)         | NER (F1/Acc)        | PAWS-X (Acc) | XNLI (Acc) |
+ |--------------|----------------------|---------------------|--------------|------------|
+ | BERT-m       | 0.9630 / 0.9689      | 0.8616 / 0.9790     | 0.5765*      | 0.7606     |
+ | BERT-wwm     | 0.9639 / 0.9693      | 0.8596 / 0.9790     | 0.8720*      | **0.8012** |
+ | BSC-BNE      | **0.9655 / 0.9706**  | 0.8764 / 0.9818     | 0.5765*      | 0.3333*    |
+ | Beta         | 0.9616 / 0.9669      | 0.8640 / 0.9799     | 0.5765*      | 0.7751*    |
+ | Random       | 0.9651 / 0.9700      | 0.8638 / 0.9802     | 0.8800*      | 0.7795     |
+ | Stepwise     | 0.9642 / 0.9693      | 0.8726 / 0.9818     | 0.8825*      | 0.7799     |
+ | Gaussian     | 0.9644 / 0.9692      | **0.8779 / 0.9820** | 0.8875*      | 0.7843     |
+ | Random-512   | 0.9636 / 0.9690      | 0.8664 / 0.9806     | 0.6735*      | 0.7799     |
+ | Gaussian-512 | 0.9646 / 0.9697      | 0.8707 / 0.9810     | **0.8965** * | 0.7843     |

</figure>
 
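To make the fine-tuning setup in the captions above concrete, here is a minimal sketch of matching `transformers` `TrainingArguments`. This is not the project's actual evaluation script: the learning rate and output directory names are illustrative assumptions, and the reported batch sizes are treated as per-device values for simplicity.

```python
from transformers import TrainingArguments

# Table 3 setup: POS/NER on CoNLL 2002, max length 128, batch size 128, 5 epochs.
pos_ner_args = TrainingArguments(
    output_dir="finetune-pos-ner-128",
    per_device_train_batch_size=128,
    num_train_epochs=5,
    learning_rate=5e-5,  # assumption; the captions do not report a learning rate
)

# Table 3 setup: XNLI at sequence length 256, batch size 256, 2 epochs.
xnli_256_args = TrainingArguments(
    output_dir="finetune-xnli-256",
    per_device_train_batch_size=256,
    num_train_epochs=2,
    learning_rate=5e-5,  # assumption
)

# Table 4 setup: POS, NER, PAWS-X and XNLI at max length 512, batch size 128, 5 epochs.
long_seq_args = TrainingArguments(
    output_dir="finetune-512",
    per_device_train_batch_size=128,
    num_train_epochs=5,
    learning_rate=5e-5,  # assumption
)

# The maximum sequence length (128, 256 or 512) is applied at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=128).
```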