2021-07-26 00:12:35.575266: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-26 00:12:35.575304: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
[00:12:36] - INFO - filelock - Lock 139656499698272 acquired on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
Downloading: 100%|██████████| 250M/250M [00:06<00:00, 35.8MB/s]
[00:12:43] - INFO - filelock - Lock 139656499698272 released on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:386: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.
  warnings.warn(
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:373: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
  warnings.warn(
Training...: 2%|█▊ | 1000/50000 [22:19<17:30:45, 1.29s/it]
Step... (500 | Loss: 1.8920137882232666, Learning Rate: 0.0006000000284984708)
Evaluating ...: 100%|██████████| 130/130 [00:31<00:00, 4.59it/s]
[02:30:54] - INFO - __main__ - Saving checkpoint at 1000 steps
/var/hf/transformers-orig/src/transformers/modeling_flax_pytorch_utils.py:201: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
  pt_model_dict[flax_key] = torch.from_numpy(flax_tensor)
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (1000/50000 | Loss: 1.7686773538589478, Acc: 0.6487793326377869): 4%|█▏ | 2000/50000 [45:36<16:04:15, 1.21s/it]
Step... (1500 | Loss: 1.8557080030441284, Learning Rate: 0.0005878788069821894)
Evaluating ...: 100%|██████████| 130/130 [00:21<00:00, 4.59it/s]
[02:54:02] - INFO - __main__ - Saving checkpoint at 2000 steps
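Note: the two JAX UserWarnings above are fixed exactly as the messages suggest, by calling jax.process_count() and jax.process_index() instead of the deprecated jax.host_count / jax.host_id aliases. The PyTorch warning from modeling_flax_pytorch_utils.py:201 arises because torch.from_numpy shares memory with its input and therefore cannot honor the non-writeable flag on a frozen Flax parameter. A minimal sketch of the failure mode and the copy-based fix suggested by the warning itself (variable names are illustrative, not the library's internals):

import numpy as np
import torch

# Simulate a frozen Flax parameter: a NumPy array marked read-only.
flax_tensor = np.arange(6, dtype=np.float32).reshape(2, 3)
flax_tensor.flags.writeable = False

# torch.from_numpy shares the underlying buffer, which triggers the
# "given NumPy array is not writeable" warning seen in the log above.
shared = torch.from_numpy(flax_tensor)

# Copying first hands PyTorch a writeable buffer and avoids the warning,
# at the cost of one extra copy per converted parameter.
copied = torch.from_numpy(np.array(flax_tensor))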
Step... (2000/50000 | Loss: 1.778090238571167, Acc: 0.6472830772399902): 6%|█▊ | 3000/50000 [1:08:36<16:30:25, 1.26s/it]
Step... (2500 | Loss: 1.9601893424987793, Learning Rate: 0.000575757585465908)
Evaluating ...: 100%|██████████| 130/130 [00:21<00:00, 4.60it/s]
[03:16:59] - INFO - __main__ - Saving checkpoint at 3000 steps
Step... (3000/50000 | Loss: 1.7852987051010132, Acc: 0.6470173597335815): 8%|██▎ | 4000/50000 [1:31:22<16:35:30, 1.30s/it]
Step... (3500 | Loss: 1.8832361698150635, Learning Rate: 0.0005636363639496267)
Evaluating ...: 100%|██████████| 130/130 [00:21<00:00, 4.60it/s]
[03:39:47] - INFO - __main__ - Saving checkpoint at 4000 steps
Step... (4000/50000 | Loss: 1.776147484779358, Acc: 0.6480115652084351): 10%|███ | 5000/50000 [1:54:07<16:53:11, 1.35s/it]
Step... (4500 | Loss: 1.8291735649108887, Learning Rate: 0.0005515151424333453)
Evaluating ...: 100%|██████████| 130/130 [00:21<00:00, 4.60it/s]
[04:02:30] - INFO - __main__ - Saving checkpoint at 5000 steps
Step... (5000/50000 | Loss: 1.7797870635986328, Acc: 0.647495448589325): 12%|███▌ | 6000/50000 [2:17:21<17:46:48, 1.45s/it]
Step... (5500 | Loss: 1.9027880430221558, Learning Rate: 0.0005393939791247249)
Evaluating ...: 100%|██████████| 130/130 [00:21<00:00, 4.60it/s]
[04:25:46] - INFO - __main__ - Saving checkpoint at 6000 steps
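The logged learning rates fall linearly (0.0006 at step 500, 0.0005878788 at 1500, 0.0005757575 at 2500, ...), which is consistent with a linear warmup to a 6e-4 peak followed by a linear decay to zero over the 50,000 training steps. A sketch that reproduces these values with optax; warmup_steps=500 and the decay-to-zero endpoint are assumptions inferred from the numbers, not read from the actual training config:

import optax

total_steps = 50_000
warmup_steps = 500   # assumed from the log; the real config may differ
peak_lr = 6e-4

warmup = optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                               transition_steps=warmup_steps)
decay = optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps)
schedule = optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

print(schedule(500))   # ~0.000600, matching the step-500 log line
print(schedule(1500))  # ~0.000588, matching the step-1500 log line
print(schedule(2500))  # ~0.000576, matching the step-2500 log line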
Step... (6000/50000 | Loss: 1.7780379056930542, Acc: 0.6486639976501465): 14%|████ | 7000/50000 [2:40:57<15:48:42, 1.32s/it]
Evaluating ...: 0%| | 0/130 [00:00
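The "Some weights of RobertaForMaskedLM were not initialized from the Flax model" block is emitted each time a checkpoint is converted from Flax to PyTorch: 'lm_head.decoder.weight' and 'lm_head.decoder.bias' are tied to the input embeddings, and 'roberta.embeddings.position_ids' is a buffer that PyTorch regenerates on load, so no trained weights are actually missing and the generic "You should probably TRAIN this model" advice is generally safe to ignore for these MLM checkpoints. A hedged sketch of reloading one of the saved checkpoints on the PyTorch side (the path is a placeholder):

from transformers import RobertaForMaskedLM

# Placeholder path for one of the checkpoints saved above.
ckpt = "path/to/checkpoint-6000"

# Loading the Flax weights into the PyTorch class reproduces the warning;
# the tied decoder parameters and the position_ids buffer are rebuilt here.
model = RobertaForMaskedLM.from_pretrained(ckpt, from_flax=True)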