2021-07-26 00:12:35.575266: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-26 00:12:35.575304: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
[00:12:36] - INFO - filelock - Lock 139656499698272 acquired on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 250M/250M [00:06<00:00, 35.8MB/s]
[00:12:43] - INFO - filelock - Lock 139656499698272 released on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:386: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.
warnings.warn(
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:373: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
warnings.warn(
Training...: 2%|β–ˆβ–Š | 1000/50000 [22:19<17:30:45, 1.29s/it]
Step... (500 | Loss: 1.8920137882232666, Learning Rate: 0.0006000000284984708)
Training...: 2%|β–ˆβ–Š | 1000/50000 [22:21<17:30:45, 1.29s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:31<00:00, 4.59it/s]
[02:30:54] - INFO - __main__ - Saving checkpoint at 1000 steps
/var/hf/transformers-orig/src/transformers/modeling_flax_pytorch_utils.py:201: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
pt_model_dict[flax_key] = torch.from_numpy(flax_tensor)
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (1000/50000 | Loss: 1.7686773538589478, Acc: 0.6487793326377869): 4%|β–ˆβ– | 2000/50000 [45:36<16:04:15, 1.21s/it]
Step... (1500 | Loss: 1.8557080030441284, Learning Rate: 0.0005878788069821894)
Step... (1000/50000 | Loss: 1.7686773538589478, Acc: 0.6487793326377869): 4%|β–ˆβ– | 2000/50000 [45:38<16:04:15, 1.21s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[02:54:02] - INFO - __main__ - Saving checkpoint at 2000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (2000/50000 | Loss: 1.778090238571167, Acc: 0.6472830772399902): 6%|β–ˆβ–Š | 3000/50000 [1:08:36<16:30:25, 1.26s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 16.24it/s]
Step... (2500 | Loss: 1.9601893424987793, Learning Rate: 0.000575757585465908)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[03:16:59] - INFO - __main__ - Saving checkpoint at 3000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (3000/50000 | Loss: 1.7852987051010132, Acc: 0.6470173597335815): 8%|β–ˆβ–ˆβ–Ž | 4000/50000 [1:31:22<16:35:30, 1.30s/it]
Step... (3500 | Loss: 1.8832361698150635, Learning Rate: 0.0005636363639496267)
Step... (3000/50000 | Loss: 1.7852987051010132, Acc: 0.6470173597335815): 8%|β–ˆβ–ˆβ–Ž | 4000/50000 [1:31:24<16:35:30, 1.30s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[03:39:47] - INFO - __main__ - Saving checkpoint at 4000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (4000/50000 | Loss: 1.776147484779358, Acc: 0.6480115652084351): 10%|β–ˆβ–ˆβ–ˆ | 5000/50000 [1:54:07<16:53:11, 1.35s/it]
Evaluating ...: 11%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 14/130 [00:00<00:07, 15.22it/s]
Step... (4500 | Loss: 1.8291735649108887, Learning Rate: 0.0005515151424333453)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[04:02:30] - INFO - __main__ - Saving checkpoint at 5000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (5000/50000 | Loss: 1.7797870635986328, Acc: 0.647495448589325): 12%|β–ˆβ–ˆβ–ˆβ–Œ | 6000/50000 [2:17:21<17:46:48, 1.45s/it]
Step... (5500 | Loss: 1.9027880430221558, Learning Rate: 0.0005393939791247249)
Step... (5000/50000 | Loss: 1.7797870635986328, Acc: 0.647495448589325): 12%|β–ˆβ–ˆβ–ˆβ–Œ | 6000/50000 [2:17:23<17:46:48, 1.45s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[04:25:46] - INFO - __main__ - Saving checkpoint at 6000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (6000/50000 | Loss: 1.7780379056930542, Acc: 0.6486639976501465): 14%|β–ˆβ–ˆβ–ˆβ–ˆ | 7000/50000 [2:40:57<15:48:42, 1.32s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (6500 | Loss: 1.835520625114441, Learning Rate: 0.0005272727576084435)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[04:49:21] - INFO - __main__ - Saving checkpoint at 7000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (7000/50000 | Loss: 1.767648458480835, Acc: 0.6495990753173828): 16%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 8000/50000 [3:04:03<15:18:38, 1.31s/it]
Step... (7500 | Loss: 1.8483006954193115, Learning Rate: 0.0005151515360921621)
Step... (7000/50000 | Loss: 1.767648458480835, Acc: 0.6495990753173828): 16%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 8000/50000 [3:04:04<15:18:38, 1.31s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[05:12:28] - INFO - __main__ - Saving checkpoint at 8000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (8000/50000 | Loss: 1.7662373781204224, Acc: 0.6503182649612427): 18%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 9000/50000 [3:26:55<14:07:58, 1.24s/it]
Step... (8500 | Loss: 1.8929920196533203, Learning Rate: 0.0005030303145758808)
Step... (9000 | Loss: 1.841712236404419, Learning Rate: 0.0004969697329215705)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[05:35:18] - INFO - __main__ - Saving checkpoint at 9000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (9000/50000 | Loss: 1.7518370151519775, Acc: 0.6520029902458191): 20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 10000/50000 [3:50:14<14:19:14, 1.29s/it]
Step... (9500 | Loss: 1.8693788051605225, Learning Rate: 0.0004909090930595994)
Step... (9000/50000 | Loss: 1.7518370151519775, Acc: 0.6520029902458191): 20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 10000/50000 [3:50:17<14:19:14, 1.29s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[05:58:41] - INFO - __main__ - Saving checkpoint at 10000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (10000/50000 | Loss: 1.7442089319229126, Acc: 0.652866780757904): 22%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 11000/50000 [4:13:31<14:10:09, 1.31s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 15.82it/s]
Step... (10500 | Loss: 1.7761430740356445, Learning Rate: 0.00047878792975097895)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[06:21:54] - INFO - __main__ - Saving checkpoint at 11000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (11000/50000 | Loss: 1.7415039539337158, Acc: 0.6532756686210632): 24%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 12000/50000 [4:36:38<14:37:46, 1.39s/it]
Step... (11500 | Loss: 1.8508110046386719, Learning Rate: 0.0004666667082346976)
Step... (11000/50000 | Loss: 1.7415039539337158, Acc: 0.6532756686210632): 24%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 12000/50000 [4:36:40<14:37:46, 1.39s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[06:45:03] - INFO - __main__ - Saving checkpoint at 12000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (12000/50000 | Loss: 1.7264103889465332, Acc: 0.6554967761039734): 26%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 13000/50000 [5:00:28<12:23:03, 1.20s/it]
Step... (12500 | Loss: 1.8441736698150635, Learning Rate: 0.00045454545761458576)
Step... (12000/50000 | Loss: 1.7264103889465332, Acc: 0.6554967761039734): 26%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 13000/50000 [5:00:30<12:23:03, 1.20s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[07:08:53] - INFO - __main__ - Saving checkpoint at 13000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (13000/50000 | Loss: 1.725870966911316, Acc: 0.6557744741439819): 28%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 14000/50000 [5:24:30<15:07:56, 1.51s/it]
Step... (13500 | Loss: 1.8221518993377686, Learning Rate: 0.0004424242360983044)
Step... (14000 | Loss: 1.7394559383392334, Learning Rate: 0.0004363636835478246)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[07:32:53] - INFO - __main__ - Saving checkpoint at 14000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (14000/50000 | Loss: 1.7139594554901123, Acc: 0.6574689745903015): 30%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 15000/50000 [5:48:39<11:57:48, 1.23s/it]
Evaluating ...: 2%|β–ˆβ– | 2/130 [00:00<00:08, 14.51it/s]
Step... (14500 | Loss: 1.901540994644165, Learning Rate: 0.0004303030436858535)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[07:57:02] - INFO - __main__ - Saving checkpoint at 15000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (15000/50000 | Loss: 1.709453821182251, Acc: 0.6586650609970093): 32%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 16000/50000 [6:13:24<13:18:57, 1.41s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (15500 | Loss: 1.768535852432251, Learning Rate: 0.0004181818221695721)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[08:21:47] - INFO - __main__ - Saving checkpoint at 16000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (16000/50000 | Loss: 1.6991859674453735, Acc: 0.659552812576294): 34%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 17000/50000 [6:37:47<11:32:41, 1.26s/it]
Step... (16500 | Loss: 1.7835588455200195, Learning Rate: 0.0004060606297571212)
Step... (17000 | Loss: 1.692732572555542, Learning Rate: 0.00039999998989515007)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[08:46:11] - INFO - __main__ - Saving checkpoint at 17000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (17000/50000 | Loss: 1.6971577405929565, Acc: 0.6604305505752563): 36%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 18000/50000 [7:02:23<12:10:56, 1.37s/it]
Step... (17500 | Loss: 1.9012951850891113, Learning Rate: 0.00039393940824083984)
Step... (17000/50000 | Loss: 1.6971577405929565, Acc: 0.6604305505752563): 36%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 18000/50000 [7:02:27<12:10:56, 1.37s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:10:50] - INFO - __main__ - Saving checkpoint at 18000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (18000/50000 | Loss: 1.6918002367019653, Acc: 0.6613297462463379): 38%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 19000/50000 [7:26:48<11:49:37, 1.37s/it]
Evaluating ...: 3%|β–ˆβ–ˆβ–‰ | 4/130 [00:00<00:08, 15.60it/s]
Step... (18500 | Loss: 1.7828737497329712, Learning Rate: 0.00038181818672455847)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:35:12] - INFO - __main__ - Saving checkpoint at 19000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (19000/50000 | Loss: 1.6823453903198242, Acc: 0.6625654101371765): 40%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 20000/50000 [7:51:27<11:12:26, 1.34s/it]
Step... (19500 | Loss: 1.7442021369934082, Learning Rate: 0.0003696969652082771)
Step... (20000 | Loss: 1.6871428489685059, Learning Rate: 0.0003636363835539669)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:59:50] - INFO - __main__ - Saving checkpoint at 20000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (20000/50000 | Loss: 1.6746032238006592, Acc: 0.6636187434196472): 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 21000/50000 [8:16:08<11:06:45, 1.38s/it]
Step... (20500 | Loss: 1.8593541383743286, Learning Rate: 0.0003575757727958262)
Step... (20000/50000 | Loss: 1.6746032238006592, Acc: 0.6636187434196472): 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 21000/50000 [8:16:10<11:06:45, 1.38s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[10:24:33] - INFO - __main__ - Saving checkpoint at 21000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (21000/50000 | Loss: 1.669716238975525, Acc: 0.6647850275039673): 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 22000/50000 [8:40:53<13:14:50, 1.70s/it]
Step... (21500 | Loss: 1.764472484588623, Learning Rate: 0.00034545455127954483)
Step... (21000/50000 | Loss: 1.669716238975525, Acc: 0.6647850275039673): 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 22000/50000 [8:40:55<13:14:50, 1.70s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[10:49:19] - INFO - __main__ - Saving checkpoint at 22000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (22000/50000 | Loss: 1.6613430976867676, Acc: 0.6655245423316956): 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 23000/50000 [9:05:24<10:18:31, 1.37s/it]
Evaluating ...: 3%|β–ˆβ–ˆβ–‰ | 4/130 [00:00<00:08, 14.65it/s]
Step... (22500 | Loss: 1.9999163150787354, Learning Rate: 0.0003333333588670939)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[11:13:47] - INFO - __main__ - Saving checkpoint at 23000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (23000/50000 | Loss: 1.6572293043136597, Acc: 0.6663545966148376): 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 24000/50000 [9:30:12<11:34:04, 1.60s/it]
Step... (23500 | Loss: 1.7666906118392944, Learning Rate: 0.00032121213735081255)
Step... (24000 | Loss: 1.657638430595398, Learning Rate: 0.00031515152659267187)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[11:38:36] - INFO - __main__ - Saving checkpoint at 24000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (24000/50000 | Loss: 1.6508632898330688, Acc: 0.6671841740608215): 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 25000/50000 [9:54:56<11:05:00, 1.60s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 15.90it/s]
Step... (24500 | Loss: 1.7519614696502686, Learning Rate: 0.0003090909158345312)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:03:19] - INFO - __main__ - Saving checkpoint at 25000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (25000/50000 | Loss: 1.6436606645584106, Acc: 0.668701171875): 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 26000/50000 [10:20:04<8:52:22, 1.33s/it]
Step... (25500 | Loss: 1.6520822048187256, Learning Rate: 0.0002969697234220803)
Step... (26000 | Loss: 1.7167686223983765, Learning Rate: 0.0002909091126639396)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:28:28] - INFO - __main__ - Saving checkpoint at 26000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (26000/50000 | Loss: 1.6362030506134033, Acc: 0.6691190600395203): 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 27000/50000 [10:45:19<10:31:50, 1.65s/it]
Step... (26500 | Loss: 1.707963228225708, Learning Rate: 0.0002848485019057989)
Step... (27000 | Loss: 1.7799105644226074, Learning Rate: 0.00027878789114765823)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:53:43] - INFO - __main__ - Saving checkpoint at 27000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (27000/50000 | Loss: 1.6304749250411987, Acc: 0.670651376247406): 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 28000/50000 [11:10:01<8:36:13, 1.41s/it]
Step... (27500 | Loss: 1.8015278577804565, Learning Rate: 0.00027272728038951755)
Step... (27000/50000 | Loss: 1.6304749250411987, Acc: 0.670651376247406): 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 28000/50000 [11:10:04<8:36:13, 1.41s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[13:18:28] - INFO - __main__ - Saving checkpoint at 28000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (28000/50000 | Loss: 1.627186894416809, Acc: 0.671392560005188): 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 29000/50000 [11:35:03<8:36:33, 1.48s/it]
Step... (28500 | Loss: 1.738811731338501, Learning Rate: 0.00026060608797706664)
Step... (29000 | Loss: 1.5798612833023071, Learning Rate: 0.00025454547721892595)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[13:43:27] - INFO - __main__ - Saving checkpoint at 29000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (29000/50000 | Loss: 1.6177186965942383, Acc: 0.67269366979599): 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 30000/50000 [12:00:13<8:31:46, 1.54s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (29500 | Loss: 1.6591482162475586, Learning Rate: 0.00024848486646078527)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[14:08:36] - INFO - __main__ - Saving checkpoint at 30000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (30000/50000 | Loss: 1.6142958402633667, Acc: 0.6730945110321045): 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 31000/50000 [12:25:40<6:58:17, 1.32s/it]
Evaluating ...: 14%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 18/130 [00:01<00:07, 15.90it/s]
Step... (30500 | Loss: 1.712971806526184, Learning Rate: 0.00023636363039258868)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[14:34:03] - INFO - __main__ - Saving checkpoint at 31000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.