2021-07-26 00:12:35.575266: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-26 00:12:35.575304: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
[00:12:36] - INFO - filelock - Lock 139656499698272 acquired on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 250M/250M [00:06<00:00, 35.8MB/s]
[00:12:43] - INFO - filelock - Lock 139656499698272 released on /home/versae/.cache/huggingface/transformers/27b7e968d2908b27f8c1df265c2dc08aef61be0f25bdc735df4df552829968fd.04a8293889c44bb7f31a5ee6212b8aa0b690121444e9c7ce1616fbe2a461ebba.lock
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:386: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.
warnings.warn(
/var/hf/venv/lib/python3.8/site-packages/jax/lib/xla_bridge.py:373: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
warnings.warn(
Training...: 2%|β–ˆβ–Š | 1000/50000 [22:19<17:30:45, 1.29s/it]
Step... (500 | Loss: 1.8920137882232666, Learning Rate: 0.0006000000284984708)
Training...: 2%|β–ˆβ–Š | 1000/50000 [22:21<17:30:45, 1.29s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:31<00:00, 4.59it/s]
[02:30:54] - INFO - __main__ - Saving checkpoint at 1000 steps
/var/hf/transformers-orig/src/transformers/modeling_flax_pytorch_utils.py:201: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
pt_model_dict[flax_key] = torch.from_numpy(flax_tensor)
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (1000/50000 | Loss: 1.7686773538589478, Acc: 0.6487793326377869): 4%|β–ˆβ– | 2000/50000 [45:36<16:04:15, 1.21s/it]
Step... (1500 | Loss: 1.8557080030441284, Learning Rate: 0.0005878788069821894)
Step... (1000/50000 | Loss: 1.7686773538589478, Acc: 0.6487793326377869): 4%|β–ˆβ– | 2000/50000 [45:38<16:04:15, 1.21s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[02:54:02] - INFO - __main__ - Saving checkpoint at 2000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (2000/50000 | Loss: 1.778090238571167, Acc: 0.6472830772399902): 6%|β–ˆβ–Š | 3000/50000 [1:08:36<16:30:25, 1.26s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 16.24it/s]
Step... (2500 | Loss: 1.9601893424987793, Learning Rate: 0.000575757585465908)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[03:16:59] - INFO - __main__ - Saving checkpoint at 3000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (3000/50000 | Loss: 1.7852987051010132, Acc: 0.6470173597335815): 8%|β–ˆβ–ˆβ–Ž | 4000/50000 [1:31:22<16:35:30, 1.30s/it]
Step... (3500 | Loss: 1.8832361698150635, Learning Rate: 0.0005636363639496267)
Step... (3000/50000 | Loss: 1.7852987051010132, Acc: 0.6470173597335815): 8%|β–ˆβ–ˆβ–Ž | 4000/50000 [1:31:24<16:35:30, 1.30s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[03:39:47] - INFO - __main__ - Saving checkpoint at 4000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (4000/50000 | Loss: 1.776147484779358, Acc: 0.6480115652084351): 10%|β–ˆβ–ˆβ–ˆ | 5000/50000 [1:54:07<16:53:11, 1.35s/it]
Evaluating ...: 11%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 14/130 [00:00<00:07, 15.22it/s]
Step... (4500 | Loss: 1.8291735649108887, Learning Rate: 0.0005515151424333453)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[04:02:30] - INFO - __main__ - Saving checkpoint at 5000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (5000/50000 | Loss: 1.7797870635986328, Acc: 0.647495448589325): 12%|β–ˆβ–ˆβ–ˆβ–Œ | 6000/50000 [2:17:21<17:46:48, 1.45s/it]
Step... (5500 | Loss: 1.9027880430221558, Learning Rate: 0.0005393939791247249)
Step... (5000/50000 | Loss: 1.7797870635986328, Acc: 0.647495448589325): 12%|β–ˆβ–ˆβ–ˆβ–Œ | 6000/50000 [2:17:23<17:46:48, 1.45s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[04:25:46] - INFO - __main__ - Saving checkpoint at 6000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (6000/50000 | Loss: 1.7780379056930542, Acc: 0.6486639976501465): 14%|β–ˆβ–ˆβ–ˆβ–ˆ | 7000/50000 [2:40:57<15:48:42, 1.32s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (6500 | Loss: 1.835520625114441, Learning Rate: 0.0005272727576084435)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[04:49:21] - INFO - __main__ - Saving checkpoint at 7000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (7000/50000 | Loss: 1.767648458480835, Acc: 0.6495990753173828): 16%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 8000/50000 [3:04:03<15:18:38, 1.31s/it]
Step... (7500 | Loss: 1.8483006954193115, Learning Rate: 0.0005151515360921621)
Step... (7000/50000 | Loss: 1.767648458480835, Acc: 0.6495990753173828): 16%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 8000/50000 [3:04:04<15:18:38, 1.31s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[05:12:28] - INFO - __main__ - Saving checkpoint at 8000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (8000/50000 | Loss: 1.7662373781204224, Acc: 0.6503182649612427): 18%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 9000/50000 [3:26:55<14:07:58, 1.24s/it]
Step... (8500 | Loss: 1.8929920196533203, Learning Rate: 0.0005030303145758808)
Step... (9000 | Loss: 1.841712236404419, Learning Rate: 0.0004969697329215705)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[05:35:18] - INFO - __main__ - Saving checkpoint at 9000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (9000/50000 | Loss: 1.7518370151519775, Acc: 0.6520029902458191): 20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 10000/50000 [3:50:14<14:19:14, 1.29s/it]
Step... (9500 | Loss: 1.8693788051605225, Learning Rate: 0.0004909090930595994)
Step... (9000/50000 | Loss: 1.7518370151519775, Acc: 0.6520029902458191): 20%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 10000/50000 [3:50:17<14:19:14, 1.29s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[05:58:41] - INFO - __main__ - Saving checkpoint at 10000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (10000/50000 | Loss: 1.7442089319229126, Acc: 0.652866780757904): 22%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 11000/50000 [4:13:31<14:10:09, 1.31s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 15.82it/s]
Step... (10500 | Loss: 1.7761430740356445, Learning Rate: 0.00047878792975097895)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[06:21:54] - INFO - __main__ - Saving checkpoint at 11000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (11000/50000 | Loss: 1.7415039539337158, Acc: 0.6532756686210632): 24%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 12000/50000 [4:36:38<14:37:46, 1.39s/it]
Step... (11500 | Loss: 1.8508110046386719, Learning Rate: 0.0004666667082346976)
Step... (11000/50000 | Loss: 1.7415039539337158, Acc: 0.6532756686210632): 24%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 12000/50000 [4:36:40<14:37:46, 1.39s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[06:45:03] - INFO - __main__ - Saving checkpoint at 12000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (12000/50000 | Loss: 1.7264103889465332, Acc: 0.6554967761039734): 26%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 13000/50000 [5:00:28<12:23:03, 1.20s/it]
Step... (12500 | Loss: 1.8441736698150635, Learning Rate: 0.00045454545761458576)
Step... (12000/50000 | Loss: 1.7264103889465332, Acc: 0.6554967761039734): 26%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 13000/50000 [5:00:30<12:23:03, 1.20s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[07:08:53] - INFO - __main__ - Saving checkpoint at 13000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (13000/50000 | Loss: 1.725870966911316, Acc: 0.6557744741439819): 28%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 14000/50000 [5:24:30<15:07:56, 1.51s/it]
Step... (13500 | Loss: 1.8221518993377686, Learning Rate: 0.0004424242360983044)
Step... (14000 | Loss: 1.7394559383392334, Learning Rate: 0.0004363636835478246)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[07:32:53] - INFO - __main__ - Saving checkpoint at 14000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (14000/50000 | Loss: 1.7139594554901123, Acc: 0.6574689745903015): 30%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 15000/50000 [5:48:39<11:57:48, 1.23s/it]
Evaluating ...: 2%|β–ˆβ– | 2/130 [00:00<00:08, 14.51it/s]
Step... (14500 | Loss: 1.901540994644165, Learning Rate: 0.0004303030436858535)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[07:57:02] - INFO - __main__ - Saving checkpoint at 15000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (15000/50000 | Loss: 1.709453821182251, Acc: 0.6586650609970093): 32%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 16000/50000 [6:13:24<13:18:57, 1.41s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (15500 | Loss: 1.768535852432251, Learning Rate: 0.0004181818221695721)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[08:21:47] - INFO - __main__ - Saving checkpoint at 16000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (16000/50000 | Loss: 1.6991859674453735, Acc: 0.659552812576294): 34%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 17000/50000 [6:37:47<11:32:41, 1.26s/it]
Step... (16500 | Loss: 1.7835588455200195, Learning Rate: 0.0004060606297571212)
Step... (17000 | Loss: 1.692732572555542, Learning Rate: 0.00039999998989515007)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[08:46:11] - INFO - __main__ - Saving checkpoint at 17000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (17000/50000 | Loss: 1.6971577405929565, Acc: 0.6604305505752563): 36%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 18000/50000 [7:02:23<12:10:56, 1.37s/it]
Step... (17500 | Loss: 1.9012951850891113, Learning Rate: 0.00039393940824083984)
Step... (17000/50000 | Loss: 1.6971577405929565, Acc: 0.6604305505752563): 36%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 18000/50000 [7:02:27<12:10:56, 1.37s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:10:50] - INFO - __main__ - Saving checkpoint at 18000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (18000/50000 | Loss: 1.6918002367019653, Acc: 0.6613297462463379): 38%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 19000/50000 [7:26:48<11:49:37, 1.37s/it]
Evaluating ...: 3%|β–ˆβ–ˆβ–‰ | 4/130 [00:00<00:08, 15.60it/s]
Step... (18500 | Loss: 1.7828737497329712, Learning Rate: 0.00038181818672455847)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:35:12] - INFO - __main__ - Saving checkpoint at 19000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (19000/50000 | Loss: 1.6823453903198242, Acc: 0.6625654101371765): 40%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 20000/50000 [7:51:27<11:12:26, 1.34s/it]
Step... (19500 | Loss: 1.7442021369934082, Learning Rate: 0.0003696969652082771)
Step... (20000 | Loss: 1.6871428489685059, Learning Rate: 0.0003636363835539669)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[09:59:50] - INFO - __main__ - Saving checkpoint at 20000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (20000/50000 | Loss: 1.6746032238006592, Acc: 0.6636187434196472): 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 21000/50000 [8:16:08<11:06:45, 1.38s/it]
Step... (20500 | Loss: 1.8593541383743286, Learning Rate: 0.0003575757727958262)
Step... (20000/50000 | Loss: 1.6746032238006592, Acc: 0.6636187434196472): 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 21000/50000 [8:16:10<11:06:45, 1.38s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[10:24:33] - INFO - __main__ - Saving checkpoint at 21000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (21000/50000 | Loss: 1.669716238975525, Acc: 0.6647850275039673): 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 22000/50000 [8:40:53<13:14:50, 1.70s/it]
Step... (21500 | Loss: 1.764472484588623, Learning Rate: 0.00034545455127954483)
Step... (21000/50000 | Loss: 1.669716238975525, Acc: 0.6647850275039673): 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 22000/50000 [8:40:55<13:14:50, 1.70s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[10:49:19] - INFO - __main__ - Saving checkpoint at 22000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (22000/50000 | Loss: 1.6613430976867676, Acc: 0.6655245423316956): 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 23000/50000 [9:05:24<10:18:31, 1.37s/it]
Evaluating ...: 3%|β–ˆβ–ˆβ–‰ | 4/130 [00:00<00:08, 14.65it/s]
Step... (22500 | Loss: 1.9999163150787354, Learning Rate: 0.0003333333588670939)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[11:13:47] - INFO - __main__ - Saving checkpoint at 23000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (23000/50000 | Loss: 1.6572293043136597, Acc: 0.6663545966148376): 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 24000/50000 [9:30:12<11:34:04, 1.60s/it]
Step... (23500 | Loss: 1.7666906118392944, Learning Rate: 0.00032121213735081255)
Step... (24000 | Loss: 1.657638430595398, Learning Rate: 0.00031515152659267187)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[11:38:36] - INFO - __main__ - Saving checkpoint at 24000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (24000/50000 | Loss: 1.6508632898330688, Acc: 0.6671841740608215): 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 25000/50000 [9:54:56<11:05:00, 1.60s/it]
Evaluating ...: 5%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 6/130 [00:00<00:07, 15.90it/s]
Step... (24500 | Loss: 1.7519614696502686, Learning Rate: 0.0003090909158345312)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:03:19] - INFO - __main__ - Saving checkpoint at 25000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (25000/50000 | Loss: 1.6436606645584106, Acc: 0.668701171875): 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 26000/50000 [10:20:04<8:52:22, 1.33s/it]
Step... (25500 | Loss: 1.6520822048187256, Learning Rate: 0.0002969697234220803)
Step... (26000 | Loss: 1.7167686223983765, Learning Rate: 0.0002909091126639396)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:28:28] - INFO - __main__ - Saving checkpoint at 26000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (26000/50000 | Loss: 1.6362030506134033, Acc: 0.6691190600395203): 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 27000/50000 [10:45:19<10:31:50, 1.65s/it]
Step... (26500 | Loss: 1.707963228225708, Learning Rate: 0.0002848485019057989)
Step... (27000 | Loss: 1.7799105644226074, Learning Rate: 0.00027878789114765823)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[12:53:43] - INFO - __main__ - Saving checkpoint at 27000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (27000/50000 | Loss: 1.6304749250411987, Acc: 0.670651376247406): 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 28000/50000 [11:10:01<8:36:13, 1.41s/it]
Step... (27500 | Loss: 1.8015278577804565, Learning Rate: 0.00027272728038951755)
Step... (27000/50000 | Loss: 1.6304749250411987, Acc: 0.670651376247406): 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 28000/50000 [11:10:04<8:36:13, 1.41s/it]
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[13:18:28] - INFO - __main__ - Saving checkpoint at 28000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (28000/50000 | Loss: 1.627186894416809, Acc: 0.671392560005188): 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 29000/50000 [11:35:03<8:36:33, 1.48s/it]
Step... (28500 | Loss: 1.738811731338501, Learning Rate: 0.00026060608797706664)
Step... (29000 | Loss: 1.5798612833023071, Learning Rate: 0.00025454547721892595)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[13:43:27] - INFO - __main__ - Saving checkpoint at 29000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (29000/50000 | Loss: 1.6177186965942383, Acc: 0.67269366979599): 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 30000/50000 [12:00:13<8:31:46, 1.54s/it]
Evaluating ...: 0%| | 0/130 [00:00<?, ?it/s]
Step... (29500 | Loss: 1.6591482162475586, Learning Rate: 0.00024848486646078527)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.59it/s]
[14:08:36] - INFO - __main__ - Saving checkpoint at 30000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Step... (30000/50000 | Loss: 1.6142958402633667, Acc: 0.6730945110321045): 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 31000/50000 [12:25:40<6:58:17, 1.32s/it]
Evaluating ...: 14%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 18/130 [00:01<00:07, 15.90it/s]
Step... (30500 | Loss: 1.712971806526184, Learning Rate: 0.00023636363039258868)
Evaluating ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 130/130 [00:21<00:00, 4.60it/s]
[14:34:03] - INFO - __main__ - Saving checkpoint at 31000 steps
All Flax model weights were used when initializing RobertaForMaskedLM.
Some weights of RobertaForMaskedLM were not initialized from the Flax model and are newly initialized: ['lm_head.decoder.weight', 'roberta.embeddings.position_ids', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.