# gpt1B_DPO_model2
This model is a fine-tuned version of [AI-Sweden-Models/gpt-sw3-1.3b](https://huggingface.co/AI-Sweden-Models/gpt-sw3-1.3b) on an unknown dataset. It achieves the following results on the evaluation set:

- Loss: 0.0068
- Rewards/chosen: -0.0291
- Rewards/rejected: -6.8852
- Rewards/accuracies: 1.0
- Rewards/margins: 6.8561
- Logps/rejected: -290.5968
- Logps/chosen: -127.3574
- Logits/rejected: -2.7556
- Logits/chosen: -2.9748
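For context, the reward and margin figures above follow the standard DPO definitions: each reward is β times the log-probability ratio between the policy and the frozen reference model, and the margin is the chosen reward minus the rejected reward. A minimal sketch of those relationships (the `beta` value and function name here are illustrative, not reported for this run):

```python
import math

def dpo_stats(logp_chosen, ref_logp_chosen, logp_rejected, ref_logp_rejected, beta=0.1):
    """Compute DPO rewards, margin, and loss from sequence log-probabilities.

    NOTE: beta=0.1 is only an illustrative default; the value used for this
    training run is not reported in the card.
    """
    # Implicit rewards: beta-scaled log-ratio of policy vs. reference model
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # DPO loss: -log(sigmoid(margin)); approaches 0 as the margin grows
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return reward_chosen, reward_rejected, margin, loss

# Sanity check against the final evaluation row of this card:
# Rewards/margins = Rewards/chosen - Rewards/rejected
margin = -0.0291 - (-6.8852)  # 6.8561, matching Rewards/margins above
```

A large positive margin together with a near-zero loss, as in the final evaluation here, indicates the policy almost always assigns higher relative likelihood to the chosen responses.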
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
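The total train batch size above is the per-device batch size multiplied by the gradient-accumulation steps, and the linear scheduler decays the learning rate from 1e-05 toward zero over training. A small sketch of both relationships (the `total_steps` value is illustrative; the results table below shows roughly 500 optimizer steps for this run):

```python
# Effective batch size: per-device batch * gradient accumulation (1 device here)
train_batch_size = 1
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps  # 8

def linear_lr(step, total_steps, base_lr=1e-05, warmup_steps=0):
    """Linear schedule: optional warmup, then linear decay to zero.

    Mirrors the shape of transformers' 'linear' scheduler; total_steps
    is an assumed parameter, not a value reported in this card.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```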
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
0.3903 | 0.1 | 25 | 0.2328 | 0.1244 | -1.3373 | 0.9933 | 1.4618 | -235.1181 | -125.8223 | -3.0887 | -3.2517 |
0.0561 | 0.2 | 50 | 0.0585 | 0.0159 | -3.4934 | 0.9933 | 3.5094 | -256.6789 | -126.9073 | -2.9094 | -3.1004 |
0.0267 | 0.3 | 75 | 0.0268 | -0.0626 | -4.9264 | 0.9967 | 4.8637 | -271.0085 | -127.6931 | -2.8143 | -3.0209 |
0.0141 | 0.4 | 100 | 0.0175 | -0.0535 | -5.4979 | 0.9967 | 5.4444 | -276.7235 | -127.6012 | -2.7755 | -2.9884 |
0.0105 | 0.5 | 125 | 0.0133 | -0.0686 | -5.9461 | 0.9967 | 5.8775 | -281.2056 | -127.7524 | -2.7592 | -2.9752 |
0.0093 | 0.6 | 150 | 0.0113 | -0.0582 | -6.1989 | 0.9967 | 6.1407 | -283.7333 | -127.6482 | -2.7644 | -2.9810 |
0.007 | 0.7 | 175 | 0.0097 | -0.0175 | -6.2570 | 1.0 | 6.2396 | -284.3148 | -127.2412 | -2.7683 | -2.9851 |
0.0085 | 0.79 | 200 | 0.0083 | 0.0050 | -6.4220 | 1.0 | 6.4270 | -285.9642 | -127.0162 | -2.7708 | -2.9884 |
0.0049 | 0.89 | 225 | 0.0079 | -0.0124 | -6.5942 | 1.0 | 6.5818 | -287.6865 | -127.1910 | -2.7644 | -2.9830 |
0.004 | 0.99 | 250 | 0.0076 | -0.0282 | -6.7093 | 1.0 | 6.6811 | -288.8376 | -127.3483 | -2.7587 | -2.9779 |
0.0028 | 1.09 | 275 | 0.0072 | -0.0372 | -6.7997 | 1.0 | 6.7625 | -289.7418 | -127.4389 | -2.7571 | -2.9763 |
0.005 | 1.19 | 300 | 0.0070 | -0.0326 | -6.8348 | 1.0 | 6.8022 | -290.0928 | -127.3927 | -2.7560 | -2.9754 |
0.0038 | 1.29 | 325 | 0.0069 | -0.0346 | -6.8482 | 1.0 | 6.8137 | -290.2270 | -127.4126 | -2.7557 | -2.9749 |
0.004 | 1.39 | 350 | 0.0069 | -0.0326 | -6.8612 | 1.0 | 6.8285 | -290.3561 | -127.3931 | -2.7556 | -2.9747 |
0.0032 | 1.49 | 375 | 0.0069 | -0.0328 | -6.8697 | 1.0 | 6.8370 | -290.4420 | -127.3942 | -2.7557 | -2.9750 |
0.0028 | 1.59 | 400 | 0.0069 | -0.0322 | -6.8743 | 1.0 | 6.8422 | -290.4877 | -127.3882 | -2.7558 | -2.9751 |
0.004 | 1.69 | 425 | 0.0067 | -0.0293 | -6.8746 | 1.0 | 6.8453 | -290.4905 | -127.3596 | -2.7557 | -2.9750 |
0.003 | 1.79 | 450 | 0.0067 | -0.0296 | -6.8840 | 1.0 | 6.8544 | -290.5845 | -127.3624 | -2.7553 | -2.9746 |
0.0028 | 1.89 | 475 | 0.0068 | -0.0285 | -6.8839 | 1.0 | 6.8554 | -290.5839 | -127.3521 | -2.7555 | -2.9748 |
0.0028 | 1.99 | 500 | 0.0068 | -0.0291 | -6.8852 | 1.0 | 6.8561 | -290.5968 | -127.3574 | -2.7556 | -2.9748 |
### Framework versions
- PEFT 0.8.2
- Transformers 4.38.1
- Pytorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.2