dpo

This model is a fine-tuned version of microsoft/phi-1_5. The training dataset is not specified (the auto-generated card records it as None). It achieves the following results on the evaluation set:

Loss: 0.0000
Rewards/chosen: -8.4849
Rewards/rejected: -25.9483
Rewards/accuracies: 1.0
Rewards/margins: 17.4633
Logps/rejected: -293.3352
Logps/chosen: -152.1862
Logits/rejected: -0.9014
Logits/chosen: -0.4994
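
The reward figures above can be checked against each other. The sketch below assumes the standard sigmoid DPO formulation, where the margin is the chosen reward minus the rejected reward and the per-pair loss is -log(sigmoid(margin)). The card does not say which loss variant was used, so this is an illustration rather than a description of the actual training code; the function name dpo_loss_from_margin is hypothetical.

```python
import math

# Values copied from the evaluation summary above.
rewards_chosen = -8.4849
rewards_rejected = -25.9483
reported_margin = 17.4633

# The margin is the gap between the chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected


def dpo_loss_from_margin(m: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(m)) = log(1 + exp(-m))."""
    return math.log1p(math.exp(-m))


# Assumption: the reported loss is a mean over pairs. Because the loss is
# convex in the margin, its value at the mean margin is only a lower bound
# on that mean, not the mean itself.
loss_at_mean_margin = dpo_loss_from_margin(margin)

print(f"margin = {margin:.4f} (reported {reported_margin})")
print(f"loss at mean margin = {loss_at_mean_margin:.2e}")
```

The recomputed margin agrees with the reported one up to rounding in the last digit, and a margin of this size puts the loss far below the four decimal places shown, which fits the reported 0.0000. Under this formulation a negative chosen reward would mean the chosen completions are less likely under the fine-tuned model than under the reference model, so only the gap between chosen and rejected has grown.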

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0005
train_batch_size: 4
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
training_steps: 2500
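
To make the batch size and schedule settings concrete, here is a minimal sketch of how they combine. It assumes a single training device (so the total batch size is the per-device batch size times the accumulation steps) and a linear warmup followed by a half-cosine decay to zero, which is the usual shape of a cosine schedule with warmup. The helper lr_at is hypothetical and is not taken from the training code.

```python
import math

# Hyperparameters copied from the list above.
learning_rate = 0.0005
train_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 100
training_steps = 2500

# Assumption: one device, so no extra multiplier for data parallelism.
total_train_batch_size = train_batch_size * gradient_accumulation_steps


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return learning_rate * step / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / (training_steps - warmup_steps)
    # Half-cosine from the peak learning rate down to 0.
    return learning_rate * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print("total_train_batch_size:", total_train_batch_size)
for step in (0, 50, 100, 1300, 2500):
    print(step, lr_at(step))
```

Under these assumptions the learning rate peaks at 0.0005 at step 100, falls to half of that at step 1300, and reaches zero at step 2500.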

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0318	0.07	100	0.0384	-0.3956	-7.7708	0.9835	7.3753	-111.5607	-71.2923	1.1941	1.0925
0.0187	0.15	200	0.0196	-2.0328	-10.9862	0.9922	8.9535	-143.7145	-87.6645	-0.8539	-0.9067
0.0101	0.22	300	0.0351	-2.7345	-12.1219	0.9896	9.3874	-155.0717	-94.6821	0.4420	0.5220
0.046	0.29	400	0.0199	-6.6027	-18.5556	0.9922	11.9529	-219.4086	-133.3638	-2.3908	-2.0500
0.0005	0.36	500	0.0101	-6.4299	-20.5496	0.9965	14.1197	-239.3484	-131.6356	-1.0029	-0.6334
0.0003	0.44	600	0.0092	-9.0181	-23.0513	0.9965	14.0332	-264.3652	-157.5181	-1.6334	-1.1488
0.0004	0.51	700	0.0043	-5.7377	-21.3127	0.9991	15.5749	-246.9788	-124.7142	-0.8477	-0.4037
0.0001	0.58	800	0.0040	-8.9021	-23.9436	0.9991	15.0415	-273.2885	-156.3581	0.2782	0.8244
0.0001	0.66	900	0.0031	-9.3191	-24.3563	0.9991	15.0371	-277.4149	-160.5282	-0.7279	-0.2168
0.002	0.73	1000	0.0066	-6.8680	-23.5822	0.9974	16.7142	-269.6745	-136.0172	-0.6629	0.2962
0.0002	0.8	1100	0.0015	-9.1417	-27.6276	0.9991	18.4859	-310.1280	-158.7536	-1.2030	-0.5215
0.0823	0.87	1200	0.0057	-4.4568	-18.4378	0.9974	13.9810	-218.2306	-111.9051	0.2236	0.7934
0.0	0.95	1300	0.0171	-8.1530	-25.5603	0.9983	17.4073	-289.4550	-148.8665	-1.2413	-0.9611
0.0007	1.02	1400	0.0019	-7.9402	-25.1905	0.9983	17.2503	-285.7569	-146.7384	-1.2325	-0.8924
0.0002	1.09	1500	0.0010	-8.1543	-25.2960	0.9991	17.1417	-286.8122	-148.8794	-1.0005	-0.6261
0.0	1.17	1600	0.0010	-8.4019	-25.6275	0.9991	17.2256	-290.1275	-151.3556	-1.0850	-0.7170
0.0	1.24	1700	0.0011	-8.8691	-26.2284	0.9991	17.3593	-296.1366	-156.0278	-1.1426	-0.7830
0.0	1.31	1800	0.0010	-9.2896	-26.9277	0.9991	17.6381	-303.1297	-160.2331	-1.1169	-0.7512
0.0001	1.39	1900	0.0011	-9.2869	-26.9301	0.9991	17.6432	-303.1532	-160.2053	-1.1213	-0.7560
0.0	1.46	2000	0.0008	-8.4453	-25.9094	0.9991	17.4641	-292.9459	-151.7894	-0.8854	-0.4791
0.0	1.53	2100	0.0007	-8.4600	-25.9284	0.9991	17.4684	-293.1361	-151.9364	-0.8893	-0.4835
0.0	1.6	2200	0.0000	-8.4501	-25.9071	1.0	17.4569	-292.9228	-151.8381	-0.8823	-0.4759
0.0	1.68	2300	0.0000	-8.4800	-25.9444	1.0	17.4644	-293.2967	-152.1372	-0.8982	-0.4964
0.0	1.75	2400	0.0000	-8.4864	-25.9459	1.0	17.4596	-293.3117	-152.2005	-0.9013	-0.4999
0.0	1.82	2500	0.0000	-8.4849	-25.9483	1.0	17.4633	-293.3352	-152.1862	-0.9014	-0.4994
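
The Epoch and Step columns allow a rough estimate of the training set size, which the card does not state. The sketch below is an inference, not a reported figure. It assumes every optimizer step consumes exactly total_train_batch_size = 16 examples and that the Epoch column is rounded to two decimal places.

```python
# Final row of the table above: step 2500 at epoch 1.82.
final_step = 2500
final_epoch = 1.82
total_train_batch_size = 16

# Examples consumed by the end of training (assumes full batches on every step).
examples_seen = final_step * total_train_batch_size

# Epoch is shown to two decimals, so the true value lies within +/- 0.005.
epoch_low = final_epoch - 0.005
epoch_high = final_epoch + 0.005

# A larger true epoch implies a smaller dataset, and vice versa.
dataset_size_low = examples_seen / epoch_high
dataset_size_high = examples_seen / epoch_low

print(f"examples seen: {examples_seen}")
print(f"estimated examples per epoch: {dataset_size_low:.0f} to {dataset_size_high:.0f}")
```

Under these assumptions the training set holds roughly 22,000 preference pairs. Multi-device training or a dropped final batch would change the estimate.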

Framework versions

Transformers 4.33.2
Pytorch 2.0.1+cu118
Datasets 2.14.5
Tokenizers 0.13.3
