dpo

This model is a fine-tuned version of microsoft/phi-1_5 on an unspecified dataset (no dataset name was recorded when the card was generated). It achieves the following results on the evaluation set:

  • Loss: 0.0000
  • Rewards/chosen: -8.4849
  • Rewards/rejected: -25.9483
  • Rewards/accuracies: 1.0
  • Rewards/margins: 17.4633
  • Logps/rejected: -293.3352
  • Logps/chosen: -152.1862
  • Logits/rejected: -0.9014
  • Logits/chosen: -0.4994
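
The reported figures can be checked against each other. The sketch below assumes the metric names follow the usual DPO logging convention, where the margin is the chosen reward minus the rejected reward and the sigmoid DPO loss for one pair is the negative log-sigmoid of that margin. The card reports batch means, so treating them as a single pair is only an approximation; the function names are illustrative and not taken from the training code.

```python
import math


def reward_margin(chosen: float, rejected: float) -> float:
    """Margin as usually logged for DPO: chosen reward minus rejected reward."""
    return chosen - rejected


def dpo_loss_from_margin(margin: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(margin)) = log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


# Final evaluation values copied from the list above.
margin = reward_margin(chosen=-8.4849, rejected=-25.9483)
loss = dpo_loss_from_margin(margin)

print(f"margin = {margin:.4f}")  # compare with Rewards/margins
print(f"loss   = {loss:.4f}")    # compare with Loss
```

Under these assumptions the margin agrees with the reported 17.4633 up to rounding, and a margin that large pushes the loss below four-decimal precision, which is why it prints as 0.0000. Both rewards are negative, so under this convention the policy assigns lower log-probability than the reference model even to the chosen responses.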

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0005
  • train_batch_size: 4
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 2500
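
To make the interaction between these settings concrete, here is a minimal sketch of the effective batch size and the learning-rate curve. It assumes a single device (the card does not state the device count) and a linear warmup followed by a half-cosine decay to zero, which is the common shape for a `cosine` scheduler; the exact schedule used in training is not confirmed by the card.

```python
import math

LEARNING_RATE = 5e-4    # learning_rate
WARMUP_STEPS = 100      # lr_scheduler_warmup_steps
TRAINING_STEPS = 2500   # training_steps
TRAIN_BATCH_SIZE = 4    # train_batch_size
GRAD_ACCUM_STEPS = 4    # gradient_accumulation_steps
NUM_DEVICES = 1         # assumption: not stated in the card


def effective_batch_size() -> int:
    """Examples consumed per optimizer step."""
    return TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then half-cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TRAINING_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size())
for step in (0, 50, 100, 1300, 2500):
    print(step, f"{lr_at(step):.6f}")
```

With one device, 4 examples per batch and 4 accumulation steps reproduce the listed total_train_batch_size of 16. The rate peaks at step 100 and falls to half the peak at step 1300, the midpoint of the decay phase.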

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0318 | 0.07 | 100 | 0.0384 | -0.3956 | -7.7708 | 0.9835 | 7.3753 | -111.5607 | -71.2923 | 1.1941 | 1.0925 |
| 0.0187 | 0.15 | 200 | 0.0196 | -2.0328 | -10.9862 | 0.9922 | 8.9535 | -143.7145 | -87.6645 | -0.8539 | -0.9067 |
| 0.0101 | 0.22 | 300 | 0.0351 | -2.7345 | -12.1219 | 0.9896 | 9.3874 | -155.0717 | -94.6821 | 0.4420 | 0.5220 |
| 0.046 | 0.29 | 400 | 0.0199 | -6.6027 | -18.5556 | 0.9922 | 11.9529 | -219.4086 | -133.3638 | -2.3908 | -2.0500 |
| 0.0005 | 0.36 | 500 | 0.0101 | -6.4299 | -20.5496 | 0.9965 | 14.1197 | -239.3484 | -131.6356 | -1.0029 | -0.6334 |
| 0.0003 | 0.44 | 600 | 0.0092 | -9.0181 | -23.0513 | 0.9965 | 14.0332 | -264.3652 | -157.5181 | -1.6334 | -1.1488 |
| 0.0004 | 0.51 | 700 | 0.0043 | -5.7377 | -21.3127 | 0.9991 | 15.5749 | -246.9788 | -124.7142 | -0.8477 | -0.4037 |
| 0.0001 | 0.58 | 800 | 0.0040 | -8.9021 | -23.9436 | 0.9991 | 15.0415 | -273.2885 | -156.3581 | 0.2782 | 0.8244 |
| 0.0001 | 0.66 | 900 | 0.0031 | -9.3191 | -24.3563 | 0.9991 | 15.0371 | -277.4149 | -160.5282 | -0.7279 | -0.2168 |
| 0.002 | 0.73 | 1000 | 0.0066 | -6.8680 | -23.5822 | 0.9974 | 16.7142 | -269.6745 | -136.0172 | -0.6629 | 0.2962 |
| 0.0002 | 0.8 | 1100 | 0.0015 | -9.1417 | -27.6276 | 0.9991 | 18.4859 | -310.1280 | -158.7536 | -1.2030 | -0.5215 |
| 0.0823 | 0.87 | 1200 | 0.0057 | -4.4568 | -18.4378 | 0.9974 | 13.9810 | -218.2306 | -111.9051 | 0.2236 | 0.7934 |
| 0.0 | 0.95 | 1300 | 0.0171 | -8.1530 | -25.5603 | 0.9983 | 17.4073 | -289.4550 | -148.8665 | -1.2413 | -0.9611 |
| 0.0007 | 1.02 | 1400 | 0.0019 | -7.9402 | -25.1905 | 0.9983 | 17.2503 | -285.7569 | -146.7384 | -1.2325 | -0.8924 |
| 0.0002 | 1.09 | 1500 | 0.0010 | -8.1543 | -25.2960 | 0.9991 | 17.1417 | -286.8122 | -148.8794 | -1.0005 | -0.6261 |
| 0.0 | 1.17 | 1600 | 0.0010 | -8.4019 | -25.6275 | 0.9991 | 17.2256 | -290.1275 | -151.3556 | -1.0850 | -0.7170 |
| 0.0 | 1.24 | 1700 | 0.0011 | -8.8691 | -26.2284 | 0.9991 | 17.3593 | -296.1366 | -156.0278 | -1.1426 | -0.7830 |
| 0.0 | 1.31 | 1800 | 0.0010 | -9.2896 | -26.9277 | 0.9991 | 17.6381 | -303.1297 | -160.2331 | -1.1169 | -0.7512 |
| 0.0001 | 1.39 | 1900 | 0.0011 | -9.2869 | -26.9301 | 0.9991 | 17.6432 | -303.1532 | -160.2053 | -1.1213 | -0.7560 |
| 0.0 | 1.46 | 2000 | 0.0008 | -8.4453 | -25.9094 | 0.9991 | 17.4641 | -292.9459 | -151.7894 | -0.8854 | -0.4791 |
| 0.0 | 1.53 | 2100 | 0.0007 | -8.4600 | -25.9284 | 0.9991 | 17.4684 | -293.1361 | -151.9364 | -0.8893 | -0.4835 |
| 0.0 | 1.6 | 2200 | 0.0000 | -8.4501 | -25.9071 | 1.0 | 17.4569 | -292.9228 | -151.8381 | -0.8823 | -0.4759 |
| 0.0 | 1.68 | 2300 | 0.0000 | -8.4800 | -25.9444 | 1.0 | 17.4644 | -293.2967 | -152.1372 | -0.8982 | -0.4964 |
| 0.0 | 1.75 | 2400 | 0.0000 | -8.4864 | -25.9459 | 1.0 | 17.4596 | -293.3117 | -152.2005 | -0.9013 | -0.4999 |
| 0.0 | 1.82 | 2500 | 0.0000 | -8.4849 | -25.9483 | 1.0 | 17.4633 | -293.3352 | -152.1862 | -0.9014 | -0.4994 |
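
Because the training data is not described, the Epoch and Step columns are the only hint at its size. The following is a rough estimate, assuming the epoch value is rounded to two decimals, each optimizer step consumes total_train_batch_size = 16 preference pairs, and no examples were dropped or repeated within an epoch; the helper name is illustrative.

```python
TOTAL_TRAIN_BATCH_SIZE = 16  # total_train_batch_size from the hyperparameters


def steps_per_epoch_bounds(step: int, epoch: float, decimals: int = 2):
    """Range of steps-per-epoch values consistent with a rounded epoch figure."""
    half_unit = 0.5 * 10 ** (-decimals)
    return step / (epoch + half_unit), step / (epoch - half_unit)


# Last logged row: step 2500 at epoch 1.82.
low, high = steps_per_epoch_bounds(step=2500, epoch=1.82)

print(f"steps per epoch: {low:.0f} to {high:.0f}")
print(
    "training pairs:  "
    f"{low * TOTAL_TRAIN_BATCH_SIZE:.0f} to {high * TOTAL_TRAIN_BATCH_SIZE:.0f}"
)
```

Under those assumptions the run covered about 1,370 to 1,377 steps per epoch, which corresponds to roughly 22,000 preference pairs in the training split. This is an inference from the log, not a figure stated by the author.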

Framework versions

  • Transformers 4.33.2
  • PyTorch 2.0.1+cu118
  • Datasets 2.14.5
  • Tokenizers 0.13.3

Model tree for TrevorJS/mtg-dpo-fail

  • Base model: microsoft/phi-1_5
  • Finetuned: this model