File size: 3,541 Bytes
b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb 2076437 b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb 2076437 b7ba939 0c324fb b7ba939 0c324fb b7ba939 0c324fb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
language:
- en
pipeline_tag: text-generation
---
Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
# Llama-3-Instruct-8B-SPPO-Iter1
This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 1, based on the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
## Links to Other Models
- [Llama-3-Instruct-8B-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter1)
- [Llama-3-Instruct-8B-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2)
- [Llama-3-Instruct-8B-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3)
### Model Description
- Model type: A 8B parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)
| Model | LC. Win Rate | Win Rate | Avg. Length |
|-------------------------------------------|:------------:|:--------:|:-----------:|
|[Llama-3-8B-SPPO Iter1](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter1) |31.73 |31.74 | 1962
|[Llama-3-8B-SPPO Iter2](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2) |35.15 |35.98 | 2021
|[Llama-3-8B-SPPO Iter3](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3) |**38.77** |**39.85** | 2066
## [Open LLM Leaderboard Evaluation Results](https://github.com/EleutherAI/lm-evaluation-harness)
Results are reported by using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.1
| | arc_challenge | truthfulqa_mc2 | winogrande | gsm8k | hellaswag | mmlu | average |
|--------|---------------|----------------|------------|-------|-----------|-------|---------|
|[Llama-3-8B-SPPO Iter1](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter1) | 63.82 | 54.96 | 76.40 | 75.44 | 79.80 | 65.65 | 69.35
|[Llama-3-8B-SPPO Iter2](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2) | 64.93 | 56.48 | 76.87 | 75.13 | 80.39 | 65.67 | 69.91
|[Llama-3-8B-SPPO Iter3](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3) | 65.19 | 58.04 | 77.11 | 74.91 | 80.86 | 65.60 | **70.29**
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- eta: 1000
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 6.0 (stop at epoch=1.0)
## Citation
```
@misc{wu2024self,
title={Self-Play Preference Optimization for Language Model Alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024},
eprint={2405.00675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
``` |