angelahzyuan committed
Commit 5e37a0b • Parent: 9b2b8e0
Update README.md
README.md CHANGED
@@ -10,7 +10,7 @@ Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
 
 # Mistral7B-PairRM-SPPO-Iter1
 
-This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration
+This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 2, based on the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) architecture as the starting point. We used the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, split into 3 parts for 3 iterations following [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
 
 **This is the model reported in the paper**, with K=5 (5 responses generated per prompt per iteration). We attached the Arena-Hard eval results on this model page.
 
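Since the updated card highlights the K=5 setting (5 responses sampled per prompt per iteration), below is a minimal sketch of loading the model and drawing 5 sampled responses with Hugging Face `transformers`. This snippet is not from the model card; the repo id `UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1` is an assumption inferred from the heading and does not appear in the diff.

```python
# Minimal sketch: load the SPPO checkpoint and sample K=5 responses for one
# prompt, mirroring the K=5 setting described in the model card.
# ASSUMPTION: the Hub repo id below is inferred from the heading; adjust if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mistral-7B-Instruct-v0.2 ships a chat template; apply it before generating.
messages = [{"role": "user", "content": "Summarize self-play preference optimization."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample 5 candidate responses per prompt (the K=5 used during SPPO training).
outputs = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=256,
    num_return_sequences=5,
)
for i, out in enumerate(outputs):
    print(f"--- response {i} ---")
    print(tokenizer.decode(out[input_ids.shape[-1]:], skip_special_tokens=True))
```

In the SPPO pipeline, such candidate responses would then be scored by a preference model (PairRM, per the model name) to build the next iteration's training pairs; for ordinary inference you would typically keep `num_return_sequences=1`.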