library_name: transformers
license: mit
datasets:
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
Llama-3-Base-8B-DICE-Iter1
This model was trained with DICE (Bootstrapping Language Models with DPO Implicit Rewards) at iteration 1, starting from the princeton-nlp/Llama-3-Base-8B-SFT-DPO checkpoint.
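For readers unfamiliar with the term, the DPO implicit reward of a response y to a prompt x is β·(log π_θ(y|x) − log π_ref(y|x)), where π_θ is the DPO-tuned policy and π_ref is its reference model. The sketch below shows how such a reward could be used to rank sampled responses and form a new preference pair. It is an illustration under assumptions, not the authors' training pipeline: the helper names, the `beta=0.1` default, and the best-versus-worst pairing rule are all hypothetical, and the per-token log-probabilities are assumed to have been computed elsewhere.

```python
def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    Each argument is the list of per-token log-probabilities that the
    policy / reference model assigns to the same response tokens.
    beta=0.1 is an assumed value, not one stated in this card.
    """
    return beta * (sum(policy_token_logprobs) - sum(ref_token_logprobs))


def make_preference_pair(candidates, beta=0.1):
    """Pick (chosen, rejected) from candidates by implicit reward.

    candidates: list of (text, policy_token_logprobs, ref_token_logprobs).
    Pairing rule (assumed): highest-reward response is chosen,
    lowest-reward response is rejected.
    """
    scored = sorted(
        candidates,
        key=lambda c: implicit_reward(c[1], c[2], beta),
        reverse=True,
    )
    return scored[0][0], scored[-1][0]
```

Pairs built this way could then feed another round of DPO, which is the bootstrapping idea the model name refers to.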
Model Description
- Model type: An 8B-parameter GPT-like model fine-tuned on synthetic datasets.
- Language(s) (NLP): Primarily English
- License: MIT
- Fine-tuned from model: princeton-nlp/Llama-3-Base-8B-SFT-DPO
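A minimal loading sketch with the `transformers` library named in the metadata is shown below. Several things here are assumptions: the card does not state this model's hub repository path, so `MODEL_ID` is set to the starting checkpoint listed above and should be replaced with the actual path; the sampling values are illustrative; and `device_map="auto"` assumes the `accelerate` package is installed.

```python
# Assumption: placeholder id taken from the "Fine-tuned from model" entry.
# Replace it with this model's own repository path.
MODEL_ID = "princeton-nlp/Llama-3-Base-8B-SFT-DPO"


def generation_kwargs(max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Sampling settings; the defaults are illustrative, not from the card."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt, model_id=MODEL_ID, **overrides):
    """Load the model and return a sampled continuation of `prompt`."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # halves memory relative to float32
        device_map="auto",           # requires the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **{**generation_kwargs(), **overrides})
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Explain DPO in one paragraph.", max_new_tokens=128)` would download the weights on first use, so an 8B model needs roughly 16 GB of accelerator memory in bfloat16.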
AlpacaEval Leaderboard Evaluation Results
Model | LC Win Rate (%) | Win Rate (%) |
---|---|---|
Llama-3-Base-8B-SFT-DPO | 18.20 | 15.50 |
Llama-3-Base-8B-DICE-Iter1 | 25.08 | 25.77 |
Llama-3-Base-8B-DICE-Iter2 | 27.55 | 30.99 |
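To read the table as gains over the starting checkpoint, the differences can be computed directly from the reported numbers. The snippet below is a small sketch; the function name and data layout are hypothetical, and the values are copied from the table above.

```python
# (LC win rate, win rate) per model, copied from the table above.
RESULTS = {
    "Llama-3-Base-8B-SFT-DPO": (18.20, 15.50),
    "Llama-3-Base-8B-DICE-Iter1": (25.08, 25.77),
    "Llama-3-Base-8B-DICE-Iter2": (27.55, 30.99),
}


def gains_over_baseline(results, baseline="Llama-3-Base-8B-SFT-DPO"):
    """Return percentage-point gains of each non-baseline model."""
    base_lc, base_wr = results[baseline]
    return {
        name: (round(lc - base_lc, 2), round(wr - base_wr, 2))
        for name, (lc, wr) in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (lc_gain, wr_gain) in gains_over_baseline(RESULTS).items():
        print(f"{name}: +{lc_gain:.2f} LC, +{wr_gain:.2f} win rate")
```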
Citation
@article{chen2024bootstrapping,
title={Bootstrapping Language Models with DPO Implicit Rewards},
author={Chen, Changyu and Liu, Zichen and Du, Chao and Pang, Tianyu and Liu, Qian and Sinha, Arunesh and Varakantham, Pradeep and Lin, Min},
journal={arXiv preprint arXiv:2406.09760},
year={2024}
}