This model is released with our paper: Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF.

REFUEL-Llama-3-Armo-iter_2

This model was trained with REFUEL, starting from Meta-Llama-3-8B-Instruct, using ArmoRM-Llama3-8B-v0.1 as the reward model and the UltraInteract dataset. The training code is available at https://github.com/ZhaolinGao/REFUEL.
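
The snippet below is a minimal usage sketch, not part of the REFUEL training code. It assumes the checkpoint loads through the standard transformers causal-LM API and that the tokenizer ships the Llama-3 chat template.

```python
# Minimal usage sketch (assumption: standard transformers causal-LM loading works
# for this checkpoint and the tokenizer provides the Llama-3 chat template).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REFUEL-Llama-3-Armo-iter_2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single user turn; for multi-turn use, append the assistant reply and the
# next user message to `messages` before generating again.
messages = [{"role": "user", "content": "Solve: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```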

Evaluations

Winrate at each turn h and overall average:

| Method | Dataset | h = 1 | h = 2 | h = 3 | h = 4 | h = 5 | avg |
|---|---|---|---|---|---|---|---|
| Llama-3.1-70B-it | N/A | 70.4 | 66.4 | 61.0 | 53.0 | 55.4 | 61.24 |
| REFUEL-Llama-3-Armo-iter_1 | REFUEL-Ultrainteract-Llama-3-Armo-iter_1 | 54.6 | 53.6 | 57.8 | 56.2 | 59.4 | 56.32 |
| REFUEL-Llama-3-Armo-iter_2 | REFUEL-Ultrainteract-Llama-3-Armo-iter_2 | 55.2 | 53.4 | 58.8 | 57.2 | 58.6 | 56.64 |

Citation

Please cite our paper if you use this model in your own work:

@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF}, 
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612}, 
}