---
license: apache-2.0
datasets:
- openbmb/UltraInteract_pair
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---
This is a model released for our paper: [Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF](https://arxiv.org/abs/2410.04612). 

# REFUEL-Llama-3-Armo-iter_2

This model was trained with REFUEL starting from [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), using [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model and the [UltraInteract](https://huggingface.co/datasets/openbmb/UltraInteract_pair) dataset.
The training code is available at https://github.com/ZhaolinGao/REFUEL.
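
## Usage

A minimal sketch of loading and querying the model with 🤗 Transformers (this is standard `AutoModelForCausalLM` usage rather than code from the REFUEL repository; adjust the dtype and device settings to your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REFUEL-Llama-3-Armo-iter_2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Llama-3-Instruct models expect the chat template for (multi-turn) prompts.
messages = [
    {"role": "user", "content": "Solve: what is 17 * 24?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```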

## Evaluations

<table>
  <tr>
    <th rowspan="2">Method</th>
    <th rowspan="2">Dataset</th>
    <th colspan="6">Winrate at Turn</th>
  </tr>
  <tr>
    <th>h = 1</th>
    <th>h = 2</th>
    <th>h = 3</th>
    <th>h = 4</th>
    <th>h = 5</th>
    <th>avg</th>
  </tr>
  <tr>
    <td>Llama-3.1-70B-it</td>
    <td> N/A </td>
    <td>70.4</td>
    <td>66.4</td>
    <td>61.0</td>
    <td>53.0</td>
    <td>55.4</td>
    <td>61.24</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_1">REFUEL-Llama-3-Armo-iter_1</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_1">REFUEL-Ultrainteract-Llama-3-Armo-iter_1</a></td>
    <td>54.6</td>
    <td>53.6</td>
    <td>57.8</td>
    <td>56.2</td>
    <td>59.4</td>
    <td>56.32</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/Cornell-AGI/REFUEL-Llama-3-Armo-iter_2">REFUEL-Llama-3-Armo-iter_2</a></td>
    <td><a href="https://huggingface.co/datasets/Cornell-AGI/REFUEL-Ultrainteract-Llama-3-Armo-iter_2">REFUEL-Ultrainteract-Llama-3-Armo-iter_2</a></td>
    <td>55.2</td>
    <td>53.4</td>
    <td>58.8</td>
    <td>57.2</td>
    <td>58.6</td>
    <td>56.64</td>
  </tr>
</table>

## Citation
Please cite our paper if you use this model in your own work:
```
@misc{gao2024regressingrelativefutureefficient,
      title={Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF}, 
      author={Zhaolin Gao and Wenhao Zhan and Jonathan D. Chang and Gokul Swamy and Kianté Brantley and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2410.04612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.04612}, 
}
```