---
base_model: meta-llama/Llama-2-13b-chat-hf
tags:
- generated_from_trainer
- trl
metrics:
- accuracy
model-index:
- name: llama-2-13b-reward-oasst1
  results: []
datasets:
- tasksource/oasst1_pairwise_rlhf_reward
library_name: peft
pipeline_tag: text-classification
---


# llama-2-13b-reward-oasst1

This model is a fine-tuned version of [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) on the [tasksource/oasst1_pairwise_rlhf_reward](https://huggingface.co/datasets/tasksource/oasst1_pairwise_rlhf_reward) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4810
- Accuracy: 0.7869

See also [vincentmin/llama-2-7b-reward-oasst1](https://huggingface.co/vincentmin/llama-2-7b-reward-oasst1) for a 7b version of this model.

## Model description

This is a reward model trained with QLoRA in 4-bit precision. The base model is [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), whose license you need to have accepted in order to use it. Once you have been granted access, you can load the reward model as follows:
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "vincentmin/llama-2-13b-reward-oasst1"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit with a single-logit classification head,
# then attach the LoRA adapter weights of the reward model.
model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    load_in_4bit=True,
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_auth_token=True)

# The scalar logit of the classification head is the reward.
model.eval()
with torch.no_grad():
    reward = model(**tokenizer("prompter: hello world. assistant: foo bar", return_tensors="pt")).logits
print(reward)
```
For best results, one should use the prompt format used during training:
```
prompt = "prompter: <prompt_1> assistant: <response_1> prompter: <prompt_2> ..."
```
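
For example, a multi-turn conversation can be flattened into this format and scored with the `model` and `tokenizer` loaded above. The `format_conversation` helper below is only an illustrative sketch and is not part of this repository:
```python
import torch

def format_conversation(turns):
    """turns: list of (role, text) pairs, with role being "prompter" or "assistant"."""
    return " ".join(f"{role}: {text}" for role, text in turns)

turns = [
    ("prompter", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
]
inputs = tokenizer(format_conversation(turns), return_tensors="pt").to(model.device)

with torch.no_grad():
    # The single classification logit is the scalar reward.
    reward = model(**inputs).logits[0, 0].item()
print(reward)
```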
Please use a version of `peft` in which [#755](https://github.com/huggingface/peft/pull/755) has been merged, otherwise the model may not be loaded correctly. Installing `peft` from source with `pip install git+https://github.com/huggingface/peft.git` guarantees this.

## Intended uses & limitations

Since the model was trained on oasst1 data, its reward scores will reflect any biases present in that dataset.

## Training and evaluation data

The model was trained using QLoRA and the `trl` library's `RewardTrainer` on the [tasksource/oasst1_pairwise_rlhf_reward](https://huggingface.co/datasets/tasksource/oasst1_pairwise_rlhf_reward) dataset. Examples with more than 512 tokens were filtered out of both the training and evaluation data.
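
As a rough sketch (not the original training script), the length filter could be implemented as follows, assuming the dataset exposes `prompt`, `chosen`, and `rejected` text columns:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token=True)
dataset = load_dataset("tasksource/oasst1_pairwise_rlhf_reward")

MAX_LENGTH = 512

def within_length(example):
    # Keep an example only if both the chosen and the rejected sequence
    # fit within the 512-token budget used during training.
    chosen = tokenizer(example["prompt"] + example["chosen"])["input_ids"]
    rejected = tokenizer(example["prompt"] + example["rejected"])["input_ids"]
    return len(chosen) <= MAX_LENGTH and len(rejected) <= MAX_LENGTH

dataset = dataset.filter(within_length)
```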

## Training procedure

### Training hyperparameters

The following `bitsandbytes` quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float16

The following hyperparameters were used during training (see the configuration sketch after the list):
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- max_seq_length: 512
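
As a rough illustration of how the quantization settings and hyperparameters listed above could be wired together with `peft` and `trl`'s `RewardTrainer` (assuming the `trl` API from around the time of training), one might write something like the sketch below. The LoRA settings, the split names, and the preprocessing details are not reported on this card and are assumptions:
```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import RewardTrainer

base_model = "meta-llama/Llama-2-13b-chat-hf"

# Quantization config matching the bitsandbytes settings listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=1,
    quantization_config=bnb_config,
    use_auth_token=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token by default
model.config.pad_token_id = tokenizer.pad_token_id

# RewardTrainer expects tokenized chosen/rejected pairs.
def tokenize_pair(example):
    chosen = tokenizer(example["prompt"] + example["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(example["prompt"] + example["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = load_dataset("tasksource/oasst1_pairwise_rlhf_reward")
train_dataset = dataset["train"].map(tokenize_pair)
eval_dataset = dataset["validation"].map(tokenize_pair)  # split name assumed

# Illustrative LoRA settings; the actual values are not reported on this card.
peft_config = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05)

training_args = TrainingArguments(
    output_dir="llama-2-13b-reward-oasst1",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
)

# trl attaches the LoRA adapter to the quantized base model when peft_config is passed.
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    max_length=512,
)
trainer.train()
```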

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 0.5602        | 0.08  | 250  | 0.5436          | 0.7388   |
| 0.6166        | 0.17  | 500  | 0.5340          | 0.7468   |
| 0.6545        | 0.25  | 750  | 0.4899          | 0.7644   |
| 0.5635        | 0.33  | 1000 | 0.4877          | 0.7532   |
| 0.5933        | 0.42  | 1250 | 0.4930          | 0.7660   |
| 0.5758        | 0.5   | 1500 | 0.4851          | 0.7740   |
| 0.5212        | 0.58  | 1750 | 0.5021          | 0.7788   |
| 0.5251        | 0.67  | 2000 | 0.4893          | 0.7804   |
| 0.5145        | 0.75  | 2250 | 0.4924          | 0.7853   |
| 0.5085        | 0.83  | 2500 | 0.4934          | 0.7853   |
| 0.617         | 0.92  | 2750 | 0.4803          | 0.7821   |
| 0.5525        | 1.0   | 3000 | 0.4810          | 0.7869   |


### Framework versions


- PEFT 0.5.0.dev0 (with https://github.com/huggingface/peft/pull/755)
- Transformers 4.32.0.dev0
- Pytorch 2.0.1+cu118
- Datasets 2.14.0
- Tokenizers 0.13.3