---
library_name: transformers
tags: []
---

# Fine-tune Llama 3 with ORPO 

ORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process, reducing the computational resources and time required for training. Moreover, empirical results show that ORPO outperforms other alignment methods across various model sizes and benchmarks.

In this article, we will fine-tune the new Llama 3 8B model using ORPO with the TRL library. 


## ORPO
Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: 1/ Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by 2/ preference alignment methods like Reinforcement Learning with Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.

However, researchers have identified a limitation in this approach. While SFT effectively adapts the model to the desired domain, it inadvertently increases the probability of generating undesirable answers alongside preferred ones. This is why the preference alignment stage is necessary to widen the gap between the likelihoods of preferred and rejected outputs.
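
ORPO removes this separate alignment stage by adding an odds-ratio penalty to the standard negative log-likelihood loss, so a single training run both adapts and aligns the model. As a sketch of the objective from the paper (with $y_w$ the chosen response, $y_l$ the rejected one, and $\lambda$ weighting the penalty):

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\big[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\big],
\qquad
\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right),
$$

where $\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$ and $\mathcal{L}_{SFT}$ is the usual cross-entropy loss on the chosen response.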

For more details on ORPO, see the paper: [ORPO: Monolithic Preference Optimization without Reference Model](https://arxiv.org/abs/2403.07691).

## Fine-tuning Llama 3 with ORPO 

[Llama 3](https://github.com/meta-llama/llama3/tree/main) is the latest family of LLMs developed by Meta. The models were trained on an extensive dataset of 15 trillion tokens (compared to 2T tokens for Llama 2). Two model sizes have been released: a 70 billion parameter model and a smaller 8 billion parameter model. The 70B model has already demonstrated impressive performance, scoring 82 on the MMLU benchmark and 81.7 on the HumanEval benchmark.

Llama 3 models also increase the context length to 8,192 tokens (up from 4,096 in Llama 2), which could potentially scale to 32k with RoPE. Additionally, the models use a new tokenizer with a 128K-token vocabulary, reducing the number of tokens required to encode text by about 15%. This larger vocabulary also explains the bump from 7B to 8B parameters.


## Hardware

- I used an NVIDIA A100 80 GB GPU. Note that training and testing require a capable GPU; this setup will not fit on GPUs with much less memory.
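
Before launching anything, it can help to confirm which GPU is visible and how much memory it has. A minimal sanity-check sketch using PyTorch:

```python
import torch

# Print the visible GPU and its total memory so you know whether the
# QLoRA configuration below will fit.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; this workflow is not practical on CPU.")
```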

## Required packages
```bash
pip install -U transformers datasets accelerate peft trl bitsandbytes wandb
pip install -qqq flash-attn
pip install -qU transformers accelerate
```

Once the packages are installed, we can import the necessary libraries and log in to W&B (optional):

```python

"""
Weights & Biases (wandb) is optional but useful for tracking training runs.
Create an account at https://wandb.ai and copy your API key (wb_token).
"""

import gc
import os

import torch
import wandb
from datasets import load_dataset



# Directly insert your Weights & Biases API key here
wb_token = 'your_wb_token'
wandb.login(key=wb_token)


from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,)

from trl import ORPOConfig, ORPOTrainer, setup_chat_format
```

If you have a recent GPU, you should also be able to use the Flash Attention library to replace the default eager attention implementation with a more efficient one.


```python

# Use Flash Attention 2 and bfloat16 on Ampere-class (compute capability >= 8) GPUs;
# otherwise fall back to the default eager attention with float16.
if torch.cuda.get_device_capability()[0] >= 8:
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16


##################################

import sys
import os

cwd = os.getcwd()
# sys.path.append(cwd + '/my_directory')
sys.path.append(cwd)


# Helper: walk `depth` levels up from the current working directory, add the
# parents to sys.path, and return the resulting root path.
def setting_directory(depth):
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
        sys.path.append(os.path.dirname(root_dir))
    return root_dir

# I load the model from a local directory; you can also load it directly from the Hugging Face Hub
model_path = "/data/bio-eng-llm/llm_repo/mlabonne/OrpoLlama-3-8B"
```

Next, we load OrpoLlama-3-8B in 4-bit precision thanks to bitsandbytes and set the LoRA configuration for QLoRA using PEFT. I'm also using the convenient setup_chat_format() function to modify the model and tokenizer for ChatML support: it automatically applies the chat template, adds the special tokens, and resizes the model's embedding layer to match the new vocabulary size.



```python

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype= torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation= attn_implementation
)


model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)
```

Now that the model is ready for training, we can take care of the dataset. We load mlabonne/orpo-dpo-mix-40k and use the apply_chat_template() function to convert the "chosen" and "rejected" columns into the ChatML format. Note that I'm only using 1,000 samples and not the entire dataset, as it would take too long to run.

First, we need to set a few hyperparameters:

- learning_rate: ORPO uses very low learning rates compared to traditional SFT or even DPO. The value of 8e-6 comes from the original paper and roughly corresponds to an SFT learning rate of 1e-5 and a DPO learning rate of 5e-6. For a real fine-tune, I would recommend adjusting it to around 1e-6.
- beta: This is the $\lambda$ parameter in the paper, with a default value of 0.1. An appendix of the original paper shows how it was selected with an ablation study.
- Other parameters, like max_length and the batch size, are set to use as much VRAM as available (~20 GB in this configuration). Ideally we would train the model for 3-5 epochs; the original article uses 1, but this run uses 20 (see the epochs variable below).

Finally, we can train the model using the ORPOTrainer, which acts as a wrapper.


```python
# I saved the dataset (mlabonne/orpo-dpo-mix-40k) in a local directory, but you can
# also load it directly from the Hugging Face Hub by its name.
dataset_name = "/data/bio-eng-llm/llm_repo/mlabonne/OrpoLlama-3-8B"

dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=42).select(range(1000))


def format_chat_template(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc= os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)

epochs=20

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    beta=0.1,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=epochs,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()

# Define the directory where you want to save the fine-tuned model
root_dir = setting_directory(0)
save_dir = os.path.join(
    root_dir, "models", "fine_tuned_models", f"OrpoLlama-3-8B_{epochs}e_qa_qa"
)

# Create the directory if it doesn't exist
os.makedirs(save_dir, exist_ok=True)

# Save the model (adapter) to the specified directory
trainer.save_model(save_dir)
```

Training the model on these 1,000 samples for 20 epochs took about 22 hours on an NVIDIA A100 80 GB GPU, although, based on the W&B graphs, only about 34 GB of VRAM was used. Let's check the W&B plots:
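
Once you have reviewed the training curves, it helps to free the GPU memory held by the trainer before reloading the model for testing. A small sketch (this is also what the earlier gc import is for):

```python
# Free VRAM held by the training run before reloading models for evaluation.
del trainer, model
gc.collect()
torch.cuda.empty_cache()
```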





## Test the model


### Required packages
```bash
pip install -U transformers datasets accelerate peft trl bitsandbytes wandb
pip install -qqq flash-attn
pip install -qU transformers accelerate
```

_Hint_:
- You need a wandb account.
- Visit https://wandb.ai/your_account to find your API key.
- Copy your wandb token and keep it somewhere safe; the script below expects it (see the sketch after this list for reading it from an environment variable).
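
Rather than hard-coding the key in the script as done in the snippets here, you can read it from an environment variable. A minimal sketch, assuming you have exported it as WANDB_API_KEY:

```python
import os
import wandb

# Read the Weights & Biases API key from the environment instead of the source file.
wb_token = os.environ.get("WANDB_API_KEY")
wandb.login(key=wb_token)
```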



```python
import gc
import os

import torch
import wandb
from datasets import load_dataset



# Directly insert your Weights & Biases API key here
wb_token = 'your_wb_token'
wandb.login(key=wb_token)


from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,)

from trl import ORPOConfig, ORPOTrainer, setup_chat_format



# Use Flash Attention 2 and bfloat16 on Ampere-class (compute capability >= 8) GPUs;
# otherwise fall back to the default eager attention with float16.
if torch.cuda.get_device_capability()[0] >= 8:
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16


##################################

import sys
import os

cwd = os.getcwd()
# sys.path.append(cwd + '/my_directory')
sys.path.append(cwd)


def setting_directory(depth):
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
        sys.path.append(os.path.dirname(root_dir))
    return root_dir

# I loaded the base model from a local directory, but you can load it directly from the Hugging Face Hub
model_path = "/data/bio-eng-llm/llm_repo/mlabonne/OrpoLlama-3-8B"


###################################
###################################

"""
# Model
base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "OrpoLlama-3-8B"
"""


# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype= torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)


# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model, tokenizer = setup_chat_format(model, tokenizer)



root_dir = setting_directory(0)
epochs = 20

# I load the fine-tuned adapter from my local directory, but yours may be elsewhere
new_model_path = os.path.join(
    root_dir, "models", "fine_tuned_models", f"OrpoLlama-3-8B_{epochs}e_qa_qa"
)


### Merge adapter with base model
model = PeftModel.from_pretrained(model, new_model_path)
model = model.merge_and_unload()

print("#############################")
print("#############################")
print(model)




# Pushing the model to the Hugging Face Hub

from huggingface_hub import HfApi, login

# Log in to Hugging Face with a write-access token
login(token="your_huggingface_token")

# Define your Hugging Face repository name
repo_name = "your_name/OrpoLlama-3-8B_fine_tune_trl"



# Push the merged model and tokenizer to the Hub
model.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)
```
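
The script above merges the LoRA adapter and pushes the result to the Hub, but it never actually generates text. As a quick functional check, you can run the merged model through the text-generation pipeline (a minimal sketch reusing the model and tokenizer from the block above; the prompt is just an example):

```python
# Build a chat prompt with the ChatML template configured by setup_chat_format().
messages = [{"role": "user", "content": "Explain ORPO fine-tuning in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate a response with the merged model (pipeline was imported from transformers above).
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
```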



## References

- [Fine-tune Llama 3 with ORPO](https://huggingface.co/blog/mlabonne/orpo-llama-3)
- [mlabonne/orpo-dpo-mix-40k dataset](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k/tree/main)


