---
library_name: transformers
tags: []
---
# Fine-tune Llama 3 with ORPO
ORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results show that ORPO outperforms other alignment methods across various model sizes and benchmarks.
In this article, we will fine-tune the new Llama 3 8B model using ORPO with the TRL library.
<!-- Provide a quick summary of what the model is/does. -->
## ORPO
Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: 1/ Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by 2/ preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.
However, researchers have identified a limitation in this approach. While SFT effectively adapts the model to the desired domain, it inadvertently increases the probability of generating undesirable answers alongside preferred ones. This is why the preference alignment stage is necessary to widen the gap between the likelihoods of preferred and rejected outputs.
See the [ORPO paper](https://arxiv.org/abs/2403.07691) for more details.
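For intuition, here is a minimal sketch of the ORPO objective as described in the paper: the usual SFT loss on the chosen answer plus an odds-ratio term that pushes the odds of the chosen response above the rejected one. TRL's ORPOTrainer implements this for you; the names below (chosen_logps, rejected_logps, sft_loss) are illustrative and assumed to be precomputed average log-probabilities and the standard cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, sft_loss, beta=0.1):
    """Sketch of the ORPO objective: L_SFT + beta * L_OR (beta is lambda in the paper)."""
    # odds(y|x) = p / (1 - p), computed in log space for numerical stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio loss: encourage higher odds for the chosen response than the rejected one
    log_odds_ratio = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return sft_loss - beta * log_odds_ratio.mean()
```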
## Fine-tuning Llama 3 with ORPO
[Llama 3](https://github.com/meta-llama/llama3/tree/main) is the latest family of LLMs developed by Meta. The models were trained on an extensive dataset of 15 trillion tokens (compared to 2T tokens for Llama 2). Two model sizes have been released: a 70 billion parameter model and a smaller 8 billion parameter model. The 70B model has already demonstrated impressive performance, scoring 82 on the MMLU benchmark and 81.7 on the HumanEval benchmark.
Llama 3 models also increased the context length to 8,192 tokens (up from 4,096 tokens for Llama 2), and can potentially scale up to 32k with RoPE scaling. Additionally, the models use a new tokenizer with a 128K-token vocabulary, reducing the number of tokens required to encode text by 15%. This larger vocabulary also explains the bump from 7B to 8B parameters.
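As a quick sanity check, you can inspect the vocabulary size directly. This is a minimal sketch assuming you have accepted the meta-llama/Meta-Llama-3-8B license on the Hub (or point it at a local copy such as the model_path used below):

```python
from transformers import AutoTokenizer

# Illustrative check of the Llama 3 tokenizer vocabulary size (~128k tokens)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(tokenizer))
```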
## Required packages
```bash
pip install -U transformers datasets accelerate peft trl bitsandbytes wandb
pip install -qqq flash-attn
pip install -qU transformers accelerate
```
Once the packages are installed, we can import the necessary libraries and log in to W&B (optional):
```python
"""
wandb
https://wandb.ai/wandb_account
you need wb_token as well
"""
import gc
import os
import torch
import wandb
from datasets import load_dataset
# Directly insert your Weights & Biases API key here
wb_token = 'your_wb_token'
wandb.login(key=wb_token)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
pipeline,)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format
```
If you have a recent GPU (Ampere or newer), you should also be able to use the Flash Attention library to replace the default eager attention implementation with a more efficient one.
```python
# Use Flash Attention 2 on Ampere (compute capability 8.0) or newer GPUs
if torch.cuda.get_device_capability()[0] >= 8:
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16

##################################
import sys

cwd = os.getcwd()
sys.path.append(cwd)

def setting_directory(depth):
    """Return the directory `depth` levels above the current working directory."""
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
    sys.path.append(os.path.dirname(root_dir))
    return root_dir

# The model is loaded from a local directory here; replace with a Hub id
# (e.g. meta-llama/Meta-Llama-3-8B) if you do not have a local copy.
model_path = "/data/bio-eng-llm/llm_repo/mlabonne/OrpoLlama-3-8B"
```
In the following, we load OrpoLlama-3-8B in 4-bit precision thanks to bitsandbytes. We then set the LoRA configuration using PEFT for QLoRA. I'm also using the convenient setup_chat_format() function to modify the model and tokenizer for ChatML support: it automatically applies the chat template, adds special tokens, and resizes the model's embedding layer to match the new vocabulary size.
```python
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'],
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load model in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)

# Add the ChatML template and special tokens, then prepare for k-bit training
model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)
```
Now that the model is ready for training, we can take care of the dataset. We load mlabonne/orpo-dpo-mix-40k and use the apply_chat_template() function to convert the "chosen" and "rejected" columns into the ChatML format. Note that I'm only using 1,000 samples and not the entire dataset, as it would take too long to run.
First, we need to set a few hyperparameters:
- learning_rate: ORPO uses very low learning rates compared to traditional SFT or even DPO. The value of 8e-6 comes from the original paper, and roughly corresponds to an SFT learning rate of 1e-5 and a DPO learning rate of 5e-6. For a real fine-tune, I would recommend adjusting it to around 1e-6.
- beta: This is the $\lambda$ parameter in the paper, with a default value of 0.1. An appendix of the original paper shows how it was selected with an ablation study.

Other parameters, like max_length and the batch size, are set to use as much VRAM as available (~20 GB in this configuration). Ideally, we would train the model for 3-5 epochs; the run below uses 20 epochs on the 1,000-sample subset.
Finally, we can train the model using the ORPOTrainer, which acts as a wrapper.
```python
# Load the preference dataset and keep only 1,000 shuffled samples to keep the run short
dataset_name = "mlabonne/orpo-dpo-mix-40k"  # or a local copy of the dataset
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=42).select(range(1000))

def format_chat_template(row):
    # Convert the "chosen" and "rejected" conversations into ChatML strings
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)

epochs = 20

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    beta=0.1,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=epochs,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()

# Save the trained adapter next to the project root
root_dir = setting_directory(0)
save_dir = os.path.join(root_dir, f"models/fine_tuned_models/OrpoLlama-3-8B_{epochs}e_qa_qa")
os.makedirs(save_dir, exist_ok=True)
trainer.save_model(save_dir)
```
Training the model on these 1,000 samples for 20 epochs took about 22 hours on an NVIDIA A100 80 GB GPU, although the W&B graphs show that only about 34 GB of VRAM was actually used.
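After training, you can free the VRAM held by the trainer and the quantized model before loading the base model again for merging. This is a minimal sketch (and the reason gc was imported above):

```python
# Flush the trainer and quantized model from memory before reloading for merging
del trainer, model
gc.collect()
torch.cuda.empty_cache()
```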
## Test the model
Once training is finished, we reload the base model in FP16, merge the LoRA adapter trained above, and push the merged model to the Hugging Face Hub (the required packages and the optional W&B login are the same as before):
```python
import gc
import os
import sys

import torch
import wandb
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

# Optional: log in to Weights & Biases with your own API key
wb_token = "your_wb_token"
wandb.login(key=wb_token)

# References:
# https://huggingface.co/blog/mlabonne/orpo-llama-3
# https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k

# Use Flash Attention 2 on Ampere (compute capability 8.0) or newer GPUs
if torch.cuda.get_device_capability()[0] >= 8:
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16

##################################
cwd = os.getcwd()
sys.path.append(cwd)

def setting_directory(depth):
    """Return the directory `depth` levels above the current working directory."""
    current_dir = os.path.abspath(os.getcwd())
    root_dir = current_dir
    for i in range(depth):
        root_dir = os.path.abspath(os.path.join(root_dir, os.pardir))
    sys.path.append(os.path.dirname(root_dir))
    return root_dir

# Base model loaded from a local directory; on the Hub this corresponds to
# meta-llama/Meta-Llama-3-8B (base) / OrpoLlama-3-8B (fine-tuned name)
model_path = "/data/bio-eng-llm/llm_repo/mlabonne/OrpoLlama-3-8B"

# QLoRA config (kept for reference; the reload below uses plain FP16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config (as used during training)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'],
)

# Reload tokenizer and base model in FP16
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model, tokenizer = setup_chat_format(model, tokenizer)

# Merge the trained adapter with the base model
root_dir = setting_directory(0)
epochs = 20
new_model_path = os.path.join(root_dir, f"models/fine_tuned_models/OrpoLlama-3-8B_{epochs}e_qa_qa")
model = PeftModel.from_pretrained(model, new_model_path)
model = model.merge_and_unload()
print(model)

# Push the merged model and tokenizer to the Hugging Face Hub
from huggingface_hub import HfApi, login

login(token="your_huggingface_token")
repo_name = "your_name/OrpoLlama-3-8B_fine_tune_trl"
model.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)
```
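To actually query the fine-tuned model, you can load the merged checkpoint back with a text-generation pipeline. This is a minimal sketch assuming the repository name used above (your_name/OrpoLlama-3-8B_fine_tune_trl is a placeholder); the prompt is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Repository pushed above (placeholder name; adjust to your own account)
repo_name = "your_name/OrpoLlama-3-8B_fine_tune_trl"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForCausalLM.from_pretrained(
    repo_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Build a ChatML prompt with the chat template configured by setup_chat_format()
messages = [{"role": "user", "content": "What is ORPO fine-tuning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
```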
<!-- Provide a longer summary of what this model is. -->
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
[More Information Needed]
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[More Information Needed]
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
#### Summary
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]