alvarobartt's picture
alvarobartt HF staff
Duplicate from alvarobartt/Mistral-7B-v0.1-ORPO
cd15223 verified
metadata
license: apache-2.0
datasets:
  - alvarobartt/dpo-mix-7k-simplified
  - argilla/dpo-mix-7k
base_model: mistralai/Mistral-7B-v0.1
language:
  - en
library_name: peft
pipeline_tag: text-generation
inference: false
tags:
  - orpo
  - qlora
  - trl

ORPO fine-tune of Mistral 7B v0.1 with DPO Mix 7K

image/jpeg

Stable Diffusion XL "A capybara, a killer whale, and a robot named Ultra being friends"

This is an ORPO fine-tune of mistralai/Mistral-7B-v0.1 with alvarobartt/dpo-mix-7k-simplified.

โš ๏ธ Note that the code is still experimental, as the ORPOTrainer PR is still not merged, follow its progress at ๐Ÿค—trl - ORPOTrainer PR.

About the fine-tuning

In order to fine-tune mistralai/Mistral-7B-v0.1 using ORPO, the branch orpo from ๐Ÿค—trl has been used, thanks to the invaluable and quick contribution of @kashif.

ORPO stands for Odds Ratio Preference Optimization, and defines a new paradigm on fine-tuning LLMs, โ€œcombiningโ€ both the SFT and the PPO/DPO stage into a single stage, thanks to the proposed loss function starting off from a preference dataset i.e. chosen-rejected pairs.

Some key features about ORPO:

  • โšก๏ธ Faster to train as itโ€™s now a single stage fine-tuning
  • ๐Ÿ‘จ๐Ÿปโ€๐Ÿซ Requires preference data i.e. (prompt, chosen, rejected)-like datasets
  • โฌ‡๏ธ Less memory than PPO/DPO as doesnโ€™t need a reference model
  • ๐Ÿ† SOTA results for Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) when fine-tuned using single-turn UltraFeedback

Some notes on the experiments mentioned in the paper:

  • ๐Ÿ“Œ Up to 7B parameter LLMs were fine-tuned, achieving better performance compared to 7B counterparts and even 13B LLMs
  • ๐Ÿ“Œ Not yet trained with multi-turn datasets as Capybara (may be an interesting experiment to run)
  • ๐Ÿ“Œ For OPT models fine-tuned with HH-RLHF from Anthropic, truncated and padded to 1024 tokens, filtering out filtering the prompts with > 1024 tokens
  • ๐Ÿ“Œ For Phi-2, Mistral (7B) and Llama 2 (7B), or UltraFeedback from OpenBMB (truncated and padded to 2048 tokens), filtering out filtering the prompts with > 1024 tokens
  • ๐Ÿ“Œ Fine-tuned for 10 epochs, and using the evaluation loss as the metric for selecting the best models

For more information about ORPO, I highly recommend reading their paper titled ORPO: Monolithic Preference Optimization without Reference Model, as it contains a lot of information and details not only on the ORPO method, but also on the experiment they ran, the results they got, and much more.

๐Ÿ“… Fine-tuning code will be shared soon, stay tuned!

About the dataset

The dataset used for this fine-tune is alvarobartt/dpo-mix-7k-simplified, which is a simplified version of argilla/dpo-mix-7k.

The simplification comes from the fact that the prompt column is detached from both the chosen and rejected columns so that there's no need for extra pre-processing while applying the chat template to the dataset before the fine-tuning. So on, the dataset remains as is, with an additional column for the prompt.

The dataset is a small cocktail combining Argilla's latest efforts on DPO datasets, mixing the following datasets:

The samples have been randomly selected from the original datasets with a proportion of 0.33 each, as can be seen via the dataset column of the dataset.

For more information about the original dataset check the README.md file of argilla/dpo-mix-7k.