---
license: apache-2.0
datasets:
  - argilla/distilabel-intel-orca-dpo-pairs
language:
  - en
tags:
  - distilabel
  - dpo
  - rlaif
  - rlhf
---

# ⚗️ distilabeled OpenHermes 2.5 Mistral 7B

A NeuralHermes-style DPO fine-tune of OpenHermes 2.5: high-quality data matters for DPO!

*Built with Distilabel*

## Introduction

This model is the virtual launching partner of our new open dataset argilla/distilabel-intel-orca-dpo-pairs. It is a DPO fine-tune of OpenHermes-2.5-Mistral-7B that outperforms the awesome mlabonne/NeuralHermes-2.5-Mistral-7B using the exact same DPO recipe, but with our new orca-pairs dataset.

The dataset is a "distilabeled" version of the widely used dataset Intel/orca_dpo_pairs. The original dataset has been used by hundreds of open source practitioners and models. We knew from fixing UltraFeedback (and, before that, Alpaca and Dolly) that this dataset could be significantly improved.

Continuing our mission to build the best alignment datasets for open source LLMs and the community, we spent a few hours improving it with distilabel.

The main intuition was this: the original dataset simply assumes that the gpt-4/3.5-turbo responses are always the best. We know from UltraFeedback that this is not always the case. Moreover, DPO fine-tuning benefits from diversity in the preference pairs.

This is what it took to build a real preference dataset with distilabel:

```python
import random

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import JudgeLMTask

random.seed(42)

def shuffle_and_track(chosen, rejected):
    """Randomly swap each pair while keeping track of the original order."""
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}

dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# this shuffles the pairs to mitigate positional bias in the judge
dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))

# we use our JudgeLM implementation to rate the original pairs
labeler = OpenAILLM(
    task=JudgeLMTask(),
    model="gpt-4-1106-preview",
    num_threads=16,
    max_new_tokens=512,
)

dataset = dataset.rename_columns({"question": "input"})

distipipe = Pipeline(labeller=labeler)

# this computes ratings and natural language critiques for each pair
ds = distipipe.generate(dataset=dataset, num_generations=2)
```

The resulting dataset is now much more useful: we know which response is preferred (by gpt-4-turbo), which ones have low scores, and we even have natural language explanations. But what did we find? Was our intuition confirmed?

*[Chart: judge verdicts on the original pairs (ties, unchanged, swapped)]*

The above chart shows the following:

- ~4,000 pairs were given the same rating (a tie).
- ~7,000 pairs were correct according to our AI judge (unchanged).
- ~2,000 times the rejected response was preferred instead (swapped).
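For reference, this tie/unchanged/swapped breakdown can be recomputed from the released dataset's `status` column with a few lines of Python. This is a minimal sketch; the helper name is ours, and the toy status values simply mirror the labels above:

```python
from collections import Counter

def summarize_status(statuses):
    """Count how often the judge tied, confirmed, or swapped each pair."""
    return Counter(statuses)

# toy example; on the real dataset you would pass ds["status"]
print(summarize_status(["unchanged", "tie", "swapped", "unchanged"]))
```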

Now the next question is: can we build better models with this new knowledge? The answer is "distilabeled Hermes" so let's get back to the model!

If you love datasets as much as we do, check the dataset and share it with your friends and colleagues.

## Training details

As we did with Notus, we wanted a reproducible recipe to test the impact of data quality.

And we're lucky to have so many amazing folks in the open community contributing reproducible, easy-to-use training scripts and recipes. This time, Maxime Labonne had shared a Colab notebook to fine-tune OpenHermes with DPO and the original Intel dataset. Perfect! (Funnily enough, this exact recipe was recently used to fine-tune the top-ranked 7B model.)

And that's all for the model part: we reused a good, reproducible recipe.

Once we had created the dataset, the training-data part is also kind of boring: we simply filtered the samples based on our intuition, with the goal of reducing the dataset size:

- Ties probably won't help DPO tuning learn anything meaningful: both responses are similarly good or bad (filter out ties).
- Very good chosen responses should steer the model toward generating good responses (keep only pairs with a chosen score >= 8).

Additionally, we did some "decontamination" of gsm8k prompts (removing the few prompts that were also present in the train split of gsm8k).
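A flag like `in_gsm8k_train` can be computed with a simple exact-match check against the gsm8k train questions. The sketch below is only illustrative: the function name and the normalization strategy are our assumptions, not necessarily what was actually run:

```python
def flag_contaminated(prompts, gsm8k_train_questions):
    """Mark prompts that appear verbatim (after light normalization)
    in the gsm8k train split."""
    train_set = {q.strip().lower() for q in gsm8k_train_questions}
    return [p.strip().lower() in train_set for p in prompts]

# toy example; on the real data you would load the gsm8k train split
print(flag_contaminated(["What is 2+2?", "Name a color."], ["what is 2+2?"]))
```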

In code, using our new dataset this translates into:

```python
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# we did this
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# drop ties, keep high-scoring chosen responses, and decontaminate gsm8k
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie"
        and r["chosen_score"] >= 8
        and not r["in_gsm8k_train"]
)
```

This resulted in 5,922 samples instead of 12,859 (a 54% reduction), and we ran training for 200 steps (using around ~3.2K samples).
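As a quick sanity check on the numbers above:

```python
original, filtered = 12_859, 5_922
reduction = 1 - filtered / original
print(f"{reduction:.0%}")  # ~54% of the samples were filtered out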

## Benchmark results

For benchmarking we used the well-known "Nous" (or "Teknium") benchmark suite. Below you can find an overview, including our first experiment with a less ambitious dataset filtering (removing ties and keeping scores > 5).

For running the benchmark we used another awesome contribution from Maxime: LLM AutoEval, check it out!

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| argilla/distilabeled-Hermes-2.5-Mistral-7B | 44.64 | 73.35 | 55.96 | 42.21 | 54.04 |
| dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel (first experiment) | 44.27 | 73.3 | 56.26 | 42.25 | 54.02 |
| mlabonne/NeuralHermes-2.5-Mistral-7B (original recipe) | 43.67 | 73.24 | 55.37 | 41.76 | 53.51 |
| teknium/OpenHermes-2.5-Mistral-7B | 42.75 | 72.99 | 52.99 | 40.94 | 52.42 |
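The Average column is just the mean of the four benchmark scores; for example, for our model:

```python
scores = [44.64, 73.35, 55.96, 42.21]  # AGIEval, GPT4All, TruthfulQA, Bigbench
average = sum(scores) / len(scores)
print(round(average, 2))  # 54.04
```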

Update: we now include lm-eval harness results too!

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|
| argilla/distilabeled-Hermes-2.5-Mistral-7B | 66.04 | 85.07 | Pending | 55.96 | 79.56 | 66.34 |
| dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel | 65.36 | 84.74 | Pending | 56.26 | 79.24 | 65.13 |
| mlabonne/NeuralHermes-2.5-Mistral-7B | 66.55 | 84.90 | 63.32 | 54.93 | 78.30 | 61.30 |
| teknium/OpenHermes-2.5-Mistral-7B | 64.93 | 84.18 | 63.64 | 52.24 | 78.06 | 26.08 |

## Training Hardware

We used 1 x A100 40GB on RunPod for less than 1 hour.

## Acknowledgements

We'd like to thank the amazing open community and in particular:

- The Intel team for publishing a great open dataset and showing how well it worked in the first place.
- Teknium and NousResearch for their awesome work and models.
- Maxime for sharing such great resources.