⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2
In this post, I'll show you how to build high-quality datasets for fine-tuning large language models (LLMs) using distilabel and Prometheus 2. Prometheus 2 is an open-source model designed for evaluating LLM generations, providing a cost-effective alternative to GPT-4. This powerful combination allows us to distil both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) datasets efficiently and transparently.
Previously, closed models like GPT-4 were necessary for reliable AI Feedback (AIF) to judge the quality of responses for preference tuning. With Prometheus 2, an open-source model, we can now perform this task more cost-effectively and transparently, setting us up for fully open data generation pipelines.
This post will go through two common synthetic dataset projects using Prometheus 2. First, distilling an SFT dataset by removing low-quality samples based on Prometheus 2 evaluations. Second, expanding an SFT dataset into a DPO one by generating and evaluating additional responses. You could use these pipelines in sequence or separately, and combine them with other datasets.
🍟 Daniel Vila shared this article last month using distilabel, Llama 3, and UltraFeedback. Here, I will expand on that pipeline, using Prometheus 2 as the judge instead of Llama 3 with UltraFeedback.
UltraFeedback vs Prometheus 2
UltraFeedback and Prometheus 2 are both methods for evaluating language model outputs, but they differ significantly in approach and implementation. UltraFeedback, developed by OpenBMB, relies on a generic high-quality teacher model, typically GPT-4. It focuses on multiple aspects such as instruction-following, truthfulness, honesty, and helpfulness, and generates individual ratings for each. Prometheus 2, on the other hand, is an open-source model fine-tuned on evaluation data generated by GPT-4, and it serves as an alternative to GPT-4 for fine-grained evaluations. It employs weight-merging techniques to support both absolute grading (direct assessment of a single response) and relative grading (pairwise ranking), making it versatile for different evaluation needs.
1. Distilling an SFT Dataset
First, we’ll clean up the SFT dataset by removing low-quality samples.
A Supervised Fine-Tuning (SFT) dataset is a collection of paired instructions and responses used to refine a pretrained language model's performance on specific tasks. This process involves taking a general-purpose model and training it on a targeted dataset where each instruction (or prompt) has a corresponding ideal response. The goal of SFT is to ensure that the model can generate accurate, relevant, and high-quality outputs when given similar prompts in the future. By fine-tuning the model with this curated data, we can significantly enhance its ability to perform well on desired applications, making the output more aligned with specific requirements and use cases.
Ingredients
- Dataset with prompts and responses: Use openbmb/UltraInteract_sft, a dataset containing high-quality prompts curated by the community.
- Model to judge response quality: Prometheus 2 will evaluate the responses, providing a reliable, open-source alternative to closed models like GPT-4.
Steps
Let’s walk through each step of the pipeline to understand what it’s doing. Below, I’ll share end-to-end examples that you can copy-paste into your own projects.
Step 1: Load Dataset
Start by loading the source data using distilabel. Begin with a small sample to ensure everything works before scaling up. Here, I use openbmb/UltraInteract_sft from OpenBMB, but you could start from any dataset with instruction-response pairs. You can use the output_mappings parameter if the column names differ.
# Load a small sample of UltraInteract while prototyping; increase
# num_examples once the pipeline works end to end.
load_dataset = LoadHubDataset(
    name="load_dataset",
    repo_id="openbmb/UltraInteract_sft",
    split="train",
    batch_size=5,
    num_examples=100,
)
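If your source dataset uses different column names, you can remap them at load time. The snippet below is a hypothetical sketch: the repo id and the "question"/"answer" column names are made up, and it assumes distilabel's output_mappings attribute maps source columns onto the "instruction" and "generation" columns used later in the pipeline.
# Hypothetical example: remap differently named source columns to the
# names the rest of the pipeline expects.
load_dataset = LoadHubDataset(
    name="load_dataset",
    repo_id="your-org/your-sft-dataset",  # placeholder dataset with different column names
    split="train",
    batch_size=5,
    num_examples=100,
    output_mappings={"question": "instruction", "answer": "generation"},
)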
Step 2: Judge Responses with Prometheus 2
Rate the quality of responses using Prometheus 2. We will load the Prometheus evaluation task and prompts through distilabel's integration, PrometheusEval, and the model weights with vLLM. Since we are only evaluating one sample at a time, we will use the absolute mode, as opposed to relative, and we will focus on the factual-validity rubric. We could also supply multiple rubrics here. Check out the Prometheus 2 repo for details on modes and rubrics.
# Run the Prometheus 2 7B judge locally via vLLM, grading each response
# on its own (absolute mode) against the factual-validity rubric.
prometheus = PrometheusEval(
    name="prometheus",
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="factual-validity",
    reference=False,
    num_generations=1,
    group_generations=False,
)
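The evaluation step adds a feedback column and a result column to each row. To actually distil the dataset, filter out the low-scoring rows once the pipeline has run. Here is a minimal post-processing sketch, assuming the absolute-mode result is an integer score from 1 to 5 and that the scored dataset has been pushed to a (hypothetical) Hub repo:
from datasets import load_dataset

# Keep only the samples Prometheus 2 scored 4 or higher (the threshold is up to you).
scored = load_dataset("your-org/ultrainteract-prometheus-scored", split="train")
high_quality = scored.filter(lambda row: row["result"] is not None and row["result"] >= 4)
print(f"Kept {len(high_quality)} of {len(scored)} samples")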
2. Building a DPO Dataset from an SFT Dataset
The next pipeline focuses on creating a Direct Preference Optimization (DPO) dataset by generating and evaluating additional responses.
A Direct Preference Optimization (DPO) dataset is designed to train language models by providing them with explicit preferences between different responses to the same instruction. It consists of an instruction followed by two responses: one 'chosen' or ideal, and the other 'rejected'. This setup allows the model to learn directly from human preferences, optimizing its output to align better with desired responses. Unlike traditional reinforcement learning methods that require complex reward models, DPO simplifies the process by treating preference learning as a straightforward classification problem, thus making it more stable and efficient.
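To make the format concrete, here is an illustrative (entirely made-up) DPO record:
# A single DPO sample: one instruction with a preferred and a dispreferred response.
dpo_sample = {
    "instruction": "Explain why the sky is blue.",
    "chosen": "The sky looks blue because air molecules scatter shorter (blue) wavelengths of sunlight more strongly than longer ones.",
    "rejected": "The sky is blue because it reflects the color of the ocean.",
}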
Ingredients
- Initial SFT dataset: This serves as the base, containing instruction-response pairs.
- Model to generate additional responses: Llama3 models (8B and 70B instruct versions) are used to generate the extra responses needed for DPO datasets.
- Model to judge response quality: Prometheus 2 will evaluate the responses, determining which is the 'chosen' and which is the 'rejected' response.
Steps
Step 1: Load Dataset
Like with SFT distillation, we will load a dataset using distilabel's integration. In fact, you could start from any dataset with a prompt column, because we are going to generate multiple responses ourselves.
# Load just a few prompts while prototyping the DPO pipeline.
load_dataset = LoadHubDataset(
    name="load_dataset",
    repo_id="openbmb/UltraInteract_sft",
    split="train",
    batch_size=3,
    num_examples=3,
)
Step 2: Generate Responses
This step takes the prompts from our load_dataset step and generates responses using Llama 3 models. It is common to generate responses with multiple models of differing expected quality to populate both the chosen and rejected responses; Prometheus 2 will judge their quality. Note that we could also reuse the responses already in the SFT dataset, but I’ll include this example so the pipeline can be applied to more datasets.
Inference is performed using Hugging Face Inference Endpoints. To make extensive use of the serverless Inference Endpoints on the Hugging Face Hub, subscribing to PRO is recommended (see [pricing](https://huggingface.co/pricing)), since [Inference for PROs](https://huggingface.co/blog/inference-pro) gives you higher rate limits than the free Inference API.
# Generate candidate responses with two Llama 3 models of different sizes,
# so the dataset contains responses of varying expected quality.
generate_with_llama3_70B = TextGeneration(
    name="generate_with_llama3_70B",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

generate_with_llama3_8B = TextGeneration(
    name="generate_with_llama3_8B",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
)
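Before running the generation steps, make sure you are authenticated with the Hugging Face Hub so the serverless endpoints accept your requests. A minimal sketch, assuming you either export an HF_TOKEN environment variable or log in programmatically:
from huggingface_hub import login

# Authenticate once per environment with a token from your Hub settings page.
login(token="hf_...")  # replace with your own access token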
Step 3: Combine Columns
Prepare input for the Prometheus evaluation by combining generations from multiple models into a single column.
# Merge the outputs of both generation steps into single list columns so
# Prometheus can compare them side by side.
combine_columns = CombineColumns(
    name="combine_columns",
    columns=["generation", "model_name"],
    output_columns=["generations", "generation_models"],
)
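After this step, each row holds both generations side by side, roughly like this (illustrative values):
# Shape of a combined row after CombineColumns (values are placeholders).
row = {
    "instruction": "Explain why the sky is blue.",
    "generations": ["<response from Llama 3 70B>", "<response from Llama 3 8B>"],
    "generation_models": [
        "meta-llama/Meta-Llama-3-70B-Instruct",
        "meta-llama/Meta-Llama-3-8B-Instruct",
    ],
}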
Step 4: Prometheus Evaluation
Rate the quality of responses using Prometheus 2. Once again, we will load the Prometheus evaluation task and prompts through distilabel's integration, PrometheusEval. For DPO, we will use the relative mode because we are comparing responses. Check out the Prometheus 2 repo for details on modes and rubrics.
# Judge the two candidate responses head to head (relative mode) against
# the factual-validity rubric, again running the Prometheus 2 judge with vLLM.
prometheus = PrometheusEval(
    name="prometheus",
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="relative",
    rubric="factual-validity",
    reference=False,
    num_generations=1,
    group_generations=False,
)
Step 5: Keep Columns
Retain the necessary columns for the final dataset.
keep_columns = KeepColumns(
name="keep_columns",
columns=["instruction", "generations", "feedback", "result", "model_name"]
)
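From here you can turn the relative judgement into explicit chosen/rejected pairs. The sketch below assumes Prometheus 2's relative mode writes "A" or "B" to the result column, pointing at the winning entry in generations; verify the exact output format in the distilabel docs for your version.
def to_preference(row):
    # Map the relative verdict onto chosen/rejected responses.
    winner = 0 if row["result"] == "A" else 1
    return {
        "instruction": row["instruction"],
        "chosen": row["generations"][winner],
        "rejected": row["generations"][1 - winner],
    }

# e.g. with a datasets.Dataset: dpo_ready = scored_dataset.map(to_preference)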
Next Steps
Having walked through the process of distilling an SFT dataset and expanding it into a DPO dataset using Prometheus 2 and distilabel, there are several exciting directions to explore next. First, you can experiment with different rubrics and evaluation modes within Prometheus 2 to see how they affect the quality and performance of your datasets. Additionally, consider scaling up your datasets and running more extensive evaluations to further validate the improvements. Another valuable step would be to integrate human feedback, leveraging platforms like Argilla to review and improve the model outputs.
Complete Pipeline Examples
Here are the end-to-end examples for running these pipelines; refer to the distilabel documentation for further guidance.
SFT Pipeline
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadHubDataset, CombineColumns
from distilabel.steps.tasks import PrometheusEval, TextGeneration
from distilabel.llms import vLLM
with Pipeline(name="prometheus-SFT") as pipeline:
load_dataset = LoadHubDataset(
name="load_dataset",
repo_id="openbmb/UltraInteract_sft",
split="train",
batch_size=5,
num_examples=3,
)
prometheus = PrometheusEval(
name="prometheus",
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="factual-validity",
reference=False,
num_generations=1,
group_generations=False,
)
keep_columns = KeepColumns(
name="keep_columns",
columns=["instruction", "generation", "result", "model_name", "feedback"],
)
load_dataset.connect(prometheus)
prometheus.connect(keep_columns)
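To run the pipeline and publish the result, something like the following should work; the push_to_hub call and repo id are assumptions based on distilabel's Distiset API, so check the docs for your version:
if __name__ == "__main__":
    # Execute the pipeline and push the scored dataset to the Hub.
    distiset = pipeline.run()
    distiset.push_to_hub("your-org/ultrainteract-prometheus-sft")  # hypothetical repo id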
DPO Pipeline
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadHubDataset, CombineColumns
from distilabel.steps.tasks import PrometheusEval, TextGeneration
from distilabel.llms import InferenceEndpointsLLM, vLLM
with Pipeline(name="prometheus-DPO") as pipeline:
load_dataset = LoadHubDataset(
name="load_dataset",
repo_id="openbmb/UltraInteract_sft",
split="train",
batch_size=3,
num_examples=3,
)
generate_with_llama3_70B = TextGeneration(
name="generate_with_llama3_70B",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
)
generate_with_llama3_8B = TextGeneration(
name="generate_with_llama3_8B",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
),
)
combine_columns = CombineColumns(
name="combine_columns",
columns=["generation", "model_name"],
output_columns=["generations", "generation_models"],
)
prometheus = PrometheusEval(
name="prometheus",
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]",
),
mode="relative",
rubric="factual-validity",
reference=False,
num_generations=1,
group_generations=False,
)
keep_columns = KeepColumns(
name="keep_columns",
columns=["instruction", "generations", "feedback", "result", "model_name"],
)
    # Push the resulting preference pairs to Argilla for human review.
    push_to_argilla = DPOToArgilla(
        name="push_to_argilla",
    )
load_dataset.connect(combine_columns)
load_dataset.connect(generate_with_llama3_70B)
load_dataset.connect(generate_with_llama3_8B)
generate_with_llama3_70B.connect(combine_columns)
generate_with_llama3_8B.connect(combine_columns)
combine_columns.connect(prometheus)
    prometheus.connect(keep_columns)
    keep_columns.connect(push_to_argilla)
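The DPO pipeline is run the same way as the SFT one; again, the repo id is a placeholder:
if __name__ == "__main__":
    # Execute the pipeline and push the preference dataset to the Hub.
    distiset = pipeline.run()
    distiset.push_to_hub("your-org/ultrainteract-prometheus-dpo")  # hypothetical repo id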