The Beginner's Guide to Cleaning a Dataset
Datasets often arrive raw and unfiltered, not quite ready for training or other downstream tasks. Cleaning is a very common step in getting a dataset ready for training.
Microsoft Orca Agent Instruct Dataset
One example is Maxime Labonne's cleaned version of the Microsoft Orca Agent Instruct Dataset, which can be found at mlabonne/orca-agentinstruct-1M-v1-cleaned.
The Microsoft Orca Agent Instruct Dataset contains ~1M instruction examples across a wide range of tasks, including text editing, creative writing, coding, reading comprehension, and brain teasers.
Cleaning Steps
In the dataset card, Maxime outlined the cleaning process:
Here's what I changed:
- Splits are unified into one, with a new "split" column
- Strings were converted into lists of dicts to ensure compatibility with most frameworks
- Empty system prompts were removed so you don't get weird errors
In this post, we will walk through how to do this easily without leaving the dataset page. We will break down the steps above and iteratively write a bit of SQL to accomplish them via the SQL Console.
Open the SQL Console
To start, click the SQL Console button on the dataset page, which opens the SQL Console and automatically loads a subset and split.
There are a couple of things you will notice. First, the header shows 1/15 splits loaded, which indicates that only the creative_content split has been loaded and there are 14 other unloaded views. We can simply click on the other splits to automatically load them into DuckDB so we can query them.
1. Converting strings into STRUCTs
Converting strings into lists of dictionaries is fairly simple. We know that the messages field is a list of conversation messages with the keys role and content. We can represent that in DuckDB as a STRUCT and use CAST to convert the VARCHAR type to a list of STRUCTs.
CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]) AS messages
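To see what this cast produces, we can run it as part of a full query. Here is a quick sketch, assuming the creative_content view is already loaded in the console:

```sql
-- preview the cast on a few rows of the loaded split
SELECT
  CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]) AS messages
FROM creative_content
LIMIT 5;
```

Each row's messages column now comes back as a proper list of structs instead of one long string.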
2. Removing empty system prompts
Next, let's filter out messages that are empty so the conversation starts with the user message.
Since messages is a LIST type, we can use the handy DuckDB list function: list_filter to filter out the empty messages.
We can start by expanding on our code above and using list_filter:
-- filter out messages whose content is NULL or an empty string
list_filter(messages, x -> x.content != '' AND x.content IS NOT NULL) AS messages
This will filter out all empty messages. If we wanted to be more specific and filter out only empty system prompts, we could update the lambda to do so.
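For example, one way to tighten the lambda so it drops only system prompts with empty content (a sketch, not part of the original cleaning script) would be:

```sql
-- keep every message except system prompts with NULL or empty content
list_filter(
  messages,
  x -> NOT (x.role = 'system' AND (x.content IS NULL OR x.content = ''))
) AS messages
```

This keeps empty user or assistant turns intact while still removing blank system prompts.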
3. Joining splits together
Now that we are converting messages from strings to lists of structs and removing empty messages, let's combine two splits and add a column indicating which split each row came from.
Let's start by:
- Loading all the splits we want to combine as views
- Using the SQL UNION operator to combine the results of all the splits
We will combine just two splits here, since the full dataset would be too big to process in the browser. But for smaller datasets, the SQL Console is very capable of combining entire datasets!
WITH joined_data AS (
  SELECT *, 'creative_content' AS split FROM creative_content
  UNION ALL
  SELECT *, 'text_modification' AS split FROM text_modification
)
SELECT * FROM joined_data
Here, UNION ALL appends the rows of the two splits together without deduplicating them.
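As a quick sanity check, we can count how many rows each split contributes. This sketch restates the CTE so it runs on its own:

```sql
-- count rows contributed by each split
WITH joined_data AS (
  SELECT *, 'creative_content' AS split FROM creative_content
  UNION ALL
  SELECT *, 'text_modification' AS split FROM text_modification
)
SELECT split, COUNT(*) AS n_rows
FROM joined_data
GROUP BY split;
```

The two counts should match the row counts of the individual splits.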
Finally, we can add the other pieces to get a final cleaned joined result:
-- combine the splits together
WITH joined_data AS (
  SELECT *, 'creative_content' AS split FROM creative_content
  UNION ALL
  SELECT *, 'text_modification' AS split FROM text_modification
)
SELECT
  list_filter(
    CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]), -- cast string to list of structs
    x -> x.content != '' AND x.content IS NOT NULL -- filter out empty messages
  ) AS messages,
  split
FROM joined_data
Now we have a cleaned dataset that we can download and use!
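If you run the same query in a local DuckDB session instead of the browser console, you can also write the cleaned result straight to a Parquet file with DuckDB's COPY statement. The output file name here is just an example:

```sql
-- export the cleaned, combined result to Parquet (local DuckDB only)
COPY (
  WITH joined_data AS (
    SELECT *, 'creative_content' AS split FROM creative_content
    UNION ALL
    SELECT *, 'text_modification' AS split FROM text_modification
  )
  SELECT
    list_filter(
      CAST(messages AS STRUCT(role VARCHAR, content VARCHAR)[]),
      x -> x.content != '' AND x.content IS NOT NULL
    ) AS messages,
    split
  FROM joined_data
) TO 'orca_cleaned.parquet' (FORMAT PARQUET);
```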