Hugging Face Hub
Now that you are all setup, the first step is to load a dataset. The easiest way to load a dataset is from the Hugging Face Hub. There are already over 900 datasets in over 100 languages on the Hub. Choose from a wide category of datasets to use for NLP tasks like question answering, summarization, machine translation, and language modeling. For a more in-depth look inside a dataset, use the live Datasets Viewer.
Load a dataset
Before you take the time to download a dataset, it is often helpful to quickly get all the relevant information about a dataset. The load_dataset_builder() method allows you to inspect the attributes of a dataset without downloading it.
>>> from datasets import load_dataset_builder
>>> dataset_builder = load_dataset_builder('imdb')
>>> print(dataset_builder.cache_dir)
/Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676
>>> print(dataset_builder.info.features)
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
>>> print(dataset_builder.info.splits)
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}
Take a look at DatasetInfo for a full list of attributes you can use with dataset_builder
.
Once you are happy with the dataset you want, load it in a single line with load_dataset():
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
Select a configuration
Some datasets, like the General Language Understanding Evaluation (GLUE) benchmark, are actually made up of several datasets. These sub-datasets are called configurations, and you must explicitly select one when you load the dataset. If you donβt provide a configuration name, π€ Datasets will raise a ValueError
and remind you to select a configuration.
Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:
from datasets import get_dataset_config_names
configs = get_dataset_config_names("glue")
print(configs)
# ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
β Incorrect way to load a configuration:
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue')
ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
*load_dataset('glue', 'cola')*
β Correct way to load a configuration:
>>> dataset = load_dataset('glue', 'sst2')
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
>>> print(dataset)
{'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349),
'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872),
'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821)
}
Select a split
A split is a specific subset of the dataset like train
and test
. Make sure you select a split when you load a dataset. If you donβt supply a split
argument, π€ Datasets will only return a dictionary containing the subsets of the dataset.
>>> from datasets import load_dataset
>>> datasets = load_dataset('glue', 'mrpc')
>>> print(datasets)
{train: Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 3668
})
validation: Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 408
})
test: Dataset({
features: ['idx', 'label', 'sentence1', 'sentence2'],
num_rows: 1725
})
}
You can list the split names for a dataset, or a specific configuration, with the get_dataset_split_names() method:
>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names('sent_comp')
['validation', 'train']
>>> get_dataset_split_names('glue', 'cola')
['test', 'train', 'validation']