Dataset viewer documentation

Data types

Datasets supported by the dataset viewer have a tabular format: each data point is a row, and its features are stored in columns. The /first-rows endpoint lets you preview the first 100 rows of a dataset along with information about each feature. Each entry under the features key has a type object that contains a _type field. This value names the data type of the column; these types are known as the dataset’s Features.
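As a rough sketch of how such a request might be made from Python, the snippet below builds a /first-rows URL and fetches the JSON with the standard library only. The base URL https://datasets-server.huggingface.co, the query parameter names (dataset, config, split), and the helper names first_rows_url and fetch_first_rows are assumptions made for illustration; none of them are stated on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public base URL of the dataset viewer API (not stated on this page).
BASE_URL = "https://datasets-server.huggingface.co"


def first_rows_url(dataset, config, split):
    """Build the /first-rows URL for one dataset, config, and split.

    Assumption: the endpoint takes `dataset`, `config`, and `split`
    as query parameters.
    """
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}/first-rows?{query}"


def fetch_first_rows(dataset, config, split):
    """Fetch and decode the JSON response (performs a network request)."""
    with urlopen(first_rows_url(dataset, config, split), timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access.
url = first_rows_url("cornell-movie-review-data/rotten_tomatoes", "default", "train")
```

Calling fetch_first_rows with the same three arguments should return a dictionary whose features key holds one entry per column, provided the assumptions above hold.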

Different Features represent different data formats, such as Audio for speech data and Image for image data. Knowing a column’s feature type gives you a better understanding of the data you’re working with and how to preprocess it.

For example, the /first-rows endpoint for the Rotten Tomatoes dataset returns the following:

{"dataset": "cornell-movie-review-data/rotten_tomatoes",
 "config": "default",
 "split": "train",
 "features": [{"feature_idx": 0,
   "name": "text",
   "type": {"dtype": "string",
    "id": null,
    "_type": "Value"}},
  {"feature_idx": 1,
   "name": "label",
   "type": {"num_classes": 2,
    "names": ["neg", "pos"],
    "id": null,
    "_type": "ClassLabel"}}],
 ...
}

This dataset has two columns, text and label:

  • The text column has a type of Value. The Value type is extremely versatile and represents scalar values such as strings, integers, dates, and even timestamp values.

  • The label column has a type of ClassLabel. The ClassLabel type stores the number of classes in a dataset (num_classes) and their label names (names). Naturally, this means you’ll frequently see ClassLabel used in classification datasets.
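The two column types above can be read programmatically. The following is a minimal sketch that works on the features list from the Rotten Tomatoes response, copied in as a Python literal so it runs without a network call. The helper names describe_feature and label_name are hypothetical, and the mapping from an integer class id to its position in names is an assumption about how ClassLabel values are encoded.

```python
# The "features" list from the Rotten Tomatoes /first-rows response,
# written as a Python literal (JSON null becomes None).
features = [
    {"feature_idx": 0,
     "name": "text",
     "type": {"dtype": "string", "id": None, "_type": "Value"}},
    {"feature_idx": 1,
     "name": "label",
     "type": {"num_classes": 2, "names": ["neg", "pos"], "id": None,
              "_type": "ClassLabel"}},
]


def describe_feature(feature):
    """Return a one-line summary of a column based on its _type field."""
    ftype = feature["type"]
    kind = ftype["_type"]
    if kind == "Value":
        # Value columns carry their scalar type in "dtype".
        return f"{feature['name']}: Value({ftype['dtype']})"
    if kind == "ClassLabel":
        # ClassLabel columns carry their label names in "names".
        return f"{feature['name']}: ClassLabel({', '.join(ftype['names'])})"
    # Other feature types (for example Audio or Image) are reported by name only.
    return f"{feature['name']}: {kind}"


def label_name(feature, class_id):
    """Map an integer class id to its label name.

    Assumption: class ids are positions in the "names" list.
    """
    return feature["type"]["names"][class_id]


summaries = [describe_feature(f) for f in features]
for line in summaries:
    print(line)
```

Branching on _type in this way keeps type-specific handling in one place, and unknown types fall through to a generic summary instead of raising an error.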

For a complete list of available data types, take a look at the Features documentation.
