Data types
Datasets supported by the dataset viewer have a tabular format, meaning each data point is represented as a row and its features are contained in columns. The /first-rows endpoint lets you preview the first 100 rows of a dataset along with information about each feature. Within the features key, you’ll notice each entry returns a _type field. This value describes the data type of the column, and it is also known as a dataset’s Features.
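As a minimal sketch of how you might query this endpoint from Python, the snippet below builds the request URL and fetches the JSON response using only the standard library. The base URL and the helper names first_rows_url and fetch_first_rows are assumptions made for illustration; the dataset, config, and split query parameters mirror the fields that appear in the response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public dataset viewer API.
BASE_URL = "https://datasets-server.huggingface.co"


def first_rows_url(dataset, config, split):
    """Build the /first-rows URL for one dataset, config, and split."""
    # urlencode percent-encodes the "/" in "namespace/dataset_name".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}/first-rows?{query}"


def fetch_first_rows(dataset, config, split, timeout=30):
    """Fetch and decode the JSON response (hypothetical helper, needs network access)."""
    with urlopen(first_rows_url(dataset, config, split), timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a real HTTP request):
# payload = fetch_first_rows("cornell-movie-review-data/rotten_tomatoes", "default", "train")
# print([feature["type"]["_type"] for feature in payload["features"]])
```

Keeping the URL construction separate from the request makes it easy to inspect the exact URL before sending anything over the network.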
There are several different data Features for representing different data formats, such as Audio and Image for speech and image data respectively. Knowing a column’s feature type gives you a better understanding of the data you’re working with and how you can preprocess it.
For example, the /first-rows endpoint for the Rotten Tomatoes dataset returns the following:
{
  "dataset": "cornell-movie-review-data/rotten_tomatoes",
  "config": "default",
  "split": "train",
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": {"dtype": "string", "id": null, "_type": "Value"}
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": {"num_classes": 2, "names": ["neg", "pos"], "id": null, "_type": "ClassLabel"}
    }
  ],
  ...
}
This dataset has two columns, text and label:
The text column has a type of Value. The Value type is extremely versatile and represents scalar values such as strings, integers, dates, and even timestamp values.
The label column has a type of ClassLabel. The ClassLabel type represents the number of classes in a dataset and their label names. Naturally, this means you’ll frequently see ClassLabel used in classification datasets.
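To make the two types concrete, here is a small sketch that walks a features list like the one in the Rotten Tomatoes response and summarizes each column. The FEATURES constant is copied from that response; describe_features and label_name are hypothetical helpers, and the sketch assumes that integer values in a ClassLabel column index into the names list.

```python
# Feature descriptions copied from the Rotten Tomatoes response.
FEATURES = [
    {"feature_idx": 0, "name": "text",
     "type": {"dtype": "string", "id": None, "_type": "Value"}},
    {"feature_idx": 1, "name": "label",
     "type": {"num_classes": 2, "names": ["neg", "pos"], "id": None, "_type": "ClassLabel"}},
]


def describe_features(features):
    """Return one human-readable summary line per column."""
    lines = []
    for feature in features:
        ftype = feature["type"]
        kind = ftype["_type"]
        if kind == "Value":
            detail = f"dtype={ftype['dtype']}"
        elif kind == "ClassLabel":
            detail = "names=" + ", ".join(ftype["names"])
        else:
            # Other types (Audio, Image, ...) are not handled in this sketch.
            detail = "see the Features documentation"
        lines.append(f"{feature['name']}: {kind} ({detail})")
    return lines


def label_name(features, column, class_index):
    """Map an integer class index to its label name for a ClassLabel column."""
    for feature in features:
        if feature["name"] == column and feature["type"]["_type"] == "ClassLabel":
            return feature["type"]["names"][class_index]
    raise KeyError(f"{column!r} is not a ClassLabel column")


for line in describe_features(FEATURES):
    print(line)
```

With the features above, label_name(FEATURES, "label", 1) resolves to the second entry in names, which is how you can turn the integer labels in the returned rows back into readable class names.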
For a complete list of available data types, take a look at the Features documentation.