Explore statistics over split data
The dataset viewer provides a /statistics endpoint for fetching some basic statistics precomputed for a requested dataset. This gives you quick insight into how the data is distributed.
The /statistics endpoint requires three query parameters:
- dataset: the dataset name, for example nyu-mll/glue
- config: the subset name, for example cola
- split: the split name, for example train
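The three parameters can also be assembled programmatically rather than typed into the URL by hand. Here is a minimal sketch, assuming the same base URL as the request example on this page; the helper name build_statistics_url is hypothetical and not part of any library:

```python
from urllib.parse import urlencode

# Base URL as used in the request example on this page.
BASE_URL = "https://datasets-server.huggingface.co/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset, subset and split."""
    params = {"dataset": dataset, "config": config, "split": split}
    # safe="/" keeps the slash in names such as "nyu-mll/glue" unescaped.
    return f"{BASE_URL}?{urlencode(params, safe='/')}"


url = build_statistics_url("nyu-mll/glue", "cola", "train")
print(url)
```

Keeping the parameters in a dictionary makes it easy to loop over several subsets or splits of the same dataset.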
Let’s get some stats for the nyu-mll/glue dataset, cola subset, train split:
import requests

# API_TOKEN must hold your Hugging Face access token, defined beforehand.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/statistics?dataset=nyu-mll/glue&config=cola&split=train"

def query():
    # Send the request and decode the JSON body of the response.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
The response JSON contains three keys:
- num_examples - number of samples in a split, or number of samples in the first chunk of data if the dataset is larger than 5 GB (see the partial field below).
- statistics - list of dictionaries of statistics, one per column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
- partial - true if statistics are computed on the first 5 GB of data, not on the full split; false otherwise.
{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [856, 856, 856, 856, 856, 856, 856, 856, 856, 847],
          "bin_edges": [0, 856, 1712, 2568, 3424, 4280, 5136, 5992, 6848, 7704, 8550]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
          "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
        }
      }
    }
  ],
  "partial": false
}
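To get a feel for the three top-level keys, you can walk over the response and print one line per column. This is a minimal sketch that works on a trimmed, hard-coded copy of the response above (histograms and most measures omitted), so it runs without network access; the function name summarize is hypothetical:

```python
# A trimmed copy of the response shown above (most measures omitted for brevity).
data = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int",
         "column_statistics": {"nan_count": 0, "min": 0, "max": 8550}},
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"nan_count": 0, "n_unique": 2}},
        {"column_name": "sentence", "column_type": "string_text",
         "column_statistics": {"nan_count": 0, "min": 6, "max": 231}},
    ],
    "partial": False,
}


def summarize(response: dict) -> list[str]:
    """Return one human-readable line per column of a /statistics response."""
    scope = "first 5 GB only" if response["partial"] else "full split"
    lines = [f"{response['num_examples']} examples ({scope})"]
    for column in response["statistics"]:
        stats = column["column_statistics"]
        lines.append(
            f"{column['column_name']}: {column['column_type']}, "
            f"{stats['nan_count']} null values"
        )
    return lines


for line in summarize(data):
    print(line)
```

With a live response, you would pass the dictionary returned by query() instead of the hard-coded sample.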
Response structure by data type
Currently, statistics are supported for strings, float and integer numbers, booleans, lists, audio and image data, and the special datasets.ClassLabel feature type of the datasets library.
column_type in the response can be one of the following values:
- class_label - for the datasets.ClassLabel feature, which represents categorical data
- float - for float data types
- int - for integer data types
- bool - for the boolean data type
- string_label - for string data types being treated as categories (see below)
- string_text - for string data types if they do not represent categories (see below)
- list - for lists of any other data types (including lists)
- audio - for audio data
- image - for image data
class_label
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:
- number and proportion of null values
- number and proportion of values with no label
- number of unique values (excluding null and no label)
- value counts for each label (excluding null and no label)
Example
{
  "column_name": "label",
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}
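The frequencies field holds raw counts, so class balance has to be derived from it. A minimal sketch, using the counts from the example above; the helper name label_proportions is hypothetical:

```python
def label_proportions(column_statistics: dict) -> dict:
    """Convert per-label counts into fractions of the labelled rows."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        # No labelled rows: nothing to normalize.
        return {}
    return {label: count / total for label, count in frequencies.items()}


# Counts copied from the class_label example above.
stats = {"n_unique": 2, "frequencies": {"unacceptable": 2528, "acceptable": 6023}}
proportions = label_proportions(stats)
print({label: round(share, 3) for label, share in proportions.items()})
```

Because null and no-label rows are excluded from frequencies, these fractions describe the labelled rows only.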
float
The following measures are returned for float data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null and NaN values (NaN values are treated as null)
- histogram with 10 bins
Example
{
  "column_name": "clarity",
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
      "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
    }
  }
}
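In every histogram, bin_edges has one more entry than hist, because n bins are delimited by n + 1 edges. The following sketch pairs each count with the two edges of its bin, using the float example above; the helper name histogram_bins is hypothetical:

```python
def histogram_bins(histogram: dict) -> list[tuple]:
    """Pair every count in "hist" with the lower and upper edge of its bin."""
    hist, edges = histogram["hist"], histogram["bin_edges"]
    if len(edges) != len(hist) + 1:
        raise ValueError("expected one more bin edge than bins")
    return [(edges[i], edges[i + 1], hist[i]) for i in range(len(hist))]


# Histogram copied from the float example above.
clarity = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}
bins = histogram_bins(clarity)
fullest = max(bins, key=lambda b: b[2])
print(f"fullest bin: {fullest[0]} to {fullest[1]} with {fullest[2]} values")
```

The same helper works for the int, string_text, list, audio and image histograms, since they share this layout.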
int
The following measures are returned for integer data types:
- minimum, maximum, mean, median, and standard deviation values
- number and proportion of null values
- histogram with at most 10 bins
Example
{
  "column_name": "direction",
  "column_type": "int",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 0,
    "max": 1,
    "mean": 0.49925,
    "median": 0.0,
    "std": 0.5,
    "histogram": {
      "hist": [50075, 49925],
      "bin_edges": [0, 1, 1]
    }
  }
}
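The integer bin edges shown on this page (for example 0, 1, 1 above, or 0, 856, ..., 7704, 8550 for the idx column) are consistent with a simple rule: use at most 10 equally sized integer bins, and close the last bin with the maximum value. The sketch below reproduces those edges. It is inferred from the example responses only and is an assumption, not the dataset viewer's actual implementation; the function name int_bin_edges is hypothetical:

```python
def int_bin_edges(minimum: int, maximum: int, max_bins: int = 10) -> list[int]:
    """Reproduce the integer bin edges seen in the examples (assumed rule)."""
    n_values = maximum - minimum + 1
    n_bins = min(max_bins, n_values)
    # Ceiling division: every bin covers the same number of integer values.
    bin_size = -(-n_values // n_bins)
    # Left edge of each bin, followed by the maximum as the closing edge.
    edges = list(range(minimum, maximum + 1, bin_size))
    edges.append(maximum)
    return edges


print(int_bin_edges(0, 1))
print(int_bin_edges(0, 8550))
```

This explains why a column with only two distinct values gets two bins rather than ten.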
bool
The following measures are returned for the bool data type:
- number and proportion of null values
- value counts for 'True' and 'False' values
Example
{
  "column_name": "penalty",
  "column_type": "bool",
  "column_statistics": {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {
      "False": 7,
      "True": 10
    }
  }
}
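The total number of rows is not reported for a bool column, but it can be recovered by adding the null count to the two value counts. A minimal sketch using the example above; the helper name bool_summary is hypothetical:

```python
def bool_summary(column_statistics: dict) -> dict:
    """Derive the row total and the share of True among non-null values."""
    frequencies = column_statistics["frequencies"]
    n_true = frequencies.get("True", 0)
    n_false = frequencies.get("False", 0)
    non_null = n_true + n_false
    return {
        "total_rows": non_null + column_statistics["nan_count"],
        # None when the column holds only null values.
        "true_share": n_true / non_null if non_null else None,
    }


# Values copied from the bool example above.
penalty = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}
summary = bool_summary(penalty)
print(summary["total_rows"])
```

In the example, 7 + 10 + 3 gives 20 rows, and 3 null values out of 20 matches the reported nan_proportion of 0.15.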
string_label
A string column is treated as a category if, within the requested split, the proportion of unique values is lower than or equal to 0.2 and the number of unique values is lower than 1000, or if the number of unique values is lower than or equal to 10 (regardless of the proportion). The following measures are returned:
- number and proportion of null values
- number of unique values (excluding null)
- value counts for each label (excluding null)
Example
{
  "column_name": "answerKey",
  "column_type": "string_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "n_unique": 4,
    "frequencies": {
      "D": 1221,
      "C": 1146,
      "A": 1378,
      "B": 1212
    }
  }
}
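The rule that decides between string_label and string_text can be written as a small predicate. This is a sketch of the rule as stated in this section, not the dataset viewer's code. The function name is_string_label is hypothetical, and using the number of values in the split as the denominator of the proportion is an assumption:

```python
def is_string_label(n_unique: int, n_values: int) -> bool:
    """Apply the category rule described in this section.

    n_values is assumed to be the number of values in the requested split.
    """
    # At most 10 unique values: always a category, whatever the proportion.
    if n_unique <= 10:
        return True
    if n_values == 0:
        return False
    # Otherwise both the proportion and the absolute limit must hold.
    return n_unique / n_values <= 0.2 and n_unique < 1000


# answerKey example above: 4 unique values over 1221 + 1146 + 1378 + 1212 rows.
print(is_string_label(4, 4957))
print(is_string_label(500, 1000))
```

A column that fails this test is reported as string_text instead.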
string_text
If a string column does not satisfy the conditions to be treated as a string_label, it is considered to be a column containing texts, and the response contains statistics over text lengths, measured in number of characters. The following measures are computed:
- minimum, maximum, mean, median, and standard deviation of text lengths
- number and proportion of null values
- histogram of text lengths with 10 bins
Example
{
  "column_name": "sentence",
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
      "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
    }
  }
}
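To make "statistics over text lengths" concrete, the sketch below computes the same kind of measures locally for a handful of strings. It is an illustration, not the dataset viewer's code: the function name text_length_stats is hypothetical, and the choice of the sample standard deviation (statistics.stdev) is an assumption, although it is consistent with the std of 2468.60541 reported earlier on this page for the idx column, whose values run from 0 to 8550:

```python
import statistics


def text_length_stats(texts: list[str]) -> dict:
    """Compute length statistics over a list of non-null strings.

    Requires at least two strings, because the sample standard
    deviation is undefined for a single value.
    """
    lengths = [len(text) for text in texts]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 5),
        "median": statistics.median(lengths),
        "std": round(statistics.stdev(lengths), 5),
    }


print(text_length_stats(["a", "bbb", "ccccc"]))
```

Note that len() counts Unicode code points, which may differ from the number of characters a reader perceives for some scripts and emoji.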
list
For lists, the distribution of their lengths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of list lengths
- number and proportion of null values
- histogram of list lengths with up to 10 bins
Example
{
  "column_name": "chat_history",
  "column_type": "list",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 1,
    "max": 3,
    "mean": 1.01741,
    "median": 1.0,
    "std": 0.13146,
    "histogram": {
      "hist": [11177, 196, 1],
      "bin_edges": [1, 2, 3, 3]
    }
  }
}
Note that dictionaries of lists are not supported.
audio
For audio data, the distribution of audio file durations is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of audio file durations
- number and proportion of null values
- histogram of audio file durations with 10 bins
Example
{
  "column_name": "audio",
  "column_type": "audio",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 1.02,
    "max": 15,
    "mean": 13.93042,
    "median": 14.77,
    "std": 2.63734,
    "histogram": {
      "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
      "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, 8.01, 9.408, 10.806, 12.204, 13.602, 15]
    }
  }
}
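The duration bin edges in this example appear to be evenly spaced between the minimum and the maximum, each bin being (15 - 1.02) / 10 = 1.398 wide. The sketch below reproduces them under that assumption. It is inferred from the example only, not taken from the dataset viewer's implementation, and the function name float_bin_edges is hypothetical:

```python
def float_bin_edges(minimum: float, maximum: float, n_bins: int = 10) -> list[float]:
    """Return n_bins + 1 evenly spaced edges from minimum to maximum."""
    width = (maximum - minimum) / n_bins
    # Round to hide floating-point noise such as 2.4179999999999997.
    edges = [round(minimum + i * width, 5) for i in range(n_bins)]
    edges.append(maximum)
    return edges


print(float_bin_edges(1.02, 15))
```

The same spacing matches the float example earlier on this page, where values between 0 and 2 fall into bins that are 0.2 wide.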
image
For image data, the distribution of image widths is computed. The following measures are returned:
- minimum, maximum, mean, median, and standard deviation of image widths
- number and proportion of null values
- histogram of image widths with 10 bins
Example
{
  "column_name": "image",
  "column_type": "image",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 256,
    "max": 873,
    "mean": 327.99339,
    "median": 341.0,
    "std": 60.07286,
    "histogram": {
      "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
      "bin_edges": [256, 318, 380, 442, 504, 566, 628, 690, 752, 814, 873]
    }
  }
}