List splits and subsets

Datasets typically have splits and may also have subsets. A split is a portion of the dataset, such as train or test, that is used during a particular stage of training or evaluating a model. A subset (also called a configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets, where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the conceptual guide on “Splits and subsets”!

This guide shows you how to use the dataset viewer’s /splits endpoint to retrieve a dataset’s splits and subsets programmatically. Feel free to also try it out with Postman, RapidAPI, or ReDoc.

The /splits endpoint accepts the dataset name as its query parameter:
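As a minimal sketch of such a request, the Python snippet below builds the /splits URL for a dataset and fetches the response using only the standard library. The base URL `https://datasets-server.huggingface.co` and the query parameter name `dataset` are assumptions not stated in this guide, and the helper names `build_splits_url` and `fetch_splits` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the dataset viewer API (not given in this guide).
API_BASE = "https://datasets-server.huggingface.co"


def build_splits_url(dataset: str) -> str:
    """Return the /splits URL with the dataset name as a query parameter."""
    # The parameter name "dataset" is an assumption; urlencode escapes the
    # slash in names such as "ibm/duorc".
    return f"{API_BASE}/splits?{urlencode({'dataset': dataset})}"


def fetch_splits(dataset: str) -> dict:
    """Send a GET request to /splits and decode the JSON response body."""
    with urlopen(build_splits_url(dataset), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_splits("ibm/duorc")` performs the network request and returns the decoded JSON as a dictionary. Private or gated datasets may additionally require an authorization header, which this sketch does not send.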

The endpoint returns a JSON object containing a list of the dataset’s splits and subsets. For example, the ibm/duorc dataset has two subsets, each with three splits, for six splits in total:

{
  "splits": [
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "train" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "validation" },
    { "dataset": "ibm/duorc", "config": "SelfRC", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
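Each entry in the response pairs one subset (the `config` field) with one split, so it can be handy to group the entries by subset. The sketch below does this for the ibm/duorc response shown above, with the body copied in as a Python dictionary; the helper name `splits_by_subset` is hypothetical.

```python
from collections import defaultdict

# Response body copied from the ibm/duorc example above.
response = {
    "splits": [
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "ParaphraseRC", "split": "test"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "train"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "validation"},
        {"dataset": "ibm/duorc", "config": "SelfRC", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}


def splits_by_subset(payload: dict) -> dict:
    """Map each subset (config) name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in payload["splits"]:
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


grouped = splits_by_subset(response)
for subset, splits in grouped.items():
    print(f"{subset}: {', '.join(splits)}")
# ParaphraseRC: train, validation, test
# SelfRC: train, validation, test
```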
