legacy-datasets/wikipedia
Wikimedia collections, such as Wikipedia, are heavily used in ML research. This collection highlights some prominent examples of these datasets.
Note: Wikipedia dataset containing cleaned articles in all languages.
Note: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
Note: This is a modified version of https://huggingface.co/datasets/wikitext that returns whole Wiki pages instead of Wiki text line by line.
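To illustrate the difference between line-by-line text and whole pages, here is a minimal sketch of how WikiText-style lines could be regrouped into pages. It assumes that a page starts at a top-level heading of the form ` = Title = ` and that section headings use doubled markers (` = = Section = = `). The function names `is_page_title` and `group_lines_into_pages` are hypothetical, and this is not necessarily how the modified dataset builds its pages.

```python
from typing import Iterable, List


def is_page_title(line: str) -> bool:
    """Return True for a top-level heading such as ' = Title = '.

    Assumption: section headings use doubled markers
    (' = = Section = = '), so they are excluded here.
    """
    text = line.strip()
    if len(text) < 5:  # shortest possible title is '= A ='
        return False
    if not (text.startswith("= ") and text.endswith(" =")):
        return False
    return not (text.startswith("= =") or text.endswith("= ="))


def group_lines_into_pages(lines: Iterable[str]) -> List[str]:
    """Concatenate consecutive lines into one string per page."""
    pages: List[str] = []
    current: List[str] = []
    for line in lines:
        # A new title closes the page collected so far.
        if is_page_title(line) and current:
            pages.append("".join(current))
            current = []
        # Skip blank lines that come before the first page content.
        if line.strip() or current:
            current.append(line)
    if current:
        pages.append("".join(current))
    return pages
```

Given an iterable of text lines, `group_lines_into_pages(lines)` returns a list with one string per page, each starting with its title line.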