HuggingFaceFW
Organization Card
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).
Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open-source libraries, such as the 🤗 libraries datatrove, nanotron or lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
Models: 30
HuggingFaceFW/fineweb-edu-classifier • Text Classification • 190k • 132
HuggingFaceFW/Datasets-Metrics-Viewer-Data
HuggingFaceFW/ablation-model-fineweb-edu • Text Generation • 627 • 11
HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT • Text Generation • 10 • 1
HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT • Text Generation • 10 • 2
HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT • Text Generation • 17
HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT • Text Generation • 12 • 3
HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT • Text Generation • 16 • 2
HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT • Text Generation • 11 • 4
HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT • Text Generation • 10 • 2
Datasets: 5
HuggingFaceFW/fineweb-edu • 3B • 624k • 543
HuggingFaceFW/fineweb • 46B • 384k • 1.75k
HuggingFaceFW/fineweb-edu-llama3-annotations • 467k • 251 • 36
HuggingFaceFW/fineweb-edu-score-2 • 11.8B • 29.7k • 60
HuggingFaceFW/admin • 2 • 7.73k • 3