singhsidhukuldeep posted an update Jun 8
πŸ“ˆ One of the biggest changes in Llama 3 was the training dataset, which grew ~7.5x over Llama 2 (2T to 15T tokens) πŸš€

While Meta did not open source the dataset, it sparked a thought... what would happen if everyone had access to a big, high-quality dataset? πŸ€”

To address that, in April this year, @huggingface released FineWeb, a 15T token open-source dataset 🌍

And now they are releasing the FineWeb technical report and FineWeb-Edu πŸ“š

πŸ† 15T tokens in FineWeb outperforming other open datasets
πŸŽ“ 1.3T highest-quality educational dataset FineWeb-Edu
πŸ“˜ 5.4T high-quality educational tokens in FineWeb-Edu-2

FineWeb-Edu outperforms other open datasets on MMLU, ARC, and OpenBookQA πŸ“ˆ
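
If you want to poke at the data yourself, here's a minimal sketch (not from the post) of streaming a FineWeb-Edu sample with the πŸ€— `datasets` library, so you never download the full multi-terabyte corpus. The `sample-10BT` config name is an assumption; check the dataset card on the Hub for the configs that actually exist:

```python
from itertools import islice

from datasets import load_dataset

# Stream a FineWeb-Edu sample instead of downloading it in full
fw_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",  # Hub repo id
    name="sample-10BT",           # assumed sample config; see the dataset card
    split="train",
    streaming=True,               # iterate lazily over the shards
)

# Peek at the first few documents
for doc in islice(fw_edu, 3):
    print(doc["text"][:200])  # each record carries the raw page text in "text"
```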

Released under the ODC-By 1.0 license πŸ“œ

Report: HuggingFaceFW/blogpost-fineweb-v1