scale-up? or full datasets list?

#11
by lucyknada - opened

The 1.7B already works great, but it sometimes falls short, I assume simply because of its size. Is something in the 3B/4B range planned? Or could the full list of datasets be released? Thanks!

Hugging Face TB Research org

Hi, that's on the roadmap. Regarding the datasets, we use a mix of FineWeb-Edu, DCLM, and The Stack, along with new math and code datasets that we will release in the coming weeks together with a tech report.
