scale-up? or full datasets list?
#11, opened by lucyknada
The 1.7B already works great but sometimes falls short, I assume simply because of its size. Is something in the 3B/4B range planned? Or could the full datasets list be released? Thanks!
Hi, that's on the roadmap. Regarding the datasets, we use a mix of FineWeb-Edu, DCLM, and The Stack, along with new math and code datasets that we will release in the upcoming weeks together with a tech report.