is the dataset available?
#4, opened by fblgit
I'd like to make a UNA version of this model. Is the dataset available?
Hi,
SEA-LION is pretrained on publicly available datasets.
For the English portion, RefinedWeb is used:
https://huggingface.co/datasets/tiiuae/falcon-refinedweb
For Southeast Asian languages, the mC4 3.1.0 dataset is used:
https://huggingface.co/datasets/allenai/c4/tree/mC4310
For code, we used the Python, JavaScript, Shell, SQL, and Markdown subsets of The Stack dataset:
https://huggingface.co/datasets/bigcode/the-stack-dedup
Additionally, we include the StackExchange and ArXiv portions of the RedPajama dataset:
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
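If it is useful, here is a rough sketch of how the links above could be turned into something loadable with the Hugging Face `datasets` library. This is not the pipeline used to train SEA-LION. The labels in `SOURCES` and the helper names `parse_hub_url` and `stream_first_rows` are made up for illustration, and the `split="train"` default is an assumption that may not hold for every repository.

```python
from urllib.parse import urlparse

# The four links listed above. The labels are illustrative, not official names.
SOURCES = {
    "english_refinedweb": "https://huggingface.co/datasets/tiiuae/falcon-refinedweb",
    "sea_mc4_310": "https://huggingface.co/datasets/allenai/c4/tree/mC4310",
    "code_the_stack": "https://huggingface.co/datasets/bigcode/the-stack-dedup",
    "redpajama": "https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T",
}


def parse_hub_url(url):
    """Split a Hub dataset URL into (repo_id, revision).

    revision is None when the URL has no /tree/<revision> part.
    """
    parts = urlparse(url).path.strip("/").split("/")
    # Expected shape: datasets/<owner>/<name>[/tree/<revision>]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    repo_id = f"{parts[1]}/{parts[2]}"
    revision = parts[4] if len(parts) >= 5 and parts[3] == "tree" else None
    return repo_id, revision


def stream_first_rows(url, n=3, **load_kwargs):
    """Hypothetical helper: stream the first n rows without a full download.

    Extra keyword arguments (for example a config name or data_dir) are
    passed through to load_dataset unchanged.
    """
    # Third-party import kept local so the parsing code works without it.
    from datasets import load_dataset

    repo_id, revision = parse_hub_url(url)
    ds = load_dataset(
        repo_id, revision=revision, split="train", streaming=True, **load_kwargs
    )
    return [row for _, row in zip(range(n), ds)]


if __name__ == "__main__":
    for label, url in SOURCES.items():
        print(label, parse_hub_url(url))
```

Streaming avoids downloading the full corpora, which are very large. Some of these repositories may need extra arguments (a config name, a `data_dir`, or an access token after accepting terms on the Hub), so check each dataset card before calling `stream_first_rows`.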
Hope this helps.