Is the dataset available?

#4
by fblgit - opened

I'd like to make a UNA version of this model. Is the dataset available?

AI Singapore org

Hi,
SEA-LION is pretrained on publicly available datasets.

For the English portion, RefinedWeb is used:
https://huggingface.co/datasets/tiiuae/falcon-refinedweb

For Southeast Asian languages, the mC4 3.1.0 dataset is used:
https://huggingface.co/datasets/allenai/c4/tree/mC4310

For code, we used the Python, JavaScript, Shell, SQL, and Markdown subsets of The Stack (deduplicated) dataset:
https://huggingface.co/datasets/bigcode/the-stack-dedup

Additionally, we include the StackExchange and ArXiv portions of the RedPajama dataset:
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
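
For anyone who wants to pull these sources programmatically, here is a minimal sketch that records the four repositories linked above and builds arguments for `datasets.load_dataset`. The `Source` class and the `load_kwargs` and `stream` helpers are hypothetical names for illustration, not part of SEA-LION or of any library. Subset selection (the mC4 language config, the per-language directories of The Stack, the RedPajama config names) is deliberately left to the caller, since the exact values should be checked on each dataset card. Some of these repos may also require accepting terms or logging in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    repo: str                       # Hugging Face dataset repo id, taken from the links above
    revision: Optional[str] = None  # branch name, when the link points at a branch
    subsets: tuple = ()             # subsets named in the reply (labels only, not loader arguments)


SOURCES = (
    Source("tiiuae/falcon-refinedweb"),
    Source("allenai/c4", revision="mC4310"),
    Source("bigcode/the-stack-dedup",
           subsets=("Python", "JavaScript", "Shell", "SQL", "Markdown")),
    Source("togethercomputer/RedPajama-Data-1T",
           subsets=("StackExchange", "ArXiv")),
)


def load_kwargs(source, **extra):
    """Build keyword arguments for datasets.load_dataset (streaming, train split)."""
    kwargs = {"path": source.repo, "split": "train", "streaming": True}
    if source.revision is not None:
        kwargs["revision"] = source.revision
    # Caller-supplied options such as name=... or data_dir=... (look these up
    # on the dataset card; they are not specified in this thread).
    kwargs.update(extra)
    return kwargs


def stream(source, **extra):
    """Open a source lazily. Needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported here so the listing above works offline
    return load_dataset(**load_kwargs(source, **extra))


if __name__ == "__main__":
    for s in SOURCES:
        print(s.repo, s.revision or "default branch", ", ".join(s.subsets) or "all")
```

Streaming is used here because these corpora are very large; `stream(SOURCES[0])` returns an iterable dataset rather than downloading everything up front.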

Hope this helps.
