Add citation for Thai dataset

#2
Files changed (1)
  1. README.md +16 -2
README.md CHANGED
@@ -52,7 +52,8 @@ SEA-LION was trained on 980B tokens of the following data:
 | mC4 - Filipino | 5.3B | 0.54% |
 | mC4 - Burmese | 4.9B | 0.49% |
 | mC4 - Vietnamese | 63.4B | 6.46% |
-| mC4 - Thai | 21.6B | 2.20% |
+| mC4 - Thai | 11.6B | 1.18% |
+| WangChanBERTa - Thai | 10B | 1.02% |
 | mC4 - Lao | 1.1B | 0.12% |
 | mC4 - Khmer | 3.9B | 0.40% |
 | mC4 - Tamil | 10.2B | 1.04% |
@@ -152,4 +153,17 @@ This the repository for the base model.
 The model has _not_ been aligned for safety.
 Developers and users should perform their own safety fine-tuning and related security measures.
 In no event shall the authors be held liable for any claim, damages, or other liability
-arising from the use of the released weights and codes.
+arising from the use of the released weights and codes.
+
+## Citations
+
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+  author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+  year={2021},
+  eprint={2101.09635},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```