Add citation for Thai dataset

#2
Files changed (1)
  1. README.md +16 -2
README.md CHANGED
@@ -52,7 +52,8 @@ SEA-LION was trained on 980B tokens of the following data:
 | mC4 - Filipino | 5.3B | 0.54% |
 | mC4 - Burmese | 4.9B | 0.49% |
 | mC4 - Vietnamese | 63.4B | 6.46% |
-| mC4 - Thai | 21.6B | 2.20% |
+| mC4 - Thai | 11.6B | 1.18% |
+| WangChanBERTa - Thai | 10B | 1.02% |
 | mC4 - Lao | 1.1B | 0.12% |
 | mC4 - Khmer | 3.9B | 0.40% |
 | mC4 - Tamil | 10.2B | 1.04% |
@@ -152,4 +153,17 @@ This the repository for the base model.
 The model has _not_ been aligned for safety.
 Developers and users should perform their own safety fine-tuning and related security measures.
 In no event shall the authors be held liable for any claim, damages, or other liability
-arising from the use of the released weights and codes.
+arising from the use of the released weights and codes.
+
+## Citations
+
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+  author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+  year={2021},
+  eprint={2101.09635},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```