---
language: ms
tags:
- melayu-bert
license: mit
datasets:
- oscar
widget:
- text: "Saya [MASK] makan nasi hari ini."
---

## Melayu BERT

Melayu BERT is a masked language model based on [BERT](https://arxiv.org/abs/1810.04805). It was trained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset, specifically the `unshuffled_original_ms` subset. Training started from the [English BERT model](https://huggingface.co/bert-base-uncased), which was fine-tuned on the Malay data. The model achieved a perplexity of 9.46 on a 20% validation split. Many of the techniques used are based on a Hugging Face tutorial [notebook](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) written by [Sylvain Gugger](https://github.com/sgugger) and a [fine-tuning tutorial notebook](https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb) written by [Pierre Guillou](https://huggingface.co/pierreguillou). The model is available for both PyTorch and TensorFlow.

## Model

The model was trained for 3 epochs with a learning rate of 2e-3. The training loss per step is shown below.

| Step | Training loss |
|------|---------------|
| 500  | 5.051300 |
| 1000 | 3.701700 |
| 1500 | 3.288600 |
| 2000 | 3.024000 |
| 2500 | 2.833500 |
| 3000 | 2.741600 |
| 3500 | 2.637900 |
| 4000 | 2.547900 |
| 4500 | 2.451500 |
| 5000 | 2.409600 |
| 5500 | 2.388300 |
| 6000 | 2.351600 |

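The reported perplexity is the exponential of the validation cross-entropy loss. A minimal sketch of that relationship (the `eval_loss` value below is illustrative, back-calculated from the reported perplexity of 9.46, not a logged metric):

```python
import math

# Perplexity is exp(cross-entropy loss). The eval loss below is an
# illustrative value chosen to match the reported perplexity of 9.46.
eval_loss = 2.247
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # → 9.46
```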
## How to Use
### As Masked Language Model
```python
from transformers import pipeline

pretrained_name = "StevenLimcorn/MelayuBERT"
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)
fill_mask("Saya [MASK] makan nasi hari ini.")
```
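The pipeline returns a list of candidate fills, each a dict with a `sequence`, `score`, and `token_str`, ranked by score. A minimal sketch of handling that output (the candidates and scores below are made-up placeholders, not actual MelayuBERT predictions):

```python
# Illustrative shape of a fill-mask pipeline result; the tokens and
# scores here are invented for demonstration, not real model output.
predictions = [
    {"sequence": "saya akan makan nasi hari ini.", "score": 0.32, "token_str": "akan"},
    {"sequence": "saya sudah makan nasi hari ini.", "score": 0.21, "token_str": "sudah"},
]

# Take the highest-scoring candidate fill for the [MASK] position.
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # → akan
```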

### Import Tokenizer and Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")
```
## Author
Melayu BERT was trained by [Steven Limcorn](https://github.com/stevenlimcorn) and [Wilson Wongso](https://hf.co/w11wo).