---
language: ms
tags:
- melayu-bert
license: mit
datasets:
- oscar
widget:
- text: "Saya [MASK] makan nasi hari ini."
---
# Melayu BERT
Melayu BERT is a masked language model based on BERT. It was trained on the OSCAR dataset, specifically the `unshuffled_original_ms` subset. The model was initialized from an English BERT model and fine-tuned on the Malay data, achieving a perplexity of 9.46 on a 20% validation split. Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou. The model is available for both PyTorch and TensorFlow.
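The reported perplexity is the exponential of the mean cross-entropy loss on the validation split. The snippet below illustrates the relationship; the loss value is assumed for illustration only.

```python
import math

# Perplexity for a masked language model is exp(mean cross-entropy loss).
# A validation loss of about 2.25 (assumed for illustration) corresponds to
# a perplexity close to the reported 9.46.
eval_loss = 2.25
print(round(math.exp(eval_loss), 2))  # 9.49
```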
## Model
The model was trained for 3 epochs with a learning rate of 2e-3. The training loss per step is shown in the table below, followed by a hedged sketch of a comparable training setup.
| Step | Training loss |
|------|---------------|
| 500  | 5.051300 |
| 1000 | 3.701700 |
| 1500 | 3.288600 |
| 2000 | 3.024000 |
| 2500 | 2.833500 |
| 3000 | 2.741600 |
| 3500 | 2.637900 |
| 4000 | 2.547900 |
| 4500 | 2.451500 |
| 5000 | 2.409600 |
| 5500 | 2.388300 |
| 6000 | 2.351600 |
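A minimal sketch of a comparable fine-tuning setup is shown below. Only the dataset/subset, the 20% validation split, 3 epochs, and the 2e-3 learning rate come from this card; the base checkpoint, sequence length, and batch size are assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the Malay OSCAR subset and hold out 20% for validation (as in the card).
dataset = load_dataset("oscar", "unshuffled_original_ms", split="train")
dataset = dataset.train_test_split(test_size=0.2)

# Assumed English BERT base checkpoint used as the starting point.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Assumed maximum sequence length for illustration.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="melayu-bert",
        num_train_epochs=3,              # from the card
        learning_rate=2e-3,              # from the card
        per_device_train_batch_size=16,  # assumed
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```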
## How to Use
### As Masked Language Model
```python
from transformers import pipeline

pretrained_name = "StevenLimcorn/MelayuBERT"

fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

fill_mask("Saya [MASK] makan nasi hari ini.")
```
### Import Tokenizer and Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")
```
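With the tokenizer and model loaded directly, a minimal PyTorch sketch for predicting the masked token (assuming the PyTorch backend) could look like this:

```python
import torch

# Predict the most likely token at the [MASK] position using the model loaded above.
inputs = tokenizer("Saya [MASK] makan nasi hari ini.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```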
## Author
Melayu BERT was trained by Steven Limcorn and Wilson Wongso.