dv-labse / README.md
monsoon-nlp's picture
training with tokenizer on
7027ae8
metadata
language: dv

dv-labse

This is an experiment in cross-lingual transfer learning, to insert Dhivehi word and word-piece tokens into Google's LaBSE model.

This currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on the Maldivian News Classification task https://github.com/Sofwath/DhivehiDatasets

  • mBERT: 52%
  • dv-wave (ELECTRA): 89%
  • dv-muril: 90.7%
  • dv-labse: 91.3-91.5% (may continue training)

Training

  • Start with LaBSE (similar to mBERT) with no Thaana vocabulary
  • Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
  • Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
  • Continue BERT pretraining on Dhivehi text

CoLab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi