BERTu

A Maltese monolingual model pre-trained from scratch on the Korpus Malti v4.0 using the BERT (base) architecture.
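
As a minimal sketch of how the model might be loaded, the snippet below uses the Hugging Face `transformers` library to obtain contextual embeddings. It assumes that `transformers` and `torch` are installed and that the checkpoint is available on the Hub under the id `MLRS/BERTu`. The helper name `embed` and the example sentence are illustrative and do not come from the original card.

```python
MODEL_ID = "MLRS/BERTu"  # repository id as it appears on this model card


def embed(sentences):
    """Return last-layer token embeddings for a list of sentences.

    Hypothetical helper for illustration. The result is a tensor of shape
    (batch, sequence_length, hidden_size); hidden_size is 768 for BERT base.
    """
    # Imported lazily so that defining the helper does not require the
    # libraries to be installed or any network access.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()  # disable dropout for inference

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return outputs.last_hidden_state


# Example call (downloads the model on first use):
#   vectors = embed(["Malta hija gżira fil-Mediterran."])
#   print(vectors.shape)
```

For the downstream tasks listed under the evaluation results, the encoder would typically be fine-tuned with a task-specific head rather than used frozen as shown here.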

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at https://mlrs.research.um.edu.mt/.

CC BY-NC-SA 4.0

Citation

This work was first presented in "Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese". Cite it as follows:

@inproceedings{BERTu,
    title = "Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and {BERT} Models for {M}altese",
    author = "Micallef, Kurt  and
              Gatt, Albert  and
              Tanti, Marc  and
              van der Plas, Lonneke  and
              Borg, Claudia",
    booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
    month = jul,
    year = "2022",
    address = "Hybrid",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.deeplo-1.10",
    doi = "10.18653/v1/2022.deeplo-1.10",
    pages = "90--101",
}

Model size

126M parameters, stored as Safetensors (tensor types: I64, F32).

Dataset used to train MLRS/BERTu

Korpus Malti v4.0

Evaluation results

All scores are self-reported.

  • Unlabelled Attachment Score on Maltese Universal Dependencies Treebank (MUDT): 92.310
  • Labelled Attachment Score on Maltese Universal Dependencies Treebank (MUDT): 88.140
  • UPOS Accuracy on MLRS POS dataset: 98.580
  • XPOS Accuracy on MLRS POS dataset: 98.540
  • Span-based F1 on WikiAnn (Maltese): 86.770
  • Macro-averaged F1 on Maltese Sentiment Analysis Dataset: 78.960
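
For readers unfamiliar with the two attachment scores listed above, the sketch below shows how they are conventionally defined: the unlabelled score counts tokens whose predicted head is correct, and the labelled score additionally requires the correct dependency relation. This is not the evaluation script behind the reported numbers. The function name `attachment_scores` and the toy parse are invented for illustration.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) as percentages.

    gold and pred are equal-length lists with one (head_index, relation)
    pair per token. Hypothetical helper for illustration only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("at least one token is required")

    # Unlabelled: only the head index has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Labelled: head index and relation label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy four-token sentence (invented data, head 0 marks the root).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # prints "UAS = 75.00, LAS = 50.00"
```

In the toy parse, three of four heads are correct, but only two tokens have both the correct head and the correct label, so the labelled score is lower than the unlabelled one.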