---
license: cc0-1.0
language:
- is
tags:
- MaCoCu
---

# Model description

**XLMR-MaCoCu-is** is a large pre-trained language model trained on **Icelandic** texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model and was developed as part of the [MaCoCu](https://macocu.eu/) project, using only data crawled during the project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

XLMR-MaCoCu-is was trained on 4.4GB of Icelandic text, which amounts to 688M tokens. It was trained for 75,000 steps with a batch size of 1,024, and it uses the same vocabulary as the original XLM-R-large model.
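Because the vocabulary is unchanged, the tokenizer behaves identically to the original XLM-R-large tokenizer. As a quick, purely illustrative sanity check (not part of the training pipeline), one can compare the two:

```python
# Illustrative sanity check: the MaCoCu tokenizer should match the
# original XLM-R-large tokenizer, since the vocabulary was not changed.
from transformers import AutoTokenizer

tok_macocu = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
tok_xlmr = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "Þetta er íslensk setning."  # "This is an Icelandic sentence."
assert tok_macocu.vocab_size == tok_xlmr.vocab_size
assert tok_macocu.tokenize(sentence) == tok_xlmr.tokenize(sentence)
```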

The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")  # TensorFlow
```
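As a usage example, here is a minimal sketch that extracts a sentence embedding by mean-pooling the last hidden states; the pooling strategy is an illustrative choice, not an official recommendation:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")

inputs = tokenizer("Halló heimur!", return_tensors="pt")  # "Hello world!"
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens, masking out padding positions
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embedding.shape)  # torch.Size([1, 1024]) for the large architecture
```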

# Data

For training, we used all data in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplication, we were left with a total of 4.4 GB of text, which equals 688M tokens.
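The actual de-duplication tooling is part of the MaCoCu pipeline; purely as an illustration, exact document-level de-duplication on whitespace-normalized text could look like the sketch below (this is not the project's actual method):

```python
import hashlib

def deduplicate(documents):
    """Yield each document once, keyed on a whitespace-normalized hash."""
    seen = set()
    for doc in documents:
        key = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc
```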

# Benchmark performance

We tested the performance of **XLMR-MaCoCu-is** on benchmarks for XPOS, UPOS, NER and COPA. For UPOS and XPOS, we used data from the [Universal Dependencies](https://universaldependencies.org/) project. For NER, we used the MIM-GOLD-NER dataset. For COPA, we automatically translated the English dataset using Google Translate; for details, please see our [GitHub repo](https://github.com/RikVN/COPA). We compare performance to the strong multilingual models XLM-R-base and XLM-R-large, as well as to the monolingual [IceBERT](https://huggingface.co/vesteinn/IceBERT) model. For details on the XPOS/UPOS/NER fine-tuning procedure, see our [GitHub](https://github.com/macocu/LanguageModels); a general sketch of the recipe follows below.
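A minimal token-classification sketch of that general recipe (the toy sentence, tag set, and learning rate below are illustrative placeholders, not our benchmark settings):

```python
# Sketch of token classification fine-tuning (as used for UPOS/XPOS/NER);
# the data, labels and hyperparameters are placeholders for illustration.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER"]  # placeholder tag set
tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaCoCu-is", num_labels=len(labels)
)

# One toy example: pre-split words with word-level tags.
words = ["Rik", "van", "Noord", "býr", "í", "Groningen", "."]
tags = [1, 2, 2, 0, 0, 0, 0]

# Tokenize and align word-level tags to subword tokens (-100 is ignored
# by the loss function).
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
enc["labels"] = torch.tensor(
    [[-100 if wid is None else tags[wid] for wid in enc.word_ids(0)]]
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**enc).loss  # cross-entropy over the labeled subwords
loss.backward()           # one illustrative update step
optimizer.step()
```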

Scores are averages of three runs, except for COPA, for which we use 10 runs. We use the same hyperparameter settings for all models.

|                    | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **NER**  | **NER**  | **COPA** |
|--------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|                    | **Dev**  | **Test** | **Dev**  | **Test** | **Dev**  | **Test** | **Test** |
| **XLM-R-base**     |   96.8   |   96.5   |   94.6   |   94.3   |   85.3   |   89.7   |   55.2   |
| **XLM-R-large**    |   97.0   |   96.7   |   94.9   |   94.7   |   88.5   |   91.7   |   54.3   |
| **IceBERT**        |   96.4   |   96.0   |   94.0   |   93.7   |   83.8   |   89.7   |   54.6   |
| **XLMR-MaCoCu-is** | **97.3** | **97.0** | **95.4** | **95.1** | **90.8** | **93.2** | **59.6** |

# Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```