pdelobelle committed on
Commit f4bf122
1 Parent(s): b0fbc19

Update README.md

Files changed (1):
  1. README.md +81 -19
README.md CHANGED
@@ -3,33 +3,75 @@ language: "nl"
  thumbnail: "https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo.png"
  tags:
  - Dutch
  - RoBERTa
  - RobBERT
  license: mit
  datasets:
  - oscar
- - Shuffled Dutch section of the OSCAR corpus (https://oscar-corpus.com/)
  ---

- # RobBERT

- ## Model description

- [RobBERT v2](https://github.com/iPieter/RobBERT) is a Dutch state-of-the-art [RoBERTa](https://arxiv.org/abs/1907.11692)-based language model.

- More detailled information can be found in the [RobBERT paper](https://arxiv.org/abs/2001.06286).

  ## How to use

  ```python
  from transformers import RobertaTokenizer, RobertaForSequenceClassification
  tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
  model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
  ```

- ## Performance Evaluation Results

- All experiments are described in more detail in our [paper](https://arxiv.org/abs/2001.06286).

  ### Sentiment analysis
  Predicting whether a review is positive or negative using the [Dutch Book Reviews Dataset](https://github.com/benjaminvdb/110kDBRD).
@@ -109,7 +151,7 @@ Using the [CoNLL 2002 evaluation script](https://www.clips.uantwerpen.be/conll20
  | RobBERT v2 | 89.08 |


- ## Training procedure

  We pre-trained RobBERT using the RoBERTa training regime.
  We pre-trained our model on the Dutch section of the [OSCAR corpus](https://oscar-corpus.com/), a large multilingual corpus which was obtained by language classification in the Common Crawl corpus.
@@ -132,7 +174,7 @@ Using the [Fairseq library](https://github.com/pytorch/fairseq/tree/master/examp
  In between training jobs on the computing cluster, 2 Nvidia 1080 Ti's also covered some parameter updates for RobBERT v2.


- ## Limitations and bias

  In the [RobBERT paper](https://arxiv.org/abs/2001.06286), we also investigated potential sources of bias in RobBERT.

@@ -150,15 +192,35 @@ By augmenting the DBRB Dutch Book sentiment analysis dataset with the stated gen


- ## BibTeX entry and citation info

- ```bibtex
- @misc{delobelle2020robbert,
-     title={RobBERT: a Dutch RoBERTa-based Language Model},
-     author={Pieter Delobelle and Thomas Winters and Bettina Berendt},
-     year={2020},
-     eprint={2001.06286},
-     archivePrefix={arXiv},
-     primaryClass={cs.CL}
- }
  ```

  thumbnail: "https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo.png"
  tags:
  - Dutch
+ - Flemish
  - RoBERTa
  - RobBERT
  license: mit
  datasets:
  - oscar
+ - oscar (NL)
+ - dbrd
+ - lassy-ud
+ - europarl-mono
+ - conll2002
+ widget:
+ - text: "Hallo, ik ben RobBERT, een <mask> taalmodel van de KU Leuven"
  ---

+ <p align="center">
+ <img src="https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo_with_name.png" alt="RobBERT: A Dutch RoBERTa-based Language Model" width="75%">
+ </p>
+
+ # RobBERT: Dutch RoBERTa-based Language Model.
+
+ [RobBERT](https://github.com/iPieter/RobBERT) is the state-of-the-art Dutch BERT model. It is a large pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task. As such, it has been successfully used by many [researchers](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7180110604335112086) and [practitioners](https://huggingface.co/models?search=robbert) to achieve state-of-the-art performance on a wide range of Dutch natural language processing tasks, including:
+
+ - [Emotion detection](https://www.aclweb.org/anthology/2021.wassa-1.27/)
+ - Sentiment analysis ([book reviews](https://arxiv.org/pdf/2001.06286.pdf), [news articles](https://biblio.ugent.be/publication/8704637/file/8704638.pdf)*)
+ - [Coreference resolution](https://arxiv.org/pdf/2001.06286.pdf)
+ - Named entity recognition ([CoNLL](https://arxiv.org/pdf/2001.06286.pdf), [job titles](https://arxiv.org/pdf/2004.02814.pdf)*, [SoNaR](https://github.com/proycon/deepfrog))
+ - Part-of-speech tagging ([Small UD Lassy](https://arxiv.org/pdf/2001.06286.pdf), [CGN](https://github.com/proycon/deepfrog))
+ - [Zero-shot word prediction](https://arxiv.org/pdf/2001.06286.pdf)
+ - [Humor detection](https://arxiv.org/pdf/2010.13652.pdf)
+ - [Cyberbullying detection](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/automatic-classification-of-participant-roles-in-cyberbullying-can-we-detect-victims-bullies-and-bystanders-in-social-media-text/A2079C2C738C29428E666810B8903342)
+ - [Correcting dt-spelling mistakes](https://gitlab.com/spelfouten/dutch-simpletransformers/)*
+
+ and also achieved outstanding, near-state-of-the-art results for:

+ - [Natural language inference](https://arxiv.org/pdf/2101.05716.pdf)*
+ - [Review classification](https://medium.com/broadhorizon-cmotions/nlp-with-r-part-5-state-of-the-art-in-nlp-transformers-bert-3449e3cd7494)*

+ \* *Note that several evaluations use RobBERT-v1, and that the second, improved RobBERT-v2 outperforms this first model on everything we tested.*
+
+ *(Also note that this list is not exhaustive. If you used RobBERT for your application, we are happy to know about it! Send us a mail, or add it yourself to this list by sending a pull request with the edit!)*
+
+ More in-depth information about RobBERT can be found in our [blog post](https://people.cs.kuleuven.be/~pieter.delobelle/robbert/), [our paper](https://arxiv.org/abs/2001.06286), and [the RobBERT GitHub repository](https://github.com/iPieter/RobBERT).

  ## How to use

+ RobBERT uses the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture and pre-training, but with a Dutch tokenizer and training data. RoBERTa is the robustly optimized English BERT model, making it even more powerful than the original BERT model. Given this same architecture, RobBERT can easily be finetuned and used for inference using [code to finetune RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) models and most code used for BERT models, e.g. as provided by the [HuggingFace Transformers](https://huggingface.co/transformers/) library.
+
+ By default, RobBERT has the masked language model head used in training. This can be used as a zero-shot way to fill masks in sentences. It can be tested out for free on [RobBERT's hosted inference API on Hugging Face](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=De+hoofdstad+van+Belgi%C3%AB+is+%3Cmask%3E.). You can also create a new prediction head for your own task by using any of HuggingFace's [RoBERTa runners](https://huggingface.co/transformers/v2.7.0/examples.html#language-model-training) or [their fine-tuning notebooks](https://huggingface.co/transformers/v4.1.1/notebooks.html) with the model name changed to `pdelobelle/robbert-v2-dutch-base`, or use the original fairseq [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) training regimes.
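As a minimal illustration of this zero-shot mask filling, a short sketch using the Transformers `pipeline` API (assuming a `transformers` version with pipeline support installed) could look like this:

```python
# Minimal zero-shot fill-mask sketch using the pretrained masked-language-model head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

# Same style of prompt as the hosted inference API example linked above.
for prediction in fill_mask("De hoofdstad van België is <mask>."):
    print(prediction["token_str"], prediction["score"])
```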
+
+ Use the following code to download the base model and finetune it yourself, or use one of our finetuned models (documented on [our project site](https://people.cs.kuleuven.be/~pieter.delobelle/robbert/)).
+
  ```python
  from transformers import RobertaTokenizer, RobertaForSequenceClassification
  tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
  model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
  ```
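The snippet above only loads the base model with a fresh classification head. A hypothetical finetuning sketch, assuming the `dbrd` Dutch Book Reviews dataset on the Hugging Face Hub and the `datasets`/`Trainer` APIs (none of which are prescribed by the card), might look like:

```python
# Hypothetical finetuning sketch: sentiment classification on Dutch book reviews.
# The "dbrd" dataset name, split names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained(
    "pdelobelle/robbert-v2-dutch-base", num_labels=2
)

dataset = load_dataset("dbrd")  # assumed: Dutch Book Reviews Dataset on the Hub

def tokenize(batch):
    # Truncate long reviews so they fit within the model's input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="robbert-v2-dbrd",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
```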

+ Starting with `transformers v2.4.0` (or installing from source), you can use AutoTokenizer and AutoModel.
+ You can then use most of [HuggingFace's BERT-based notebooks](https://huggingface.co/transformers/v4.1.1/notebooks.html) for finetuning RobBERT on your type of Dutch language dataset.
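For instance, a short sketch of that Auto-class loading (assuming `transformers` >= 2.4.0 is installed) could be:

```python
# Loading RobBERT with the generic Auto classes instead of the Roberta-specific ones.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModel.from_pretrained("pdelobelle/robbert-v2-dutch-base")  # base encoder without a task head
```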

+
+ ## Technical Details From The Paper
+
+
+ ### Our Performance Evaluation Results
+
+ All experiments are described in more detail in our [paper](https://arxiv.org/abs/2001.06286), with the code in [our GitHub repository](https://github.com/iPieter/RobBERT).

  ### Sentiment analysis
  Predicting whether a review is positive or negative using the [Dutch Book Reviews Dataset](https://github.com/benjaminvdb/110kDBRD).
 
  | RobBERT v2 | 89.08 |


+ ## Pre-Training Procedure Details

  We pre-trained RobBERT using the RoBERTa training regime.
  We pre-trained our model on the Dutch section of the [OSCAR corpus](https://oscar-corpus.com/), a large multilingual corpus which was obtained by language classification in the Common Crawl corpus.

  In between training jobs on the computing cluster, 2 Nvidia 1080 Ti's also covered some parameter updates for RobBERT v2.


+ ## Investigating Limitations and Bias

  In the [RobBERT paper](https://arxiv.org/abs/2001.06286), we also investigated potential sources of bias in RobBERT.


+ ## How to Replicate Our Paper Experiments
+ Replicating our paper experiments is [described in detail on the RobBERT repository README](https://github.com/iPieter/RobBERT#how-to-replicate-our-paper-experiments).
+
+ ## Name Origin of RobBERT
+
+ Most BERT-like models have the word *BERT* in their name (e.g. [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html), [ALBERT](https://arxiv.org/abs/1909.11942), [CamemBERT](https://camembert-model.fr/), and [many, many others](https://huggingface.co/models?search=bert)).
+ As such, we queried our newly trained model using its masked language model to name itself *\<mask\>bert* using [all](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=Mijn+naam+is+%3Cmask%3Ebert.) [kinds](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=Hallo%2C+ik+ben+%3Cmask%3Ebert.) [of](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=Leuk+je+te+ontmoeten%2C+ik+heet+%3Cmask%3Ebert.) [prompts](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=Niemand+weet%2C+niemand+weet%2C+dat+ik+%3Cmask%3Ebert+heet.), and it consistently called itself RobBERT.
+ We thought it was really quite fitting, given that RobBERT is a [*very* Dutch name](https://en.wikipedia.org/wiki/Robbert) *(and thus clearly a Dutch language model)*, and additionally has a high similarity to its root architecture, namely [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html).
+
+ Since *"rob"* is a Dutch word for a seal, we decided to draw a seal and dress it up like [Bert from Sesame Street](https://muppet.fandom.com/wiki/Bert) for the [RobBERT logo](https://github.com/iPieter/RobBERT/blob/master/res/robbert_logo.png).
+
+ ## Credits and citation
+
+ This project was created by [Pieter Delobelle](https://people.cs.kuleuven.be/~pieter.delobelle), [Thomas Winters](https://thomaswinters.be) and [Bettina Berendt](https://people.cs.kuleuven.be/~bettina.berendt/).
+ If you would like to cite our paper or model, you can use the following BibTeX:

  ```
+ @inproceedings{delobelle2020robbert,
+     title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
+     author = "Delobelle, Pieter and
+       Winters, Thomas and
+       Berendt, Bettina",
+     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
+     doi = "10.18653/v1/2020.findings-emnlp.292",
+     pages = "3255--3265"
+ }
+ ```