File size: 4,519 Bytes

d584337
cfe4fab
d584337
cfe4fab
 
 
57a25d1
cfe4fab
 
 
 
 
 
 
 
 
 
 
8cb7a69
cfe4fab
 
 
2800507
cfe4fab
 
 
 
 
d584337
cfe4fab
 
2800507
cfe4fab
 
2800507
cfe4fab
e1eaeb2
 
 
 
 
5cc4b63
c9f9151
5cc4b63
 
 
2e9199b
cfe4fab
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8cb7a69
 
 
 
 
 
 
 
 
 
 
 
cfe4fab

---
language: nl
license: mit
datasets:
- dbrd
model-index:
- name: robbertje-merged-dutch-sentiment
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: dbrd
      type: sentiment-analysis
      split: test
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9294064748201439
widget:
- text: "Ik erken dat dit een boek is, daarmee is alles gezegd."
- text: "Prachtig verhaal, heel mooi verteld en een verrassend einde... Een topper!"
thumbnail: "https://github.com/iPieter/robbertje/raw/master/images/robbertje_logo_with_name.png"
tags:
- Dutch
- Flemish
- RoBERTa
- RobBERT
---

<p align="center"> 
    <img src="https://github.com/iPieter/robbertje/raw/master/images/robbertje_logo_with_name.png" alt="RobBERTje: A collection of distilled Dutch models" width="75%">
 </p>

# RobBERTje finetuned for sentiment analysis on DBRD

This is a finetuned model based on [RobBERTje (merged)](https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-non-shuffled). We used [DBRD](https://huggingface.co/datasets/dbrd), which consists of book reviews from [hebban.nl](hebban.nl). Hence our example sentences about books. We did some limited experiments to test if this also works for other domains, but this was not exactly amazing. 

We released a distilled model and a `base`-sized model. Both models perform quite well, so there is only a slight performance tradeoff:


| Model          | Identifier                                                             | Layers | #Params.  | Accuracy  |
|----------------|------------------------------------------------------------------------|--------|-----------|-----------|
| RobBERT (v2)   | [`DTAI-KULeuven/robbert-v2-dutch-sentiment`](https://huggingface.co/DTAI-KULeuven/robbert-v2-dutch-sentiment)    | 12     | 116 M     |93.3*      | 
| RobBERTje - Merged (p=0.5)| [`DTAI-KULeuven/robbertje-merged-dutch-sentiment`](https://huggingface.co/DTAI-KULeuven/robbertje-merged-dutch-sentiment) | 6 | 74 M      |92.9       |

*The results of RobBERT are of a different run than the one reported in the paper.

# Training data and setup
We used the [Dutch Book Reviews Dataset (DBRD)](https://huggingface.co/datasets/dbrd) from van der Burgh et al. (2019).
Originally, these reviews got a five-star rating, but this has been converted to positive (⭐️⭐️⭐️⭐️ and ⭐️⭐️⭐️⭐️⭐️), neutral (⭐️⭐️⭐️) and negative (⭐️ and ⭐️⭐️). 
We used 19.5k reviews for the training set, 528 reviews for the validation set and 2224 to calculate the final accuracy.

The validation set was used to evaluate a random hyperparameter search over the learning rate, weight decay and gradient accumulation steps. 
The full training details are available in [`training_args.bin`](https://huggingface.co/DTAI-KULeuven/robbert-v2-dutch-sentiment/blob/main/training_args.bin) as a binary PyTorch file. 

# Limitations and biases
- The domain of the reviews is limited to book reviews.
- Most authors of the book reviews were women, which could have caused [a difference in performance for reviews written by men and women](https://www.aclweb.org/anthology/2020.findings-emnlp.292). 

## Credits and citation

This project is created by [Pieter Delobelle](https://people.cs.kuleuven.be/~pieter.delobelle), [Thomas Winters](https://thomaswinters.be) and [Bettina Berendt](https://people.cs.kuleuven.be/~bettina.berendt/).
If you would like to cite our paper or models, you can use the following BibTeX:

```
@article{Delobelle_Winters_Berendt_2021,
	title        = {RobBERTje: A Distilled Dutch BERT Model},
	author       = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
	year         = 2021,
	month        = {Dec.},
	journal      = {Computational Linguistics in the Netherlands Journal},
	volume       = 11,
	pages        = {125–140},
	url          = {https://www.clinjournal.org/clinj/article/view/131}
}


@inproceedings{delobelle2020robbert,
    title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
    author = "Delobelle, Pieter  and
      Winters, Thomas  and
      Berendt, Bettina",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
    doi = "10.18653/v1/2020.findings-emnlp.292",
    pages = "3255--3265"
}
```