TajaKuzman committed f98b0d9 (parent: 39047a3): Update README.md

README.md (updated sections):
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres - X-GENRE classifier

Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.
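
The model card's own use examples appear further below (they are not shown in full in this excerpt). As a rough, non-authoritative sketch, such a classifier could be queried with the Hugging Face `transformers` text-classification pipeline roughly as follows; the model identifier is a placeholder, not a confirmed repository name:

```python
from transformers import pipeline

# Placeholder: replace with the actual Hugging Face model ID of this classifier.
MODEL_ID = "<this-X-GENRE-classifier-model-id>"

# Sketch only: load the fine-tuned sequence-classification model as a pipeline.
classifier = pipeline("text-classification", model=MODEL_ID)

texts = [
    "On our page you can buy the best widgets on the market.",
    "The government announced a new package of measures on Tuesday.",
]
for prediction in classifier(texts, truncation=True):
    # Each prediction is a dict with the predicted label and its confidence score.
    print(prediction["label"], round(prediction["score"], 3))
```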
## Model description

The model was fine-tuned on the "X-GENRE" dataset, which consists of three genre datasets: CORE, FTD and GINCO. Each of these datasets has its own genre schema, so they were combined into a joint schema (the "X-GENRE" schema) based on a comparison of the labels and on cross-dataset experiments (described in detail [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)).

## X-GENRE categories

### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args= {
    # ... (hyperparameter values omitted in this excerpt)
}
```
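
Since the full argument dictionary is not visible in this excerpt, the snippet below is only a minimal, hypothetical sketch of how an `xlm-roberta-base` classifier can be fine-tuned with `simpletransformers`; the hyperparameter values, the toy training data and the label count of 9 are illustrative assumptions, not the settings documented in the full model card.

```python
# Hypothetical sketch: fine-tuning xlm-roberta-base for genre classification
# with simpletransformers. All values below are placeholders for illustration.
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy training data: a DataFrame with "text" and integer-encoded "labels" columns,
# as expected by simpletransformers.
train_df = pd.DataFrame({
    "text": ["Buy our new product today!", "The parliament passed the bill on Monday."],
    "labels": [0, 1],
})

model_args = {
    "num_train_epochs": 15,        # placeholder value
    "learning_rate": 1e-5,         # placeholder value
    "max_seq_length": 512,         # placeholder value
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "xlmroberta",                  # model type for xlm-roberta-base
    "xlm-roberta-base",
    num_labels=9,                  # assumed number of X-GENRE labels
    args=model_args,
    use_cuda=False,                # set to True if a GPU is available
)
model.train_model(train_df)
```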
An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.

For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). It is advised not to use predictions made with a confidence lower than 0.9. Furthermore, the label "Other" can be used as an additional indicator of low confidence, as it often indicates that the text does not have enough features of any genre; these predictions can be discarded as well.

After the proposed post-processing (removal of low-confidence predictions, of the label "Other" and, in this specific case, also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached macro and micro F1 of 0.92.
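
A minimal sketch of this kind of post-processing is shown below; the DataFrame and its column names (`text`, `predicted_label`, `confidence`) are assumptions for illustration, not the exact pipeline used for MaCoCu (that one is linked above).

```python
import pandas as pd

# Hypothetical predictions table; column names are assumed for illustration.
predictions = pd.DataFrame({
    "text": ["short snippet", "a sufficiently long document " * 30],
    "predicted_label": ["Other", "News"],
    "confidence": [0.55, 0.97],
})

MIN_WORDS = 75                   # rule of thumb for document length
MIN_CONFIDENCE = 0.9             # discard less confident predictions
DISCARDED_LABELS = {"Other"}     # for MaCoCu, "Forum" was discarded as well

word_counts = predictions["text"].str.split().str.len()
keep = (
    (word_counts >= MIN_WORDS)
    & (predictions["confidence"] >= MIN_CONFIDENCE)
    & (~predictions["predicted_label"].isin(DISCARDED_LABELS))
)
reliable = predictions[keep]
print(reliable[["predicted_label", "confidence"]])
```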
### Use examples

### Comparison with other models in in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset), the X-GENRE classifier outperforms all of the dataset-specific classifiers, except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.

| Trained and tested on | Micro F1 | Macro F1 |
|:----------------------|---------:|---------:|
| CORE | 0.778 | 0.627 |
| GINCO | 0.754 | 0.75 |

When applied to the test splits of each of the datasets, the classifier performs well:

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | FTD | 0.804 | 0.809 |
| X-GENRE | X-GENRE | 0.797 | 0.794 |
| X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
| X-GENRE | GINCO | 0.749 | 0.758 |

The classifier was also compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO: a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | EN-GINCO | 0.688 | 0.691 |
| X-GENRE | FinCORE | 0.674 | 0.581 |
| GINCO | EN-GINCO | 0.632 | 0.502 |
| FTD | EN-GINCO | 0.574 | 0.475 |
| CORE | EN-GINCO | 0.485 | 0.422 |
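
For reference, the Micro and Macro F1 values in these tables are the standard micro- and macro-averaged F1 scores; a quick sketch with `scikit-learn` (on made-up labels, not the evaluation data) illustrates the two averages:

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted genre labels, for illustration only.
y_true = ["News", "News", "Promotion", "Opinion/Argumentation", "Instruction"]
y_pred = ["News", "Promotion", "Promotion", "Opinion/Argumentation", "News"]

# Micro F1 aggregates over all instances; Macro F1 averages per-label F1 scores equally.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Micro F1: {micro_f1:.3f}, Macro F1: {macro_f1:.3f}")
```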