Commit 39047a3 by TajaKuzman (1 parent: 874e90e)
Update README.md
README.md CHANGED
@@ -112,11 +112,15 @@ widget:
|
|
112 |
|
113 |
---
|
114 |
|
115 |
-
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
|
116 |
|
117 |
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification and can be applied to any text in a language supported by `xlm-roberta-base`.
|
118 |
|
119 |
-
##
|
120 |
|
121 |
List of labels:
|
122 |
```
|
@@ -157,6 +161,13 @@ model_args= {
|
|
157 |
|
158 |
## Usage
|
159 |
|
160 |
### Use examples
|
161 |
|
162 |
```python
|
@@ -185,6 +196,43 @@ predictions
|
|
185 |
|
186 |
## Performance
|
187 |
|
188 |
|
189 |
## Citation
|
190 |
|
|
|
112 |
|
113 |
---
|
114 |
|
115 |
+
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres - X-GENRE classifier
|
116 |
|
117 |
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification and can be applied to any text in a language supported by `xlm-roberta-base`.
|
118 |
|
119 |
+
## Model description
|
120 |
+
|
121 |
+
The model was fine-tuned on the "X-GENRE" dataset, which consists of three genre datasets: CORE, FTD and GINCO. Each dataset has its own genre schema, so the schemata were combined into a joint schema (the "X-GENRE" schema) based on cross-dataset experiments (described in detail [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)). The labels of each dataset in the merged dataset were then mapped to the joint schema.
|
122 |
+
|
123 |
+
## X-GENRE categories
|
124 |
|
125 |
List of labels:
|
126 |
```
|
|
|
161 |
|
162 |
## Usage
|
163 |
|
164 |
+
An example of preparing data for genre identification and post-processing the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.
|
165 |
+
|
166 |
+
For reliable results, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words). Predictions made with a confidence lower than 0.9 should not be used. Furthermore, the label "Other" can serve as another indicator of low confidence, as it often indicates that the text does not have enough features of any genre, so these predictions can be discarded as well.
|
167 |
+
|
168 |
+
After the proposed post-processing (removal of low-confidence predictions, of the label "Other" and, in this specific case, also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, is macro F1: 0.92, micro F1: 0.92.
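
The post-processing rules described above (minimum length, confidence threshold, discarded labels) could be sketched roughly as follows. This is a minimal illustration, not the code used for the MaCoCu experiments: the `Prediction` tuple, the function names and the way the confidence is stored are assumptions, and only the thresholds (75 words, 0.9) and the labels "Other" and "Forum" come from the text.

```python
from typing import NamedTuple

MIN_WORDS = 75                 # rule-of-thumb minimum document length
MIN_CONFIDENCE = 0.9           # predictions below this confidence are dropped
DISCARDED_LABELS = {"Other"}   # "Other" is treated as a low-confidence signal


class Prediction(NamedTuple):
    """Hypothetical container for one classified document."""
    text: str
    label: str
    confidence: float


def keep_prediction(pred, discarded_labels=DISCARDED_LABELS):
    """Return True if the prediction passes all three filters."""
    long_enough = len(pred.text.split()) >= MIN_WORDS
    confident = pred.confidence >= MIN_CONFIDENCE
    informative = pred.label not in discarded_labels
    return long_enough and confident and informative


def filter_predictions(preds, discarded_labels=DISCARDED_LABELS):
    """Keep only the predictions that pass the post-processing rules."""
    return [p for p in preds if keep_prediction(p, discarded_labels)]
```

Passing `{"Other", "Forum"}` as `discarded_labels` would mirror the MaCoCu-specific setup, in which the label "Forum" was removed as well.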
|
169 |
+
|
170 |
+
|
171 |
### Use examples
|
172 |
|
173 |
```python
|
|
|
196 |
|
197 |
## Performance
|
198 |
|
199 |
+
### Comparison with other models in in-dataset and cross-dataset experiments
|
200 |
+
|
201 |
+
The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
|
202 |
+
|
203 |
+
In the in-dataset experiments (trained and tested on splits of the same dataset), it outperforms the classifiers trained on the individual datasets, except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.
|
204 |
+
|
205 |
+
| Trained on | Micro F1 | Macro F1 |
|
206 |
+
|:-------------|-----------:|-----------:|
|
207 |
+
| FTD | 0.843 | 0.851 |
|
208 |
+
| X-GENRE | 0.797 | 0.794 |
|
209 |
+
| CORE | 0.778 | 0.627 |
|
210 |
+
| GINCO | 0.754 | 0.75 |
|
211 |
+
|
212 |
+
In the cross-dataset experiments (trained on the X-GENRE dataset, tested on splits of the separate genre datasets), the classifier performs well:
|
213 |
+
|
214 |
+
| Trained on | Tested on | Micro F1 | Macro F1 |
|
215 |
+
|:-------------|:------------|-----------:|-----------:|
|
216 |
+
| X-GENRE | CORE | 0.837 | 0.859 |
|
217 |
+
| X-GENRE | FTD | 0.804 | 0.809 |
|
218 |
+
| X-GENRE | X-GENRE | 0.797 | 0.794 |
|
219 |
+
| X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
|
220 |
+
| X-GENRE | SI-GINCO | 0.749 | 0.758 |
|
221 |
+
|
222 |
+
The classifier was compared with classifiers trained on only one of the datasets, using 2 additional genre datasets (with labels based on the joint mapping) on which it was not fine-tuned:
|
223 |
+
- EN-GINCO: English enTenTen20 corpus, annotated with GINCO labels
|
224 |
+
- [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus, annotated with CORE labels
|
225 |
+
|
226 |
+
| Trained on | Tested on | Micro F1 | Macro F1 |
|
227 |
+
|:-------------|:------------|-----------:|-----------:|
|
228 |
+
| X-GENRE | EN-GINCO | 0.688 | 0.691 |
|
229 |
+
| X-GENRE | FinCORE | 0.674 | 0.581 |
|
230 |
+
| MT-GINCO | EN-GINCO | 0.654 | 0.538 |
|
231 |
+
| SI-GINCO | EN-GINCO | 0.632 | 0.502 |
|
232 |
+
| FTD | EN-GINCO | 0.574 | 0.475 |
|
233 |
+
| CORE | EN-GINCO | 0.485 | 0.422 |
|
234 |
+
|
235 |
+
In the cross-dataset and cross-lingual experiments, the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.
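
The tables above report both micro and macro F1, and the two can diverge noticeably (for example, 0.778 versus 0.627 for the classifier trained on CORE). The sketch below shows one common way to compute both scores for single-label classification, which may help when reading those numbers. It is an illustration only: the function name and the toy labels and counts are made up, and it is not the evaluation code used for the reported results.

```python
from collections import Counter


def micro_macro_f1(gold, predicted):
    """Compute micro and macro F1 for single-label classification."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro F1 pools the counts over all labels before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages the per-label F1 scores, weighting every label equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data: a frequent label classified well and a rare one classified poorly.
gold = ["News"] * 8 + ["Forum"] * 2
predicted = ["News"] * 9 + ["Forum"]
micro, macro = micro_macro_f1(gold, predicted)
```

In this toy case the macro score is lower than the micro score, because the poorly classified rare label counts as much as the frequent one in the macro average.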
|
236 |
|
237 |
## Citation
|
238 |
|