TajaKuzman committed on
Commit
39047a3
1 Parent(s): 874e90e

Update README.md

  1. README.md +50 -2
README.md CHANGED
@@ -112,11 +112,15 @@ widget:
112
 
113
  ---
114
 
115
- # Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
116
 
117
  Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification and can be applied to text in any language supported by `xlm-roberta-base`.
118
 
119
- ## Genre categories
120
 
121
  List of labels:
122
  ```
@@ -157,6 +161,13 @@ model_args= {
157
 
158
  ## Usage
159
 
160
  ### Use examples
161
 
162
  ```python
@@ -185,6 +196,43 @@ predictions
185
 
186
  ## Performance
187
 
188
 
189
  ## Citation
190
 
 
112
 
113
  ---
114
 
115
+ # Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres - X-GENRE classifier
116
 
117
  Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification and can be applied to text in any language supported by `xlm-roberta-base`.
118
 
119
+ ## Model description
120
+
121
+ The model was fine-tuned on the "X-GENRE" dataset, which combines three genre datasets: CORE, FTD and GINCO. Each dataset has its own genre schema, so the schemata were combined into a joint schema (the "X-GENRE" schema) based on cross-dataset experiments (described in detail [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)). The joint schema was then applied to all three datasets in the merged dataset.
122
+
123
+ ## X-GENRE categories
124
 
125
  List of labels:
126
  ```
 
161
 
162
  ## Usage
163
 
164
+ An example of preparing data for genre identification and post-processing the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.
165
+
166
+ For reliable results, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words). We advise discarding predictions made with a confidence lower than 0.9. Furthermore, the label "Other" can serve as another indicator of low confidence, as it often indicates that the text does not have enough features of any genre; these predictions can be discarded as well.
167
+
168
+ After the proposed post-processing (removal of low-confidence predictions, of the label "Other" and, in this specific case, also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, is macro F1: 0.92, micro F1: 0.92.
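
The post-processing rules described above can be sketched as a small filter. This is only an illustration, not the code used for the MaCoCu experiments: the function name `keep_prediction`, the constants, and the `(text, label, confidence)` tuple layout are hypothetical, and the confidence is assumed to be the probability score attached to the predicted label.

```python
MIN_WORDS = 75                # rule of thumb from the text above
MIN_CONFIDENCE = 0.9          # predictions below this value are discarded
DISCARDED_LABELS = {"Other"}  # add "Forum" to mirror the MaCoCu setup


def keep_prediction(text, label, confidence, discarded_labels=DISCARDED_LABELS):
    """Return True if a prediction passes the length, confidence and label filters."""
    if len(text.split()) < MIN_WORDS:
        return False
    if confidence < MIN_CONFIDENCE:
        return False
    return label not in discarded_labels


# Hypothetical classifier output: (text, predicted label, confidence).
long_text = "word " * 80
short_text = "word " * 10
predictions = [
    (long_text, "Forum", 0.97),   # kept
    (long_text, "Forum", 0.62),   # dropped: low confidence
    (long_text, "Other", 0.95),   # dropped: label "Other"
    (short_text, "Forum", 0.99),  # dropped: fewer than 75 words
]
kept = [p for p in predictions if keep_prediction(*p)]
```

Passing `discarded_labels={"Other", "Forum"}` would also drop "Forum" predictions, as was done in the MaCoCu case; whether to do so depends on the data at hand.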
169
+
170
+
171
  ### Use examples
172
 
173
  ```python
 
196
 
197
  ## Performance
198
 
199
+ ### Comparison with other models in in-dataset and cross-dataset experiments
200
+
201
+ The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
202
+
203
+ In the in-dataset experiments (trained and tested on splits of the same dataset), it outperforms all classifiers except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.
204
+
205
+ | Trained on | Micro F1 | Macro F1 |
206
+ |:-------------|-----------:|-----------:|
207
+ | FTD | 0.843 | 0.851 |
208
+ | X-GENRE | 0.797 | 0.794 |
209
+ | CORE | 0.778 | 0.627 |
210
+ | GINCO | 0.754 | 0.75 |
211
+
212
+ In the cross-dataset experiments (trained on the X-GENRE dataset, tested on splits of the separate genre datasets), the classifier performs well, with micro F1 scores between 0.749 and 0.837:
213
+
214
+ | Trained on | Tested on | Micro F1 | Macro F1 |
215
+ |:-------------|:------------|-----------:|-----------:|
216
+ | X-GENRE | CORE | 0.837 | 0.859 |
217
+ | X-GENRE | FTD | 0.804 | 0.809 |
218
+ | X-GENRE | X-GENRE | 0.797 | 0.794 |
219
+ | X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
220
+ | X-GENRE | SI-GINCO | 0.749 | 0.758 |
221
+
222
+ The classifier was also compared with classifiers trained on only one of the datasets, using two additional genre datasets (mapped to the joint schema) on which it was not fine-tuned:
223
+ - EN-GINCO: English enTenTen20 corpus, annotated with GINCO labels
224
+ - [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus, annotated with CORE labels
225
+
226
+ | Trained on | Tested on | Micro F1 | Macro F1 |
227
+ |:-------------|:------------|-----------:|-----------:|
228
+ | X-GENRE | EN-GINCO | 0.688 | 0.691 |
229
+ | X-GENRE | FinCORE | 0.674 | 0.581 |
230
+ | MT-GINCO | EN-GINCO | 0.654 | 0.538 |
231
+ | SI-GINCO | EN-GINCO | 0.632 | 0.502 |
232
+ | FTD | EN-GINCO | 0.574 | 0.475 |
233
+ | CORE | EN-GINCO | 0.485 | 0.422 |
234
+
235
+ The cross-dataset and cross-lingual experiments show that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.
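
The tables above report both micro and macro F1, and the two can diverge noticeably when some labels are rarer or harder than others (see the CORE row in the in-dataset table). The sketch below shows one common way to compute both scores for single-label classification. It is an assumption that the reported numbers follow this definition, since the README does not state it; the function `f1_scores` and the toy labels are made up for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[true] += 1  # true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data: the rarer label is misclassified once.
y_true = ["Forum", "Forum", "Forum", "Forum", "Other", "Other"]
y_pred = ["Forum", "Forum", "Forum", "Forum", "Forum", "Other"]
micro, macro = f1_scores(y_true, y_pred)
```

In this toy case the macro score is lower than the micro score, because the single error weighs more heavily on the F1 of the rarer label.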
236
 
237
  ## Citation
238