Text Classification · Transformers · PyTorch · Safetensors · xlm-roberta · genre · text-genre · Inference Endpoints
TajaKuzman committed
Commit f98b0d9 (1 parent: 39047a3)

Update README.md

Files changed (1):
  1. README.md +12 -13
README.md CHANGED
@@ -114,11 +114,11 @@ widget:
 
 # Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres - X-GENRE classifier
 
- Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets comprising of texts, annotated with genre categories: Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification, applied to any text in a language, supported by the `xlm-roberta-base`.
+ Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification on texts in any language supported by `xlm-roberta-base`.
 
 ## Model description
 
- The model was fine-tuned on the "X-GENRE" dataset which consists of three genre datasets: CORE, FTD and GINCO dataset. Each of the datasets has their own genre schema, so they were combined into a joint schema ("X-GENRE" schema) based on cross-dataset experiments (described in details [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)). The joint schema was mapped to the datasets in the merged dataset.
+ The model was fine-tuned on the "X-GENRE" dataset, which combines three genre datasets: CORE, FTD and GINCO. Each dataset has its own genre schema, so the schemata were merged into a joint "X-GENRE" schema based on a comparison of the labels and on cross-dataset experiments (described in detail [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)).
 
 ## X-GENRE categories
 
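To make the joint schema concrete, the merge can be thought of as a lookup from (dataset, original label) pairs to X-GENRE labels. The fragment below is purely illustrative (the label names and pairings are hypothetical); the actual mapping is documented in the Genre-Datasets-Comparison repository linked above.

```python
# Hypothetical fragment of a dataset-label -> X-GENRE mapping;
# the real mapping is defined in the Genre-Datasets-Comparison repo.
X_GENRE_MAP = {
    ("FTD", "A8 (news)"): "News",
    ("GINCO", "News/Reporting"): "News",
    ("FTD", "A7 (instruction)"): "Instruction",
    ("CORE", "How-To/Instructional"): "Instruction",
}

def to_x_genre(dataset: str, original_label: str) -> str:
    """Map a dataset-specific genre label to the joint X-GENRE schema."""
    return X_GENRE_MAP[(dataset, original_label)]
```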
@@ -146,7 +146,7 @@ Description of labels:
 
 ### Fine-tuning hyperparameters
 
- Fine-tuning was performed with `simpletransformers`. Beforehand a brief hyperparameter optimization was performed and the presumed optimal hyperparameters are:
+ Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
 
 ```python
 model_args= {
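As a minimal sketch of how such a `model_args` dict is used: `simpletransformers` passes it to a `ClassificationModel`, which is then fine-tuned on a DataFrame with "text" and "labels" columns. The hyperparameter values below are placeholders (the diff truncates the actual dict), and the two-row training set is only for illustration.

```python
# Minimal fine-tuning sketch with simpletransformers; the hyperparameter
# values are placeholders, not the exact values from the README.
import pandas as pd
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,        # placeholder
    "learning_rate": 1e-5,         # placeholder
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

# simpletransformers expects a DataFrame with "text" and "labels" columns.
train_df = pd.DataFrame({
    "text": ["Breaking news: ...", "Step 1: whisk the eggs ..."],
    "labels": [0, 1],              # integer-encoded X-GENRE labels
})

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,                  # number of X-GENRE categories
    use_cuda=True,                 # set False if no GPU is available
    args=model_args,
)
model.train_model(train_df)
```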
@@ -163,9 +163,9 @@ model_args= {
 
 An example of preparing data for genre identification and of post-processing the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.
 
- For reliable results, genre classifier should be applied to documents of sufficient length (the rule of thumbs is at least 75 words). It is advised that the predictions, predicted with confidence prediction lower than 0.9, are not used. Furthermore, the label "Other" can be used as another indicator of low confidence of the predictions, as it often indicates that the text does not enough features of any genre, and these predictions can be discarded as well.
+ For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). It is advised not to use predictions with a confidence score lower than 0.9. Furthermore, the label "Other" can serve as an additional indicator of low confidence, as it often means that the text does not have enough features of any genre; these predictions can be discarded as well.
 
- After proposed post-processing (removal of low-confidence predictions, labels "Other" and in this specific case also label "Forum"), the performance on the MaCoCu data based on manual inspection is macro F1: 0.92, micro F1: 0.92.
+ After the proposed post-processing (removal of low-confidence predictions and of the label "Other", and in this specific case also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached a macro and micro F1 of 0.92.
 
 
 ### Use examples
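A sketch of the prediction and post-processing steps recommended above, assuming `simpletransformers`' standard `predict` API; the model path is a placeholder, and the label names are read from the model's Hugging Face config.

```python
# Prediction plus the post-processing recommended above: require at
# least 75 words, keep only predictions with confidence >= 0.9, and
# discard the "Other" (and, as for MaCoCu, "Forum") labels.
import numpy as np
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta", "<path-or-hub-id-of-this-model>", use_cuda=False
)

texts = ["A document of at least roughly 75 words ..."]
predictions, raw_outputs = model.predict(texts)

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

kept = []
for text, pred, logits in zip(texts, predictions, raw_outputs):
    label = model.config.id2label[int(pred)]        # from the HF config
    confidence = float(softmax(np.asarray(logits)).max())
    if (
        len(text.split()) >= 75
        and confidence >= 0.9
        and label not in {"Other", "Forum"}
    ):
        kept.append({"text": text, "label": label, "confidence": confidence})
```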
@@ -198,7 +198,7 @@ predictions
 
 ### Comparison with other models: in-dataset and cross-dataset experiments
 
- The X-GENRE model was compared with `xlm-roberta-base` classifiers, fine-tuned on each of genre datasets separately, using the X-GENRE schemata (see experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
+ The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
 
 In the in-dataset experiments (trained and tested on splits of the same dataset), the X-GENRE classifier outperforms the dataset-specific classifiers on all datasets except FTD, which has a smaller number of X-GENRE labels.
 
@@ -209,7 +209,7 @@ At the in-dataset experiments (trained and tested on splits of the same dataset)
 | CORE | 0.778 | 0.627 |
 | GINCO | 0.754 | 0.75 |
 
- At the cross-dataset experiments (trained on X-GENRE dataset, tested on splits of separate genre datasets), the classifier performs well:
+ When applied to the test splits of each of the datasets, the classifier performs well:
 
 | Trained on | Tested on | Micro F1 | Macro F1 |
 |:-------------|:------------|-----------:|-----------:|
@@ -217,18 +217,17 @@ At the cross-dataset experiments (trained on X-GENRE dataset, tested on splits of separate genre datasets)
 | X-GENRE | FTD | 0.804 | 0.809 |
 | X-GENRE | X-GENRE | 0.797 | 0.794 |
 | X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
- | X-GENRE | SI-GINCO | 0.749 | 0.758 |
+ | X-GENRE | GINCO | 0.749 | 0.758 |
 
- The classifiers was compared with classifiers (which were trained only on one of the datasets) on 2 additional genre datasets (based on the joint mapping) on which it was not fine-tuned:
- - EN-GINCO: English enTenTen20 corpus, annotated with GINCO labels
- - [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus, annotated with CORE labels
+ The classifier was also compared with the dataset-specific classifiers on 2 additional genre datasets (to which the X-GENRE schema was mapped) on which none of the classifiers were fine-tuned:
+ - EN-GINCO: a sample of the English enTenTen20 corpus
+ - [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus
 
 | Trained on | Tested on | Micro F1 | Macro F1 |
 |:-------------|:------------|-----------:|-----------:|
 | X-GENRE | EN-GINCO | 0.688 | 0.691 |
 | X-GENRE | FinCORE | 0.674 | 0.581 |
- | MT-GINCO | EN-GINCO | 0.654 | 0.538 |
- | SI-GINCO | EN-GINCO | 0.632 | 0.502 |
+ | GINCO | EN-GINCO | 0.632 | 0.502 |
 | FTD | EN-GINCO | 0.574 | 0.475 |
 | CORE | EN-GINCO | 0.485 | 0.422 |
 