TajaKuzman committed f98b0d9 (parent: 39047a3): Update README.md

README.md (updated sections):
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres - X-GENRE classifier

Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.
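
The model card's own use examples appear further below (they are not shown in full in this excerpt). As a rough, non-authoritative sketch, such a classifier could be queried with the Hugging Face `transformers` text-classification pipeline roughly as follows; the model identifier is a placeholder, not a confirmed repository name:

```python
from transformers import pipeline

# Placeholder: replace with the actual Hugging Face model ID of this classifier.
MODEL_ID = "<this-X-GENRE-classifier-model-id>"

# Sketch only: load the fine-tuned sequence-classification model as a pipeline.
classifier = pipeline("text-classification", model=MODEL_ID)

texts = [
    "On our page you can buy the best widgets on the market.",
    "The government announced a new package of measures on Tuesday.",
]
for prediction in classifier(texts, truncation=True):
    # Each prediction is a dict with the predicted label and its confidence score.
    print(prediction["label"], round(prediction["score"], 3))
```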
## Model description

The model was fine-tuned on the "X-GENRE" dataset, which consists of three genre datasets: CORE, FTD and GINCO. Each of these datasets has its own genre schema, so they were combined into a joint schema (the "X-GENRE" schema) based on a comparison of the labels and on cross-dataset experiments (described in detail [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison)).

## X-GENRE categories

### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args= {
    # ... (hyperparameter values omitted in this excerpt)
}
```
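
Since the full argument dictionary is not visible in this excerpt, the snippet below is only a minimal, hypothetical sketch of how an `xlm-roberta-base` classifier can be fine-tuned with `simpletransformers`; the hyperparameter values, the toy training data and the label count of 9 are illustrative assumptions, not the settings documented in the full model card.

```python
# Hypothetical sketch: fine-tuning xlm-roberta-base for genre classification
# with simpletransformers. All values below are placeholders for illustration.
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy training data: a DataFrame with "text" and integer-encoded "labels" columns,
# as expected by simpletransformers.
train_df = pd.DataFrame({
    "text": ["Buy our new product today!", "The parliament passed the bill on Monday."],
    "labels": [0, 1],
})

model_args = {
    "num_train_epochs": 15,        # placeholder value
    "learning_rate": 1e-5,         # placeholder value
    "max_seq_length": 512,         # placeholder value
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "xlmroberta",                  # model type for xlm-roberta-base
    "xlm-roberta-base",
    num_labels=9,                  # assumed number of X-GENRE labels
    args=model_args,
    use_cuda=False,                # set to True if a GPU is available
)
model.train_model(train_df)
```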
An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.

For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). It is advised not to use predictions made with a confidence lower than 0.9. Furthermore, the label "Other" can be used as an additional indicator of low confidence, as it often indicates that the text does not have enough features of any genre; these predictions can be discarded as well.

After the proposed post-processing (removal of low-confidence predictions, of the label "Other" and, in this specific case, also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached macro and micro F1 of 0.92.
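
A minimal sketch of this kind of post-processing is shown below; the DataFrame and its column names (`text`, `predicted_label`, `confidence`) are assumptions for illustration, not the exact pipeline used for MaCoCu (that one is linked above).

```python
import pandas as pd

# Hypothetical predictions table; column names are assumed for illustration.
predictions = pd.DataFrame({
    "text": ["short snippet", "a sufficiently long document " * 30],
    "predicted_label": ["Other", "News"],
    "confidence": [0.55, 0.97],
})

MIN_WORDS = 75                   # rule of thumb for document length
MIN_CONFIDENCE = 0.9             # discard less confident predictions
DISCARDED_LABELS = {"Other"}     # for MaCoCu, "Forum" was discarded as well

word_counts = predictions["text"].str.split().str.len()
keep = (
    (word_counts >= MIN_WORDS)
    & (predictions["confidence"] >= MIN_CONFIDENCE)
    & (~predictions["predicted_label"].isin(DISCARDED_LABELS))
)
reliable = predictions[keep]
print(reliable[["predicted_label", "confidence"]])
```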
### Use examples

### Comparison with other models in in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset), the X-GENRE classifier outperforms all of the dataset-specific classifiers, except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.

| Trained and tested on | Micro F1 | Macro F1 |
|:----------------------|---------:|---------:|
| CORE | 0.778 | 0.627 |
| GINCO | 0.754 | 0.75 |

When applied to the test splits of each of the datasets, the classifier performs well:

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | FTD | 0.804 | 0.809 |
| X-GENRE | X-GENRE | 0.797 | 0.794 |
| X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
| X-GENRE | GINCO | 0.749 | 0.758 |

The classifier was also compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO: a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus

| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | EN-GINCO | 0.688 | 0.691 |
| X-GENRE | FinCORE | 0.674 | 0.581 |
| GINCO | EN-GINCO | 0.632 | 0.502 |
| FTD | EN-GINCO | 0.574 | 0.475 |
| CORE | EN-GINCO | 0.485 | 0.422 |
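
For reference, the Micro and Macro F1 values in these tables are the standard micro- and macro-averaged F1 scores; a quick sketch with `scikit-learn` (on made-up labels, not the evaluation data) illustrates the two averages:

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted genre labels, for illustration only.
y_true = ["News", "News", "Promotion", "Opinion/Argumentation", "Instruction"]
y_pred = ["News", "Promotion", "Promotion", "Opinion/Argumentation", "News"]

# Micro F1 aggregates over all instances; Macro F1 averages per-label F1 scores equally.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Micro F1: {micro_f1:.3f}, Macro F1: {macro_f1:.3f}")
```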