classla/xlm-roberta-base-multilingual-text-genre-classifier

 ---
 license: cc-by-sa-4.0
+language:
+- multilingual
+- af
+- am
+- ar
+- as
+- az
+- be
+- bg
+- bn
+- br
+- bs
+- ca
+- cs
+- cy
+- da
+- de
+- el
+- en
+- eo
+- es
+- et
+- eu
+- fa
+- fi
+- fr
+- fy
+- ga
+- gd
+- gl
+- gu
+- ha
+- he
+- hi
+- hr
+- hu
+- hy
+- id
+- is
+- it
+- ja
+- jv
+- ka
+- kk
+- km
+- kn
+- ko
+- ku
+- ky
+- la
+- lo
+- lt
+- lv
+- mg
+- mk
+- ml
+- mn
+- mr
+- ms
+- my
+- ne
+- nl
+- no
+- om
+- or
+- pa
+- pl
+- ps
+- pt
+- ro
+- ru
+- sa
+- sd
+- si
+- sk
+- sl
+- so
+- sq
+- sr
+- su
+- sv
+- sw
+- ta
+- te
+- th
+- tl
+- tr
+- ug
+- uk
+- ur
+- uz
+- vi
+- xh
+- yi
+- zh
+tags:
+- text-classification
+- genre
+- text-genre
+widget:
+- text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!"
 ---
+# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
+Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification and can be applied to any text in a language supported by `xlm-roberta-base`.
+## Model description
+### Fine-tuning hyperparameters
+Fine-tuning was performed with `simpletransformers`. A brief hyperparameter optimization was run beforehand, and the presumed optimal hyperparameters are:
+```python
+model_args = {
+    "num_train_epochs": 15,
+    "learning_rate": 1e-5,
+    "max_seq_length": 512,
+}
+```
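As a rough illustration of how these hyperparameters could be passed to `simpletransformers`, the sketch below wraps them in a helper and shows one possible fine-tuning call. The function names `build_model_args` and `fine_tune`, the `num_labels` argument, and the assumption that the training data is a pandas DataFrame with `text` and `labels` columns are illustrative choices, not details taken from the original experiments.

```python
def build_model_args():
    """Return the hyperparameters listed above as a simpletransformers args dict."""
    return {
        "num_train_epochs": 15,
        "learning_rate": 1e-5,
        "max_seq_length": 512,
    }


def fine_tune(train_df, num_labels, use_cuda=True):
    """Fine-tune xlm-roberta-base on a DataFrame with "text" and "labels" columns.

    Hypothetical helper: num_labels must match the number of genre categories
    in your training data, which is not specified in this model card.
    """
    # Imported lazily so build_model_args() works without simpletransformers installed.
    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel(
        "xlmroberta",
        "xlm-roberta-base",
        num_labels=num_labels,
        use_cuda=use_cuda,
        args=build_model_args(),
    )
    model.train_model(train_df)
    return model
```

Calling `fine_tune(train_df, num_labels=...)` downloads the base checkpoint and starts training, so it is left uncalled here.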
+## Intended use and limitations
+## Usage
+### Use examples
+```python
+from simpletransformers.classification import ClassificationModel
+# Same arguments as used for fine-tuning; max_seq_length caps the input length in tokens.
+model_args = {
+    "num_train_epochs": 15,
+    "learning_rate": 1e-5,
+    "max_seq_length": 512,
+}
+# use_cuda=True requires a GPU; set it to False to run on CPU.
+model = ClassificationModel(
+    "xlmroberta",
+    "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
+    use_cuda=True,
+    args=model_args,
+)
+texts = [
+    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
+    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
+]
+# predict() returns the predicted label indices and the raw logits for each text.
+predictions, logit_output = model.predict(texts)
+predictions
+# Output (as displayed in a notebook):
+# array([1, 0])
+```
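The raw logits returned alongside the predictions can be turned into confidence scores, for example to discard uncertain predictions. The following is a minimal sketch, assuming `logit_output` holds one row of logits per input text; the helper name `confident_predictions`, the 0.9 threshold, and the example logits are hypothetical and not part of the original experiments.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_predictions(logit_rows, threshold=0.9):
    """Return (label_index or None, confidence) for each row of logits.

    The label index is replaced by None when the top probability is
    below the (hypothetical) threshold.
    """
    results = []
    for row in logit_rows:
        probs = softmax(list(row))
        best = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[best]
        results.append((best if confidence >= threshold else None, confidence))
    return results


# Made-up logits for illustration; in practice pass logit_output from model.predict().
example_logits = [[0.2, 4.1, -1.0], [1.0, 0.9, 0.8]]
for label, confidence in confident_predictions(example_logits):
    print(label, round(confidence, 3))
```

A suitable threshold depends on the data and the task, so it is worth checking on a held-out sample before filtering a whole corpus.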
+## Performance
+## Citation
+If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
+```
+@misc{Kuzman2022,
+  author = {Kuzman, Taja},
+  title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
+  year = {2022},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
+}
+```
+and the following paper on which the original model is based:
+```
+@article{DBLP:journals/corr/abs-1911-02116,
+  author    = {Alexis Conneau and
+               Kartikay Khandelwal and
+               Naman Goyal and
+               Vishrav Chaudhary and
+               Guillaume Wenzek and
+               Francisco Guzm{\'{a}}n and
+               Edouard Grave and
+               Myle Ott and
+               Luke Zettlemoyer and
+               Veselin Stoyanov},
+  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
+  journal   = {CoRR},
+  volume    = {abs/1911.02116},
+  year      = {2019},
+  url       = {http://arxiv.org/abs/1911.02116},
+  eprinttype = {arXiv},
+  eprint    = {1911.02116},
+  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+To cite the datasets that were used for fine-tuning:
+CORE dataset:
+```
+@article{egbert2015developing,
+  title={Developing a bottom-up, user-based method of web register classification},
+  author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
+  journal={Journal of the Association for Information Science and Technology},
+  volume={66},
+  number={9},
+  pages={1817--1831},
+  year={2015},
+  publisher={Wiley Online Library}
+}
+```
+GINCO dataset:
+```
+@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
+  author    = {Kuzman, Taja  and  Rupnik, Peter  and  Ljube{\v{s}}i{\'c}, Nikola},
+  title     = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
+  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
+  month          = {},
+  year           = {2022},
+  address        = {Marseille, France},
+  publisher      = {European Language Resources Association},
+  pages     = {1584--1594},
+  url       = {https://aclanthology.org/2022.lrec-1.170}
+}
+```
+FTD dataset:
+```
+@article{sharoff2018functional,
+  title={Functional text dimensions for the annotation of web corpora},
+  author={Sharoff, Serge},
+  journal={Corpora},
+  volume={13},
+  number={1},
+  pages={65--95},
+  year={2018},
+  publisher={Edinburgh University Press}
+}
+```
+The datasets are available at:
+1. http://hdl.handle.net/11356/1467 (GINCO)
+2. https://github.com/TurkuNLP/CORE-corpus (CORE)
+3. https://github.com/ssharoff/genre-keras (FTD)