---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: >-
    On our site, you can find a great genre identification model which you can
    use for thousands of different tasks. For free!
  example_title: English
- text: >-
    Na naši spletni strani lahko najdete odličen model za prepoznavanje žanrov,
    ki ga lahko uporabite pri na tisoče različnih nalogah. In to brezplačno!
  example_title: Slovene
- text: >-
    Sur notre site, vous trouverez un modèle d'identification de genre très
    intéressant que vous pourrez utiliser pour des milliers de tâches
    différentes. C'est gratuit !
  example_title: French
datasets:
- TajaKuzman/X-GENRE-text-genre-dataset
base_model:
- FacebookAI/xlm-roberta-base
---

# X-GENRE classifier - multilingual text genre classifier

A text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
and fine-tuned on the [multilingual, manually annotated X-GENRE genre dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset).
The model can be used for automatic genre identification and applied to any text in a language supported by `xlm-roberta-base`.

The details on the model development, the datasets and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).

The model can also be downloaded from the [CLARIN.SI repository](http://hdl.handle.net/11356/1961).

If you use the model, please cite the paper:

```
@article{kuzman2023automatic,
  title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
  author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
  journal={Machine Learning and Knowledge Extraction},
  volume={5},
  number={3},
  pages={1149--1175},
  year={2023},
  publisher={MDPI}
}
```

## AGILE - Automatic Genre Identification Benchmark

We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability
for the automatic enrichment of large text collections with genre information.
You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).

In an out-of-dataset scenario (evaluating the model on the manually annotated English EN-GINCO dataset, available upon request, on which it was not trained),
the model outperforms all other technologies:

| Model                       |   micro F1 |   macro F1 |   accuracy |
|:----------------------------|-----------:|-----------:|-----------:|
| **XLM-RoBERTa, fine-tuned on the X-GENRE dataset - X-GENRE classifier**  (Kuzman et al. 2023)                   |       0.68 |       0.69 |       0.68 |
| GPT-4 (7/7/2023)  (Kuzman et al. 2023)            |       0.65 |       0.55 |       0.65 |
| GPT-3.5-turbo (Kuzman et al. 2023)    |       0.63 |       0.53 |       0.63 |
| SVM  (Kuzman et al. 2023)                       |       0.49 |       0.51 |       0.49 |
| Logistic Regression (Kuzman et al. 2023)        |       0.49 |       0.47 |       0.49 |
| FastText (Kuzman et al. 2023)                   |       0.45 |       0.41 |       0.45 |
| Naive Bayes  (Kuzman et al. 2023)             |       0.36 |       0.29 |       0.36 |
| mt0                        |       0.32 |       0.23 |       0.27 |
| Zero-Shot classification with `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` @ HuggingFace                 |       0.2  |       0.15 |       0.2  |
| Dummy Classifier (stratified) (Kuzman et al. 2023)|       0.14 |       0.1  |       0.14 |


## Intended use and limitations

### Usage

An example of preparing data for genre identification and post-processing the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.

For reliable results, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words).
We advise against using predictions made with a confidence lower than 0.9. The label "Other" can serve as a further indicator of low confidence, as it often means that the text does not have enough features of any genre; these predictions can be discarded as well.

After the proposed post-processing (removal of low-confidence predictions, of the label "Other" and, in this specific case, also of the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached a macro and micro F1 of 0.92.
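
The post-processing described above can be sketched as a small filter. This is a minimal illustration, not the code used for the MaCoCu experiments: the function names `softmax` and `keep_reliable` are hypothetical, and it assumes that confidence is the softmax probability of the top class, computed from raw logits such as the second value returned by `model.predict`.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_reliable(texts, logits, id2label, min_confidence=0.9,
                  min_words=75, discard_labels=("Other",)):
    """Return (text, label, confidence) for predictions that pass all filters.

    Assumption: confidence is the softmax probability of the top class.
    """
    kept = []
    for text, row in zip(texts, logits):
        probs = softmax(list(row))
        best = max(range(len(probs)), key=probs.__getitem__)
        label, confidence = id2label[best], probs[best]
        if len(text.split()) < min_words:
            continue  # too short for a reliable prediction
        if confidence < min_confidence:
            continue  # low-confidence prediction
        if label in discard_labels:
            continue  # e.g. "Other" signals too few genre features
        kept.append((text, label, confidence))
    return kept
```

Passing `discard_labels=("Other", "Forum")` would mirror the additional removal of "Forum" that was applied in the MaCoCu case.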


### Use examples

```python
import torch
from simpletransformers.classification import ClassificationModel

# The training-related arguments have no effect on prediction
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "silent": True,
}

# Load the fine-tuned classifier; use the GPU only if one is available
model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=torch.cuda.is_available(),
    args=model_args,
)

texts = [
    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
]

# predict() returns the predicted label ids and the raw model outputs (logits)
predictions, logit_output = model.predict(texts)
predictions
# Output: array([3, 8])

# Map the label ids to genre names
[model.config.id2label[i] for i in predictions]
# Output: ['Instruction', 'Promotion']
```

A usage example for prediction on a dataset with batch processing is available on [Google Colab](https://colab.research.google.com/drive/1yC4L_p2t3oMViC37GqSjJynQH-EWyhLr?usp=sharing).
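
For corpora that are too large to pass to the model in a single call, the texts can be split into chunks first. The following is a minimal sketch and not the code from the notebook: `batched` and `predict_in_batches` are hypothetical helpers, and `predict_fn` is assumed to be any callable that, like `model.predict`, returns a pair of predicted label ids and raw outputs.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict_fn, texts, batch_size=8):
    """Apply `predict_fn` batch by batch and collect the predicted label ids."""
    label_ids = []
    for batch in batched(texts, batch_size):
        # Assumption: predict_fn returns (label ids, raw outputs)
        predictions, _raw_outputs = predict_fn(batch)
        label_ids.extend(int(p) for p in predictions)
    return label_ids
```

With the classifier loaded as in the example above, a call could look like `predict_in_batches(model.predict, texts, batch_size=32)`.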

## X-GENRE categories

### List of labels

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

labels_map = {'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction': 3, 'Opinion/Argumentation': 4, 'Forum': 5, 'Prose/Lyrical': 6, 'Legal': 7, 'Promotion': 8}
```
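
As a small illustration of how the list and the map relate, the sketch below derives the map from the list and inverts it to decode predicted label ids. The helper name `decode` is hypothetical; the label order is taken from the list above.

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction',
               'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal',
               'Promotion']

# The position of a label in the list is its numeric id
labels_map = {label: index for index, label in enumerate(labels_list)}

# Inverse mapping, used to turn predicted ids back into genre names
id2label = {index: label for label, index in labels_map.items()}


def decode(prediction_ids):
    """Map an iterable of predicted label ids to genre names."""
    return [id2label[int(i)] for i in prediction_ids]


print(decode([3, 8]))  # ['Instruction', 'Promotion']
```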

### Description of labels

| Label | Description | Examples |
|:------|:------------|:---------|
| Information/Explanation | An objective text that describes or presents an event, a person, a thing, a concept etc. Its main purpose is to inform the reader about something. Common features: objective/factual, explanation/definition of a concept (x is …), enumeration. | research article, encyclopedia article, informational blog, product specification, course materials, general information, job description, manual, horoscope, travel guide, glossaries, historical article, biographical story/history |
| Instruction | An objective text which instructs the readers on how to do something. Common features: multiple steps/actions, chronological order, 1st person plural or 2nd person, modality (must, have to, need to, can, etc.), adverbial clauses of manner (in a way that), of condition (if), of time (after …). | how-to texts, recipes, technical support |
| Legal | An objective formal text that contains legal terms and is clearly structured. The name of the text type is often included in the headline (contract, rules, amendment, general terms and conditions, etc.). Common features: objective/factual, legal terms, 3rd person. | small print, software license, proclamation, terms and conditions, contracts, law, copyright notices, university regulation |
| News | An objective or subjective text which reports on an event recent at the time of writing or coming in the near future. Common features: adverbs/adverbial clauses of time and/or place (dates, places), many proper nouns, direct or reported speech, past tense. | news report, sports report, travel blog, reportage, police report, announcement |
| Opinion/Argumentation | A subjective text in which the authors convey their opinion or narrate their experience. It includes promotion of an ideology and other non-commercial causes. This genre includes a subjective narration of a personal experience as well. Common features: adjectives/adverbs that convey opinion, words that convey (un)certainty (certainly, surely), 1st person, exclamation marks. | review, blog (personal blog, travel blog), editorial, advice, letter to editor, persuasive article or essay, formal speech, pamphlet, political propaganda, columns, political manifesto |
| Promotion | A subjective text intended to sell or promote an event, product, or service. It addresses the readers, often trying to convince them to participate in something or buy something. Common features: contains adjectives/adverbs that promote something (high-quality, perfect, amazing), comparative and superlative forms of adjectives and adverbs (the best, the greatest, the cheapest), addressing the reader (usage of 2nd person), exclamation marks. | advertisement, promotion of a product (e-shops), promotion of an accommodation, promotion of company's services, invitation to an event |
| Forum | A text in which people discuss a certain topic in the form of comments. Common features: multiple authors, informal language, subjective (the writers express their opinions), written in 1st person. | discussion forum, reader/viewer responses, QA forum |
| Prose/Lyrical | A literary text that consists of paragraphs or verses. A literary text is deemed to have no other practical purpose than to give pleasure to the reader. Often the author pays attention to the aesthetic appearance of the text. It can be considered as art. | lyrics, poem, prayer, joke, novel, short story |
| Other | A text that does not fall under any of the other genre categories. | |


## Performance

### Comparison with other models at in-dataset and cross-dataset experiments

The X-GENRE classifier was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately,
using the X-GENRE schema (see the experiments at https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset),
it outperforms all classifiers except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.

| Trained on   |   Micro F1 |   Macro F1 |
|:-------------|-----------:|-----------:|
| FTD          |      0.843 |      0.851 |
| X-GENRE      |      0.797 |      0.794 |
| CORE         |      0.778 |      0.627 |
| GINCO     |      0.754 |      0.75  |

When applied on test splits of each of the datasets, the classifier performs well:

| Trained on   | Tested on   |   Micro F1 |   Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE      | CORE        |      0.837 |      0.859 |
| X-GENRE      | FTD         |      0.804 |      0.809 |
| X-GENRE      | X-GENRE     |      0.797 |      0.794 |
| X-GENRE      | X-GENRE-dev |      0.784 |      0.784 |
| X-GENRE      | GINCO    |      0.749 |      0.758 |

The classifier was compared with other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus

| Trained on   | Tested on   |   Micro F1 |   Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE      | EN-GINCO    |      0.688 |      0.691 |
| X-GENRE      | FinCORE    |      0.674 |      0.581 |
| GINCO     | EN-GINCO    |      0.632 |      0.502 |
| FTD          | EN-GINCO    |      0.574 |      0.475 |
| CORE         | EN-GINCO    |      0.485 |      0.422 |

The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.
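
For readers who want to reproduce this kind of comparison on their own data, the micro and macro F1 scores reported in the tables above can be computed as in the following sketch. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function name `f1_scores` is hypothetical, and macro F1 is assumed to be the unweighted mean of the per-label F1 scores over all labels that occur in the gold or predicted labels.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-label counts of true positives, false positives, false negatives
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1

    def f1(t, p, n):
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    # Micro F1 pools the counts over all labels before computing F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages the per-label F1 scores, so rare labels weigh equally
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro
```

Because macro F1 gives every label the same weight, a gap between the two scores, as for the classifier trained on CORE, suggests that performance is uneven across genres.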

### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`.
Beforehand, a brief hyperparameter search was performed; the presumed optimal hyperparameters are:

```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
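
A fine-tuning run with these hyperparameters might look like the sketch below. It is an assumption-laden illustration rather than the authors' training script: the function `fine_tune`, the `output_dir` value and the `use_cuda` default are hypothetical, and the training data is assumed to be a pandas DataFrame with a `text` column and a numeric `labels` column, as `simpletransformers` expects.

```python
MODEL_ARGS = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
NUM_LABELS = 9  # size of the X-GENRE label set


def fine_tune(train_df, use_cuda=False, output_dir="outputs/"):
    """Fine-tune xlm-roberta-base on a DataFrame with `text` and `labels` columns."""
    # Imported lazily so the configuration can be inspected without the library
    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel(
        "xlmroberta",
        "xlm-roberta-base",
        num_labels=NUM_LABELS,
        use_cuda=use_cuda,
        args={**MODEL_ARGS, "output_dir": output_dir},  # output_dir is illustrative
    )
    model.train_model(train_df)
    return model
```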

## Citation

If you use the model, please cite the paper that describes the creation of the [X-GENRE dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset) and the genre classifier:

```
@article{kuzman2023automatic,
  title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
  author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
  journal={Machine Learning and Knowledge Extraction},
  volume={5},
  number={3},
  pages={1149--1175},
  year={2023},
  publisher={MDPI}
}
```