slovakbert-ner / README.md
crabz's picture
Update README.md
2e62fdd
|
raw
history blame
No virus
5.7 kB
metadata
license: mit
language:
  - sk
tags:
  - generated_from_trainer
datasets:
  - wikiann
metrics:
  - precision
  - recall
  - f1
  - accuracy
inference: false
widget:
  - text: Zuzana Čaputová sa narodila 21. júna 1973 v Bratislave.
    example_title: Named Entity Recognition
model-index:
  - name: slovakbert-ner
    results:
      - task:
          name: Token Classification
          type: token-classification
        dataset:
          name: wikiann
          type: wikiann
          args: sk
        metrics:
          - name: Precision
            type: precision
            value: 0.9327115256495669
          - name: Recall
            type: recall
            value: 0.9470124013528749
          - name: F1
            type: f1
            value: 0.9398075632132469
          - name: Accuracy
            type: accuracy
            value: 0.9785228256835333

Named Entity Recognition based on SlovakBERT

This model is a fine-tuned version of gerulata/slovakbert on the Slovak wikiann dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1600
  • Precision: 0.9327
  • Recall: 0.9470
  • F1: 0.9398
  • Accuracy: 0.9785

Intended uses & limitations

Supported classes: LOCATION, PERSON, ORGANIZATION

from transformers import pipeline


ner_pipeline = pipeline(task='ner', model='crabz/slovakbert-ner')
input_sentence = "Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO Igor Matovič upozorňuje, že následky tretej vlny budú na Slovensku veľmi veľké."
classifications = ner_pipeline(input_sentence)

with displaCy:

import spacy
from spacy import displacy


ner_map = {0: '0', 1: 'B-OSOBA', 2: 'I-OSOBA', 3: 'B-ORGANIZÁCIA', 4: 'I-ORGANIZÁCIA', 5: 'B-LOKALITA', 6: 'I-LOKALITA'}

entities = []
for i in range(len(classifications)):
    if classifications[i]['entity'] != 0:
        if ner_map[classifications[i]['entity']][0] == 'B':
            j = i + 1
            while j < len(classifications) and ner_map[classifications[j]['entity']][0] == 'I':
                j += 1
            entities.append((ner_map[classifications[i]['entity']].split('-')[1], classifications[i]['start'],
                             classifications[j - 1]['end']))

nlp = spacy.blank("en")  # it should work with any language

doc = nlp(input_sentence)

ents = []
for ee in entities:
    ents.append(doc.char_span(ee[1], ee[2], ee[0]))

doc.ents = ents

options = {"ents": ["OSOBA", "ORGANIZÁCIA", "LOKALITA"],
           "colors": {"OSOBA": "lightblue", "ORGANIZÁCIA": "lightcoral", "LOKALITA": "lightgreen"}}
displacy_html = displacy.render(doc, style="ent", options=options)
Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO ORGANIZÁCIA Igor Matovič OSOBA upozorňuje, že následky tretej vlny budú na Slovensku LOKALITA veľmi veľké.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 32
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 15.0

Training results

Training Loss Epoch Step Validation Loss Precision Recall F1 Accuracy
0.2342 1.0 625 0.1233 0.8891 0.9076 0.8982 0.9667
0.1114 2.0 1250 0.1079 0.9118 0.9269 0.9193 0.9725
0.0817 3.0 1875 0.1093 0.9173 0.9315 0.9243 0.9747
0.0438 4.0 2500 0.1076 0.9188 0.9353 0.9270 0.9743
0.028 5.0 3125 0.1230 0.9143 0.9387 0.9264 0.9744
0.0256 6.0 3750 0.1204 0.9246 0.9423 0.9334 0.9765
0.018 7.0 4375 0.1332 0.9292 0.9416 0.9353 0.9770
0.0107 8.0 5000 0.1339 0.9280 0.9427 0.9353 0.9769
0.0079 9.0 5625 0.1368 0.9326 0.9442 0.9383 0.9785
0.0065 10.0 6250 0.1490 0.9284 0.9445 0.9364 0.9772
0.0061 11.0 6875 0.1566 0.9328 0.9433 0.9380 0.9778
0.0031 12.0 7500 0.1555 0.9339 0.9473 0.9406 0.9787
0.0024 13.0 8125 0.1548 0.9349 0.9462 0.9405 0.9787
0.0015 14.0 8750 0.1562 0.9330 0.9469 0.9399 0.9788
0.0013 15.0 9375 0.1600 0.9327 0.9470 0.9398 0.9785

Framework versions

  • Transformers 4.13.0.dev0
  • Pytorch 1.10.0+cu113
  • Datasets 1.15.1
  • Tokenizers 0.10.3