license: mit
language:
- sk
pipeline_tag: token-classification
tags:
- SlovakBERT
SlovakBERT address NER
A SlovakBERT-based model for named entity recognition of Slovak addresses.
This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and Nettle, s.r.o., a Slovak start-up focusing on natural language processing, chatbots and voicebots.
Model usage
The model recognizes the following entities: STREET, HOUSENUMBER, MUNICIPALITY, POSTALNUMBER. It uses the BIO annotation scheme, so together with the O label it has 9 labels in total (a B- and an I- label for each of the 4 entities, plus O).
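The label count above can be checked with a short sketch. The entity names come from this card; the exact label strings (for example "B-STREET") and their order are assumptions and may differ from the model's actual configuration.

```python
# Entity types named in this model card.
ENTITY_TYPES = ["STREET", "HOUSENUMBER", "MUNICIPALITY", "POSTALNUMBER"]


def build_bio_labels(entity_types):
    """Return a BIO label set: "O" plus a B- and an I- tag per entity type.

    The "B-"/"I-" prefix spelling is an assumption for illustration.
    """
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity span
        labels.append(f"I-{entity}")  # continuation token of the same span
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
print(len(LABELS))  # 4 entities * 2 tags + "O" = 9
```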
It is intended to be used only on SLOVAK addresses.
The primary use is annotating input from speech-to-text transcriptions, so the model handles natural speech hesitations (e.g. "Ďalej", "no", "uh").
Both the house number and the cadastral registration number are labelled as HOUSENUMBER. Names of municipality parts are also labelled as MUNICIPALITY.
Preprocessing and input format
The input must be preprocessed so that it contains no commas. It can be lower case or upper case, and may even contain errors such as a proper noun starting with a lowercase letter. The house number can have two parts separated by a slash and can end with a letter from A to F (e.g. "Mätová ulica 97/25C"). The postal number always consists of 5 digits, optionally split into groups of 3 and 2 digits (e.g. "923 12", "84401"). The street name can contain abbreviated parts (e.g. "Ulica J. Matúšku").
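The format rules above could be sketched as a small preprocessing helper. This is only an illustration: the function names and regular expressions are hypothetical, derived from the description in this section, and are not part of the model or its training code. Accepting lowercase letters a to f in house numbers is also an assumption.

```python
import re

# Hypothetical patterns reconstructed from the format description;
# they are not shipped with the model.
POSTAL_RE = re.compile(r"\d{3} ?\d{2}")           # "923 12" or "84401"
HOUSE_RE = re.compile(r"\d+(/\d+)?[A-Fa-f]?")     # "97", "97/25", "97/25C"


def strip_commas(text: str) -> str:
    """Replace commas with spaces and collapse repeated whitespace."""
    return " ".join(text.replace(",", " ").split())


def looks_like_postal_number(token: str) -> bool:
    """True if the whole token is 5 digits, optionally split 3 + 2."""
    return POSTAL_RE.fullmatch(token) is not None


def looks_like_house_number(token: str) -> bool:
    """True if the whole token is digits, optional /digits, optional A-F."""
    return HOUSE_RE.fullmatch(token) is not None


# Made-up input combining the examples from this section.
print(strip_commas("Mätová ulica 97/25C, 923 12"))
```

The checks are useful mainly for sanity-testing inputs or post-processing predictions; the model itself needs only the comma removal.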
Code example
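from transformers import pipeline
# Load a token-classification pipeline with the address NER model.
ner_pipeline = pipeline(task='ner', model='nettle-ai/slovakbert-address-ner')
# Example input without commas, including speech hesitations ("uhm", "no").
input_sentence = "Žiškova uhm 21 85510 no Pezinok"
# The result is a list with one dict per token, holding its predicted label and score.
classifications = ner_pipeline(input_sentence)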
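The token-level output can be merged into whole entities. The sketch below assumes the pipeline's `aggregation_strategy="simple"` option, which groups adjacent tokens of the same entity and returns dicts with an `entity_group` and a `word` key. The helper names `annotate` and `group_entities` are hypothetical, and the sample list is a hand-written stand-in for pipeline output, not real model predictions.

```python
from collections import defaultdict


def annotate(text, model_name="nettle-ai/slovakbert-address-ner"):
    """Run the model with aggregated output (downloads the model on first use)."""
    # Imported lazily so group_entities can be used without transformers installed.
    from transformers import pipeline

    ner = pipeline(task="ner", model=model_name, aggregation_strategy="simple")
    return ner(text)


def group_entities(predictions):
    """Join the predicted words of each entity type, in order of appearance."""
    grouped = defaultdict(list)
    for item in predictions:
        grouped[item["entity_group"]].append(item["word"])
    return {entity: " ".join(words) for entity, words in grouped.items()}


# Hand-written stand-in for aggregated pipeline output (NOT real predictions).
sample = [
    {"entity_group": "STREET", "word": "Žiškova"},
    {"entity_group": "HOUSENUMBER", "word": "21"},
    {"entity_group": "POSTALNUMBER", "word": "85510"},
    {"entity_group": "MUNICIPALITY", "word": "Pezinok"},
]
address = group_entities(sample)
print(address)
```

With the real model, `group_entities(annotate(input_sentence))` would produce the same kind of dictionary; tokens the model labels as O, such as hesitations, do not appear in the aggregated output.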
Acknowledgement
The research results were obtained with the support of the Slovak National Competence Centre for HPC, the EuroCC 2 project, and the Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.
Framework Versions
- Transformers 4.26.0
- PyTorch 1.13.1
- Tokenizers 0.13.2