Azerbaijani Named Entity Recognition (NER) Model
This repository contains the code and model for Named Entity Recognition (NER) in the Azerbaijani language. The model is built on the XLM-RoBERTa architecture and fine-tuned on a custom dataset.
Model Description
The model recognizes the following entity types:
- LABEL_0: O: Outside any named entity
- LABEL_1: PERSON: Names of individuals
- LABEL_2: LOCATION: Geographical locations, both man-made and natural
- LABEL_3: ORGANISATION: Names of companies, institutions
- LABEL_4: DATE: Dates or periods
- LABEL_5: TIME: Times of the day
- LABEL_6: MONEY: Monetary values
- LABEL_7: PERCENTAGE: Percentage values
- LABEL_8: FACILITY: Buildings, airports, etc.
- LABEL_9: PRODUCT: Products and goods
- LABEL_10: EVENT: Events and occurrences
- LABEL_11: ART: Artworks, titles of books, songs
- LABEL_12: LAW: Legal documents
- LABEL_13: LANGUAGE: Languages
- LABEL_14: GPE: Countries, cities, states
- LABEL_15: NORP: Nationalities or religious or political groups
- LABEL_16: ORDINAL: Ordinal numbers
- LABEL_17: CARDINAL: Cardinal numbers
- LABEL_18: DISEASE: Diseases and medical conditions
- LABEL_19: CONTACT: Contact information, e.g., phone numbers, emails
- LABEL_20: ADAGE: Proverbs, sayings
- LABEL_21: QUANTITY: Measurements and quantities
- LABEL_22: MISCELLANEOUS: Miscellaneous entities
- LABEL_23: POSITION: Professional or social positions
- LABEL_24: PROJECT: Names of projects or programs
Installation
To use the model, install the required libraries with pip:
pip install transformers
pip install datasets
Usage
from transformers import pipeline, XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
# Load the model and tokenizer
tokenizer = XLMRobertaTokenizerFast.from_pretrained("LocalDoc/ner_azerbaijan")
model = XLMRobertaForTokenClassification.from_pretrained("LocalDoc/ner_azerbaijan")
# Create NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
example = "Komitədən bildirilib ki, sovet dövründə Azərbaycanda cəmi 17 məscid fəaliyyət göstərirdisə, dövlət müstəqilliyinin bərpasından sonra ölkədə 814 məscid tikilib."
# Perform NER
ner_results = nlp(example)
# Mapping of label indices to their descriptions
label_mapping = {
    0: "O",
    1: "PERSON",
    2: "LOCATION",
    3: "ORGANISATION",
    4: "DATE",
    5: "TIME",
    6: "MONEY",
    7: "PERCENTAGE",
    8: "FACILITY",
    9: "PRODUCT",
    10: "EVENT",
    11: "ART",
    12: "LAW",
    13: "LANGUAGE",
    14: "GPE",
    15: "NORP",
    16: "ORDINAL",
    17: "CARDINAL",
    18: "DISEASE",
    19: "CONTACT",
    20: "ADAGE",
    21: "QUANTITY",
    22: "MISCELLANEOUS",
    23: "POSITION",
    24: "PROJECT",
}
# Print results with mapped entity types
for result in ner_results:
    # entity_group looks like "LABEL_14"; the trailing number is the label index
    entity_group = result['entity_group']
    entity_description = label_mapping[int(entity_group.split('_')[-1])]
    print({
        'entity_group': entity_description,
        'score': result['score'],
        'word': result['word'],
        'start': result['start'],
        'end': result['end']
    })
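The results can also be grouped by entity type instead of printed one by one. The following is a minimal sketch, not part of the original repository: the helper name `group_entities`, the shortened `LABEL_NAMES` table, and the sample results are all hypothetical and hand-written for illustration (they are not real model output). It assumes each result carries the `entity_group`, `start`, and `end` keys shown above, skips any span labelled `O`, and slices the original text by character offsets so the surface form is preserved exactly.

```python
from collections import defaultdict

# Subset of the label table above (assumption: extend with the remaining ids as needed).
LABEL_NAMES = {0: "O", 1: "PERSON", 14: "GPE", 17: "CARDINAL"}


def group_entities(text, results, label_names=LABEL_NAMES):
    """Group pipeline-style results by readable entity type, skipping "O" spans."""
    grouped = defaultdict(list)
    for item in results:
        # "LABEL_14" -> 14; unknown ids fall back to the raw label string.
        label_id = int(item["entity_group"].split("_")[-1])
        name = label_names.get(label_id, item["entity_group"])
        if name == "O":
            continue  # not a named entity
        # Slice the original text by character offsets to keep the exact surface form.
        grouped[name].append(text[item["start"]:item["end"]])
    return dict(grouped)


# Hypothetical results, written by hand; real predictions may differ.
sample_text = "Azərbaycanda 17 məscid"
sample_results = [
    {"entity_group": "LABEL_14", "score": 0.99, "word": "Azərbaycanda", "start": 0, "end": 12},
    {"entity_group": "LABEL_17", "score": 0.98, "word": "17", "start": 13, "end": 15},
    {"entity_group": "LABEL_0", "score": 0.99, "word": "məscid", "start": 16, "end": 22},
]

print(group_entities(sample_text, sample_results))
```

With real predictions, the same call would be `group_entities(example, ner_results)`, using the full `label_mapping` as the third argument.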
License
This model is licensed under the CC BY-NC-ND 4.0 license, which sets the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
Non-Commercial: You may not use the material for commercial purposes.
No Derivatives: If you remix, transform, or build upon the material, you may not distribute the modified material.
For more information, please refer to the CC BY-NC-ND 4.0 license.
Contact
For more information, questions, or issues, please contact LocalDoc at [[email protected]].