Arabic NER Model for AQMAR dataset

Training was conducted over 86 epochs, using a linear decaying learning rate of 2e-05, starting from 0.3 and a batch size of 48 with fastText and Flair forward and backward embeddings.

Original Dataset:

AQMAR

Results:

F1-score (micro) 0.9323
F1-score (macro) 0.9272

	True Posititves	False Positives	False Negatives	Precision	Recall	class-F1
LOC	164	7	13	0.9591	0.9266	0.9425
MISC	398	22	37	0.9476	0.9149	0.9310
ORG	65	6	9	0.9155	0.8784	0.8966
PER	199	13	13	0.9387	0.9387	0.9387

Usage

from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby
from icecream import ic

arTagger = SequenceTagger.load('megantosh/flair-arabic-MSA-aqmar')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')


# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentence with predicted tags
ic(sentence.to_tagged_string)
ic(arSentence.to_tagged_string)

Example

see an example from a similar NER model in Flair

Model Configuration

  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('ar')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4396, out_features=4396, bias=True)
  (rnn): LSTM(4396, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=14, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Corpus: "Corpus: 3025 train + 336 dev + 373 test sentences"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Parameters:
2021-03-31 22:19:50,654  - learning_rate: "0.3"
2021-03-31 22:19:50,654  - mini_batch_size: "48"
2021-03-31 22:19:50,654  - patience: "3"
2021-03-31 22:19:50,654  - anneal_factor: "0.5"
2021-03-31 22:19:50,654  - max_epochs: "150"
2021-03-31 22:19:50,654  - shuffle: "True"
2021-03-31 22:19:50,654  - train_with_dev: "False"
2021-03-31 22:19:50,654  - batch_growth_annealing: "False"
2021-03-31 22:19:50,655 ------------------------------------

Due to the right-to-left in left-to-right context, some formatting errors might occur. and your code might appear like this, (link accessed on 2020-10-27)

Citation

if you use this model, please consider citing this work:

@unpublished{MMHU21
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084"
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}