Tokenization issue with diacritics

#1 opened by banouz

I tried using this tokenizer on diacritized text, but it fails to recognize the SHADDA (โ—Œู‘) and replaces it with a different token. It also fails to tokenize multi-word tokens correctly when the input text is diacritized (tried with the input "ูˆูŽุฅู†ู‘ูŽู…ุงุŒ").
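
For reference, a minimal reproduction sketch, assuming the tokenizer can be loaded through Hugging Face transformers' `AutoTokenizer` (if this repo actually uses a different pipeline, the same check applies with that library's tokenizer). The checkpoint name below is a placeholder, not the actual model:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint -- substitute the tokenizer this discussion refers to.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-arabic-model")

text = "ูˆูŽุฅู†ู‘ูŽู…ุงุŒ"  # diacritized input containing a SHADDA (U+0651)

tokens = tokenizer.tokenize(text)
print(tokens)                                   # inspect how the SHADDA is split or dropped
print(tokenizer.convert_tokens_to_ids(tokens))  # check whether any piece maps to the unknown id
print(repr(tokenizer.unk_token))                # the tokenizer's unknown token, if defined
```

Comparing the output on the diacritized string against the same string with diacritics stripped should show where the SHADDA and the multi-word token handling diverge.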
