strange tokens output
#2
by Geo
Hi,
When I use the model's tokenizer to tokenize a Greek sentence:
tokenizer = load_tokenizer(model_path)
tokenizer.tokenize("Ποιο τρίγωνο λέγεται αμβλυγώνιο?")
I get
['Î',
'ł',
'ο',
'ιο',
'ĠÏĦÏģίγÏīνο',
'ĠλÎŃγεÏĦαι',
'Ġαμβ',
'λÏħ',
'γÏİν',
'ιο',
'?']
Is this normal? Shouldn't I see tokens or sub-word tokens in Greek?
Also, when I open the vocabulary file, I don't see any Greek words.
I want to fine-tune your model for text generation in Greek.
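If I understand byte-level BPE correctly, the odd characters might just be the tokenizer's byte-to-unicode mapping rather than corruption: each UTF-8 byte is rendered as a single visible character, so multi-byte Greek letters show up as Latin-looking glyphs. Here is a sketch of what I think is happening, assuming a GPT-2-style byte-level tokenizer (this is my own reconstruction of the mapping, not your model's actual code):

```python
# Reconstruction of the GPT-2-style byte-to-unicode table used by
# byte-level BPE tokenizers (an assumption about how this model works).
def bytes_to_unicode():
    # Printable bytes keep their own code point; the remaining bytes
    # are shifted to code points >= 256 so every byte gets a visible glyph.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), 256)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_map = bytes_to_unicode()
# 'Π' is two UTF-8 bytes, 0xCE 0xA0, which this mapping renders
# as 'Î' and 'ł' -- exactly the first two tokens in my output.
print("".join(byte_map[b] for b in "Π".encode("utf-8")))  # → Îł
```

If that is what's going on, the Greek text would still round-trip correctly through encode/decode, and the vocabulary would store Greek only in this byte-mapped form.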