Tokenizer vocabulary
#28, opened by DjTobalito
Hi,
I am using XLM-RoBERTa for multilingual classification with success, and I am trying to understand the tokenizer a bit better.
Naively, I expected short, common words from the languages present in the dataset to appear in the `tokenizer.vocab` dictionary.
But for French, for example, the word "oui" ("yes" in French) does not seem to be in the `tokenizer.vocab` dictionary.
Am I misunderstanding the `tokenizer.vocab` dictionary?
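For anyone reproducing the check: a possible cause, offered as a sketch rather than a confirmed diagnosis, is that XLM-RoBERTa's tokenizer is SentencePiece-based, and SentencePiece marks a token that begins a word with the "▁" character (U+2581). A bare lookup of "oui" can then fail even when the word-initial piece "▁oui" exists. The snippet below shows the idea on a toy dictionary. The entries, the ids, and the `lookup` helper are hypothetical and are not taken from the real vocabulary.

```python
# "▁" (U+2581) is the marker SentencePiece puts in front of a token
# that starts a new word.
WORD_START = "\u2581"

# Toy stand-in for tokenizer.vocab: token string -> id.
# Entries and ids are made up for illustration only.
toy_vocab = {
    "<s>": 0,
    "</s>": 2,
    WORD_START + "oui": 10,  # word-initial piece "▁oui"
    WORD_START + "non": 11,  # word-initial piece "▁non"
    "s": 12,                 # continuation piece, no marker
}


def lookup(vocab, word):
    """Report whether `word` is present bare and/or with the word-start marker."""
    return {
        "bare": word in vocab,
        "word_start": (WORD_START + word) in vocab,
    }


result = lookup(toy_vocab, "oui")
print(result)  # {'bare': False, 'word_start': True}
```

With the real tokenizer, the same two lookups can be run against `tokenizer.get_vocab()`, and `tokenizer.tokenize("oui")` shows the pieces the word is actually split into. This assumes the tokenizer was loaded through `transformers`, for example with `AutoTokenizer.from_pretrained("xlm-roberta-base")`. The checkpoint name is an assumption, since the post does not say which one is used.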