Vocabulary difference between `tekken.json` and `tokenizer.json`
#20 by NXz64Fdf8Y - opened
Hi,
We've observed a vocabulary difference between `tekken.json` and `tokenizer.json`. When using the Hugging Face tokenizer with `tokenizer.json`, the tokenization result appears to match that of `tokenizer = MistralTokenizer.v3(is_tekken=True)` from the Mistral AI codebase (https://github.com/mistralai).
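For reference, here is a minimal sketch of the kind of side-by-side comparison described above. It is not the exact script we ran: `find_mismatches`, `toy_bytes`, and `toy_codepoints` are hypothetical names used only for illustration, and the two toy encoders merely stand in for the real tokenizers. In practice each encoder would wrap a real call, for example `tokenizers.Tokenizer.from_file("tokenizer.json")` on the Hugging Face side and the corresponding encode call on the `MistralTokenizer` object on the other.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder is any function that maps text to a list of token ids.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    encode_a: Encoder,
    encode_b: Encoder,
    samples: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for each sample where the encoders disagree."""
    mismatches = []
    for text in samples:
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


# Toy stand-ins (assumptions for illustration only, NOT real tokenizers):
# one encodes text as UTF-8 byte values, the other as Unicode code points.
def toy_bytes(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_codepoints(text: str) -> List[int]:
    return [ord(ch) for ch in text]


# The two toy encoders agree on ASCII input and differ on non-ASCII input.
for text, ids_a, ids_b in find_mismatches(toy_bytes, toy_codepoints, ["hello", "héllo"]):
    print(repr(text), ids_a, ids_b)
```

Running the same function over a varied set of samples (non-ASCII text, whitespace runs, special tokens) would show whether two tokenizers agree only on simple inputs or across the board.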
Could you please clarify the purpose of `tekken.json` and explain when and how it is used? Understanding how it differs from `tokenizer.json` will help us use the tokenizer correctly.
Thank you very much and kind regards,