Vocabulary difference between `tekken.json` and `tokenizer.json`

#20
by NXz64Fdf8Y - opened

Hi,

We've observed a vocabulary difference between `tekken.json` and `tokenizer.json`. When we load the Hugging Face tokenizer from `tokenizer.json`, its tokenization output appears to match that of `tokenizer = MistralTokenizer.v3(is_tekken=True)` from the Mistral AI codebase (https://github.com/mistralai).
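
A comparison of this kind can be sketched roughly as follows. This is only an illustration: it assumes each vocabulary has already been loaded into a plain token-to-id dict, and the function name `diff_vocab` and the toy entries are hypothetical, not taken from either file.

```python
def diff_vocab(vocab_a, vocab_b):
    """Compare two token -> id mappings (hypothetical helper).

    Returns the tokens found only in vocab_a, the tokens found only in
    vocab_b, and the shared tokens whose ids differ.
    """
    tokens_a, tokens_b = set(vocab_a), set(vocab_b)
    only_a = sorted(tokens_a - tokens_b)
    only_b = sorted(tokens_b - tokens_a)
    remapped = {
        tok: (vocab_a[tok], vocab_b[tok])
        for tok in sorted(tokens_a & tokens_b)
        if vocab_a[tok] != vocab_b[tok]
    }
    return only_a, only_b, remapped


# Toy vocabularies standing in for the two files (assumed data).
vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 2, "World": 1}

only_a, only_b, remapped = diff_vocab(vocab_a, vocab_b)
print("only in A:", only_a)
print("only in B:", only_b)
print("same token, different id:", remapped)
```

Separating missing tokens from shared tokens with different ids makes it easier to tell a genuinely different vocabulary from one that is merely stored in a different order.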

Could you please clarify the purpose of `tekken.json` and explain when and how it is used? Understanding how it differs from `tokenizer.json` will help us use the tokenizer correctly.

Thank you very much and kind regards,
