Duplicate tokens
#15 opened by noobhappylife
While taking a closer look at the tokenizer.json, I noticed that the following added_tokens entries share the same content:
{
  "id": 128268,
  "content": "<|reserved_special_token_262|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
{
  "id": 128269,
  "content": "<|reserved_special_token_262|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
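For reference, a quick way to confirm the duplication is to count the added-token contents directly. A minimal sketch, assuming tokenizer.json sits in the current directory:

```python
# Count how often each added-token content string appears in tokenizer.json.
import json
from collections import Counter

with open("tokenizer.json") as f:
    added = json.load(f)["added_tokens"]

counts = Counter(tok["content"] for tok in added)
print({content: n for content, n in counts.items() if n > 1})
# expected here: {'<|reserved_special_token_262|>': 2}
```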
:o hmm
I think this may affect correctness if the user loads this tokenizer.json with https://github.com/huggingface/tokenizers, which would ignore the second appearance and shift tokens >= 128269 forward, shrinking the vocab size by one. So when decoding, the output may not match what the model actually means. I am guessing that regenerating tokenizer.json should fix it, since the duplication is not present in tokenizer_config.json.
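To see how the tokenizers library actually resolves the duplicate, one can probe the two ids directly. A rough sketch, assuming the library loads the file with the duplicate still present:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# If the second occurrence is dropped, one of these id lookups will no
# longer round-trip, and the reported vocab size will be one smaller
# than the number of ids declared in the file.
print(tok.id_to_token(128268))
print(tok.id_to_token(128269))
print(tok.token_to_id("<|reserved_special_token_262|>"))
print(tok.get_vocab_size(with_added_tokens=True))
```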
Try now