Duplicate tokens
#15 opened by noobhappylife
While taking a closer look at the tokenizer.json, I noticed that the following added_tokens entries share the same content:
{
  "id": 128268,
  "content": "<|reserved_special_token_262|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
{
  "id": 128269,
  "content": "<|reserved_special_token_262|>",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
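For reference, a quick way to confirm the duplication is to count the added-token contents directly. A minimal sketch, assuming tokenizer.json sits in the current directory:

```python
# Count how often each added-token content string appears in tokenizer.json.
import json
from collections import Counter

with open("tokenizer.json") as f:
    added = json.load(f)["added_tokens"]

counts = Counter(tok["content"] for tok in added)
print({content: n for content, n in counts.items() if n > 1})
# expected here: {'<|reserved_special_token_262|>': 2}
```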
:o hmm
I think this may affect correctness if the user loads this tokenizer.json with https://github.com/huggingface/tokenizers, which would ignore the second appearance and shift tokens >= 128269 forward, shrinking the vocab size by one. So when decoding, the output may not match what the model actually means. I am guessing that regenerating tokenizer.json should fix it, since the duplication is not present in tokenizer_config.json.
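To see how the tokenizers library actually resolves the duplicate, one can probe the two ids directly. A rough sketch, assuming the library loads the file with the duplicate still present:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# If the second occurrence is dropped, one of these id lookups will no
# longer round-trip, and the reported vocab size will be one smaller
# than the number of ids declared in the file.
print(tok.id_to_token(128268))
print(tok.id_to_token(128269))
print(tok.token_to_id("<|reserved_special_token_262|>"))
print(tok.get_vocab_size(with_added_tokens=True))
```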
Try now