why do the im_start and im_end token ids exceed the tokenizer vocab_size?
#36
by muziyongshixin - opened
I found that the <|im_start|> token id is 100278, which is bigger than the tokenizer's vocab_size. I also tried to add the special token: the call returns 1, which means a valid token was added to the tokenizer, but the vocab_size still stays the same. Is there anything wrong?
The tiktoken and transformers versions are tiktoken-0.6.0 and transformers-4.39.2.
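For reference, here is a minimal sketch of how I inspected where the extra ids come from. It assumes the custom (trust_remote_code) tokenizer still exposes the standard added_tokens_decoder table from the transformers base class, and `path` is just a placeholder for the model repo or local directory:

from transformers import AutoTokenizer

path = "..."  # placeholder for the model repo or local path
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)

# Base vocabulary only; added/special tokens are not counted here.
print(tokenizer.vocab_size)

# Added tokens (including the chat special tokens) are kept in a separate
# id -> AddedToken table, so their ids can sit above vocab_size.
for token_id, token in sorted(tokenizer.added_tokens_decoder.items()):
    print(token_id, token.content)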
Here is the script I tried:
In [3]: tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
In [4]: len(tokenizer)
Out[4]: 100277
In [5]: tokenizer.encode('<|im_end|>',add_special_tokens=False)
Out[5]: [100279]
In [6]: tokenizer.encode('<|im_start|>',add_special_tokens=False)
Out[6]: [100278]
In [7]: tokenizer.special_tokens_map_extended
Out[7]:
{'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'pad_token': AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
In [8]: tokenizer.add_special_tokens({"additional_special_tokens":["<|im_start|>"]})
Out[8]: 1
In [9]: tokenizer.add_special_tokens({"additional_special_tokens":["<|im_end|>"]})
Out[9]: 1
In [10]: len(tokenizer)
Out[10]: 100277
In [11]: tokenizer.vocab_size
Out[11]: 100277
In [12]: tokenizer.chat_template
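In case it is useful, a hedged sketch of how one might check whether the model's embedding matrix already covers those ids, assuming the model config exposes a standard vocab_size field (`path` and `tokenizer` are the same as above):

from transformers import AutoConfig

# If the config's vocab_size is larger than the highest special-token id,
# the embedding matrix already has rows for <|im_start|> and <|im_end|>,
# even though tokenizer.vocab_size does not count them.
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
special_ids = tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"])
print(config.vocab_size, special_ids)
assert max(special_ids) < config.vocab_size, "embeddings do not cover the chat special tokens"

If that assertion fails, resizing with model.resize_token_embeddings would presumably need the highest special-token id + 1 rather than len(tokenizer), since len(tokenizer) is still 100277 here.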
hanlintang changed discussion status to closed