why do the im_start and im_end token ids exceed the tokenizer vocab_size?
#36
by muziyongshixin - opened
I found that the <|im_start|> token id is 100278, which is bigger than the tokenizer's vocab_size. I also tried to add the special token: the call returns 1, which means a valid token was added to the tokenizer, but the vocab_size still stays the same. Is there anything wrong?
The tiktoken and transformers versions are tiktoken-0.6.0 and transformers-4.39.2.
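For reference, here is a minimal sketch of how I inspected where the extra ids come from. It assumes the custom (trust_remote_code) tokenizer still exposes the standard added_tokens_decoder table from the transformers base class, and `path` is just a placeholder for the model repo or local directory:

from transformers import AutoTokenizer

path = "..."  # placeholder for the model repo or local path
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)

# Base vocabulary only; added/special tokens are not counted here.
print(tokenizer.vocab_size)

# Added tokens (including the chat special tokens) are kept in a separate
# id -> AddedToken table, so their ids can sit above vocab_size.
for token_id, token in sorted(tokenizer.added_tokens_decoder.items()):
    print(token_id, token.content)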
Here is the script I tried:
In [3]: tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True, trust_remote_code=True)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
In [4]: len(tokenizer)
Out[4]: 100277
In [5]: tokenizer.encode('<|im_end|>',add_special_tokens=False)
Out[5]: [100279]
In [6]: tokenizer.encode('<|im_start|>',add_special_tokens=False)
Out[6]: [100278]
In [7]: tokenizer.special_tokens_map_extended
Out[7]:
{'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'pad_token': AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
In [8]: tokenizer.add_special_tokens({"additional_special_tokens":["<|im_start|>"]})
Out[8]: 1
In [9]: tokenizer.add_special_tokens({"additional_special_tokens":["<|im_end|>"]})
Out[9]: 1
In [10]: len(tokenizer)
Out[10]: 100277
In [11]: tokenizer.vocab_size
Out[11]: 100277
In [12]: tokenizer.chat_template
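In case it is useful, a hedged sketch of how one might check whether the model's embedding matrix already covers those ids, assuming the model config exposes a standard vocab_size field (`path` and `tokenizer` are the same as above):

from transformers import AutoConfig

# If the config's vocab_size is larger than the highest special-token id,
# the embedding matrix already has rows for <|im_start|> and <|im_end|>,
# even though tokenizer.vocab_size does not count them.
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
special_ids = tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"])
print(config.vocab_size, special_ids)
assert max(special_ids) < config.vocab_size, "embeddings do not cover the chat special tokens"

If that assertion fails, resizing with model.resize_token_embeddings would presumably need the highest special-token id + 1 rather than len(tokenizer), since len(tokenizer) is still 100277 here.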
hanlintang changed discussion status to closed