No [PREFIX] and [SUFFIX] in tokenizer vocab
Hi, I was trying to use the FIM feature with no success. After playing with the tokenizer MistralTokenizer.v3(), I found that both the [PREFIX] and [SUFFIX] tokens point to <unk> (id 0):
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()
tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text
>>> '<s><unk>return▁a▁+▁b<unk>▁def▁f('
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[INST]')
>>> 3
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[PREFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[SUFFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer._vocab[:5]
>>> ['<unk>', '<s>', '</s>', '[INST]', '[/INST]']
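A simple membership check against that same private _vocab list confirms the tokens are missing entirely (diagnostic sketch only - _vocab is a private attribute and may change between versions):

# [INST] is a real vocab entry, but [PREFIX] and [SUFFIX] are absent, which
# is consistent with get_control_token returning id 0 (<unk>) above.
vocab = tokenizer.instruct_tokenizer.tokenizer._vocab
print('[INST]' in vocab)    # True
print('[PREFIX]' in vocab)  # False
print('[SUFFIX]' in vocab)  # False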
I found this test in the mistral/mistral-common repository:
from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common_private.tokens.tokenizers.mistral import MistralTokenizer
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("
There must be a private tokenizer related to mistral_common_private 🤔. So does the public tokenizer have no way to do FIM?
Great catch @Vokturz! We rushed that code in mistral/mistral-common a bit too much yesterday - it's indeed wrong! The tokenizer will need to be updated as well - bear with me, it should be done in ~30 min!
If you just post-process the generated text as shown here: https://huggingface.co/mistralai/Codestral-22B-v0.1#fill-in-the-middle-fim, it shouldn't have made a difference, but it's indeed better to have the correct tokens set for [SUFFIX] and [PREFIX].
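For reference, the end-to-end FIM flow from that model card looks roughly like this (a sketch, not verbatim: Transformer and generate come from mistral_inference, and the model path and max_tokens here are placeholders):

# Sketch of the model-card FIM flow. The key part is the last line: the
# decoded output is cut at the suffix, so the exact [PREFIX]/[SUFFIX] ids in
# the prompt don't change the post-processing.
from mistral_inference.model import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()
model = Transformer.from_folder("./Codestral-22B-v0.1")  # placeholder path

prefix = "def f("
suffix = "return a + b"
tokens = tokenizer.encode_fim(FIMRequest(prompt=prefix, suffix=suffix)).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=256, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
# Post-processing: keep only the generated middle.
middle = tokenizer.decode(out_tokens[0]).split(suffix)[0].strip()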
Using the provided code:
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()
print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text)
print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).tokens)
prints
'<s><unk>return▁a▁+▁b<unk>▁def▁f('
[1, 0, 1575, 1032, 1416, 1055, 0, 1569, 1053, 29500]
By the looks of it, even the encoding is not setting the right tokens.
We just did a patch release, mistral-common==1.2.1: https://github.com/mistralai/mistral-common/releases/tag/v1.2.1, and uploaded a new tokenizer: https://huggingface.co/mistralai/Codestral-22B-v0.1/commit/c5f230adeebf56c7ff3bf8620a3678fbdc393516
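After pip install -U mistral-common==1.2.1, the repro from above should now hit the real control tokens - a quick sanity check (the expected text is the one from the repository test quoted earlier):

# Sanity check on mistral-common 1.2.1: [PREFIX]/[SUFFIX] should no longer
# resolve to <unk> (id 0), and encode_fim should emit them in the text.
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()
assert tokenizer.instruct_tokenizer.tokenizer.get_control_token("[PREFIX]") != 0
assert tokenizer.instruct_tokenizer.tokenizer.get_control_token("[SUFFIX]") != 0

tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("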
Even after the upload of the new tokenizer, is there any reason I am getting the following output when I download the latest HF commit?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
>>> tokenizer.convert_tokens_to_ids("[SUFFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[PREFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[INST]")
3
Because they seem to be using their own tokenizer format (tokenizer.model.v3) rather than the HF format (tokenizer.json, etc.). Why? I dunno... seems strange, maybe to push people to use their code and become more dependent on Mistral...
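fwiw, if you load their tokenizer.model.v3 with mistral_common instead of AutoTokenizer, you should pick up the fixed ids (assuming your mistral-common version has MistralTokenizer.from_file and you downloaded the file from the updated repo):

# Load the updated tokenizer.model.v3 directly with mistral_common; the
# tokenizer.json that AutoTokenizer reads is a separate, unpatched file.
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_file("tokenizer.model.v3")
print(tokenizer.instruct_tokenizer.tokenizer.get_control_token("[PREFIX]"))
print(tokenizer.instruct_tokenizer.tokenizer.get_control_token("[SUFFIX]"))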