Incorrect chat template?

#5
by bartowski - opened

None of the tokens used in the configured or suggested chat templates are defined in tokenizer_config.json, and therefore in llama.cpp they don't get tokenized correctly. It looks like the rest of the tokenizer is the default Mistral one, expecting [INST] [/INST] tokens. Any idea why this was done?

+1, very likely that the tokenizer is just plain wrong and the model is currently not usable for chat

It would be nice to have an end-of-turn token too, like how ChatML has <|im_end|> or Llama 3 has <|eot_id|>

NVIDIA org

which version are you using? HF or .nemo?

We're referring to the tokenizer_config.json in the HF version. The problems are:

  1. The <extra_id_0> and <extra_id_1> tokens are used in the chat template, but they don't exist in the tokenizer's vocabulary. I believe they were mistakenly taken from the T5 tokenizer (see the quick check after this list).
  2. Without an end-of-turn token, it's not very usable for chat.
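
For reference, here's a quick way to verify point 1 (a sketch; the repo id below is my assumption, adjust if the model is hosted under a different name):

from transformers import AutoTokenizer

# Assumption: this is the HF repo id of this model; replace it if different.
tok = AutoTokenizer.from_pretrained("nvidia/Mistral-NeMo-Minitron-8B-Instruct")

# If <extra_id_1> were a real special token, this would print a single piece and a single id.
# With the current tokenizer it splits into several sub-word pieces instead.
pieces = tok.tokenize("<extra_id_1>")
print(pieces)
print(tok.convert_tokens_to_ids(pieces))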

I understand that this model is fine-tuned from nvidia/Mistral-NeMo-Minitron-8B-Base, but I believe something went wrong with the chat template (or the tokenizer) during fine-tuning.

I'm wondering, could you re-do the fine-tuning with the chat template from mistralai/Mistral-Nemo-Instruct-2407? I'm sure it would be much easier for the community to use an existing chat template rather than inventing a new one specific to this model.

Here is an example of Mistral's template:

[INST]{prompt}[/INST]{response}</s>[INST]{prompt}[/INST]{response}</s>
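
Rendering a conversation with that template through the HF tokenizer would look roughly like this (a sketch; it assumes you have access to the mistralai repo, which may require accepting its license on HF):

from transformers import AutoTokenizer

# Assumption: access to the gated mistralai/Mistral-Nemo-Instruct-2407 repo.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
]
# Renders to [INST]...[/INST]...</s>[INST]...[/INST], so engines can stop on the single </s> token.
print(tok.apply_chat_template(messages, tokenize=False))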

Thank you.

NVIDIA org

Hi all,

The chat template is correct. The model has been fine-tuned using the prompt template in the model card.

The original tiktoken tokenizer doesn’t have <extra_id_0> or <extra_id_1> as vocabulary words, and we keep the same tokenizer as is. As a result, these "special tokens" are tokenized into multiple tokens, and thus cannot be added as special tokens in tokenizer_config.json.

For HF, stop_strings is a solution; it can be passed to the generate() function, as shown in the model card. Major inference engines should support the same feature. For example, vLLM has a stop field where you can add arbitrary strings as stop sequences.
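
For example, a minimal HF sketch (the repo id and the stop string below are assumptions based on this thread; the model card snippet is authoritative):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nvidia/Mistral-NeMo-Minitron-8B-Instruct"  # assumption: adjust to the actual repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# generate() needs the tokenizer so it can match stop_strings against the decoded text.
out = model.generate(inputs, max_new_tokens=128, stop_strings=["<extra_id_1>"], tokenizer=tok)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

With vLLM, the equivalent would be SamplingParams(stop=["<extra_id_1>"]).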

We are discussing options to address this. Meanwhile, please use these workarounds for this model. We appreciate your feedback.

@suhara Thanks for the clarification.

My POV is that all instruct/chat models nowadays use special tokens like BOS or EOS to mark the beginning and end of a turn. This has many benefits:

  • It reduces the number of tokens to be processed
  • It improves the model's performance, because these start-of-turn and end-of-turn tokens have their own embeddings.
    This is especially useful when the temperature is high. For example, the model may hallucinate and make up a non-existent token like <extra_id_use_tool> at high temperature.
  • It is simpler to implement in downstream projects (llama.cpp, for example), because we only need to check for a single token (see the sketch after this list).
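
To illustrate the last point, here is a rough sketch (plain Python, not actual llama.cpp code) of the difference between the two stopping checks:

# With a dedicated end-of-turn token, stopping is one integer comparison per step.
def should_stop_token(new_token_id: int, eot_id: int) -> bool:
    return new_token_id == eot_id

# With a multi-token stop string, the engine has to detokenize the tail of the
# output every step and search for the string, which may span several tokens.
def should_stop_string(token_ids: list, detokenize, stop: str) -> bool:
    tail = detokenize(token_ids[-8:])  # the 8-token window is arbitrary for this sketch
    return stop in tail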

To give you some more options, I'd recommend picking one of the templates already supported by llama.cpp.

If you plan to re-use the same template for multiple models, I would recommend picking the one used by mlabonne/AlphaMonarch-7B, as it is very simple to implement and does NOT require adding any new tokens. Instead, it relies entirely on BOS and EOS:

<s>system
{system_prompt}</s>
<s>user
{prompt}</s>
<s>assistant
{response}</s>
<s>user
{prompt}</s>
<s>assistant
{response}</s>
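
For reference, a self-contained sketch of how that format maps to a Jinja template (an approximation, not AlphaMonarch's exact template):

from jinja2 import Template

# Each turn is BOS + role + newline + content + EOS; the trailing "<s>assistant\n" asks for a reply.
tmpl = Template(
    "{% for m in messages %}{{ bos }}{{ m.role }}\n{{ m.content }}{{ eos }}\n{% endfor %}"
    "{{ bos }}assistant\n"
)
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
]
print(tmpl.render(messages=messages, bos="<s>", eos="</s>"))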

In any case, I'd appreciate it if you could re-train this model to use a more common chat template. I'm ready to help if needed. Thank you.
