Stop token is missing in tokenizer vocab
The "Prompt Format" section of the model card includes the following instruction: "We recommend using <extra_id_1> as a stop token." However, the tokenizer vocab does not include a token for <extra_id_1>; the string tokenizes to six tokens, [1060, 37600, 3384, 1095, 1049, 1062]. This breaks usage of the model with the chat prompt template.
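This is easy to verify (a minimal sketch; the repo id is a placeholder for whichever checkpoint this card describes):

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the model this card describes.
tokenizer = AutoTokenizer.from_pretrained("nvidia/<model-name>")

# If "<extra_id_1>" were a real special token, this would be a single id.
print(tokenizer.encode("<extra_id_1>", add_special_tokens=False))
# -> [1060, 37600, 3384, 1095, 1049, 1062]  (six tokens, not one)

# And it is absent from the vocab as a single entry.
print("<extra_id_1>" in tokenizer.get_vocab())  # -> False
```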
Please use stop_strings instead, as in the examples in the model card. For example:
```python
outputs = model.generate(tokenized_chat, stop_strings=["<extra_id_1>"], tokenizer=tokenizer)
```
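For completeness, a minimal end-to-end sketch of the chat flow (the repo id and the message content are placeholders, not from the card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/<model-name>"  # placeholder; use the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]  # example chat turn
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# stop_strings matches on decoded text, so it works even though
# "<extra_id_1>" is not a single token in the vocab.
outputs = model.generate(
    tokenized_chat,
    stop_strings=["<extra_id_1>"],
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:]))
```

Note that generate requires the tokenizer kwarg whenever stop_strings is used, since the stopping criterion has to decode text during generation.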
NVIDIA has never cared about open standards. Using their own nonstandard tokens and not even including them in the tokenizer's vocabulary fits right in line with how they do business.