Stop token is missing in tokenizer vocab

#3 opened by armin-cpl

The "Prompt Format" section of the model card includes the following instruction:
We recommend using <extra_id_1> as a stop token.

However, the tokenizer vocab does not contain a single token for <extra_id_1>; the string tokenizes to multiple tokens: [1060, 37600, 3384, 1095, 1049, 1062]. This breaks chat usage with the documented prompt template, since <extra_id_1> cannot be set as a stop token id.
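This is easy to reproduce with the tokenizer alone (a minimal sketch; the checkpoint name is a placeholder for whichever model you loaded):

from transformers import AutoTokenizer

# Placeholder checkpoint id -- substitute the actual model name.
tokenizer = AutoTokenizer.from_pretrained("model-name-here")

ids = tokenizer.encode("<extra_id_1>", add_special_tokens=False)
print(ids)  # [1060, 37600, 3384, 1095, 1049, 1062] -- six tokens, not one
print("<extra_id_1>" in tokenizer.get_vocab())  # False: no single-token entry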

NVIDIA org

Please use stop_strings as in the examples in the model card. For example,

outputs = model.generate(tokenized_chat, stop_strings=["<extra_id_1>"], tokenizer=tokenizer)
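A fuller sketch of that flow (assumes model and tokenizer are already loaded; the message content is illustrative):

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
tokenized_chat = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# stop_strings matches on decoded text, so <extra_id_1> does not have to be
# a single token in the vocab; the tokenizer must be passed so generate()
# can decode the output and check for the stop string.
outputs = model.generate(
    tokenized_chat,
    max_new_tokens=256,
    stop_strings=["<extra_id_1>"],
    tokenizer=tokenizer,
)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))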

nvidia dgaf about open standards and never have. Passive-aggressively using their own snowflake tokens and not even including them in the tokenizer definitions fits right in line with how they do business.
