[apply_chat_template] add_generation_prompt does not work as expected.

#6
by skyshine102 - opened
  1. add_generation_prompt=False gives the same result as add_generation_prompt=True.
  2. It's interesting that <|im_end|> is added as a special token while <|im_start|> is not...
  1. You can find that add_generation_prompt is not present in the chat_template of the tokenizer_config file, which is different from v1, so this parameter has no effect. I changed the quickstart code and removed it.
  2. You can use tokenizer.decode(input_ids[0]) to decode and inspect the prompt after apply_chat_template; it contains <|im_start|>. For model generation, only <|im_end|> matters.
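A quick way to verify both points yourself (a minimal sketch; the model id and message content are just the ones from this thread, and I have not re-run this exact snippet):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")
messages = [{"role": "user", "content": "Who are you?"}]

# Render the prompt both ways; if the chat_template ignores
# add_generation_prompt, the two strings are identical.
with_gen = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_gen = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(with_gen == without_gen)

# Decode the tokenized prompt to see exactly what the model receives.
ids = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(tok.decode(ids))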

About 2., that's what I did yesterday. But I used tokenizer.convert_ids_to_tokens(input_ids) to look at the tokens (in text space).
<|im_end|> is a standalone token while <|im_start|> is not.
As long as this is how the model was trained during the alignment stage, this is not an issue. I just want to confirm that it's an intended result.

No, <|im_start|> is a standalone token too. The test results are as follows:
image.png
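One way to check this directly (a minimal sketch; the model id is the one discussed in this thread): confirm the marker exists in the vocabulary as a single entry and see whether the tokenizer keeps it whole when encoding.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")

# 1) Is the marker present in the vocabulary as a single entry?
print("<|im_start|>" in tok.get_vocab())

# 2) Does the tokenizer keep it whole when encoding text?
ids = tok.encode("<|im_start|>user\n", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))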

Here are my results with the latest transformers v4.40.2 & tokenizers v0.19.1:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
>>> messages = [
...     {"role": "user", "content": "Who are you?"},
...     {"role": "assistant", "content": "I am Yi."}
... ]
>>> out = tok.apply_chat_template(messages, tokenize=True, return_dict=True)
>>> tok.convert_ids_to_tokens(out['input_ids'])
['▁<', '|', 'im', '_', 'start', '|>', 'user', '\n', 'Who', '▁are', '▁you', '?', '<|im_end|>', '▁', '\n', '<', '|', 'im', '_', 'start', '|>', 'ass', 'istant', '\n', 'I', '▁am', '▁Y', 'i', '.', '<|im_end|>', '▁', '\n']

Thank you for pointing out that this is not the desired result. I have no idea how to get the desired result yet :(

For AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat"), set use_fast=False; we do not support the fast tokenizer, just like LLaMA.
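Concretely (a minimal sketch reusing the messages from the example above; I have not re-run this exact snippet, so the printed tokens are not shown here):

from transformers import AutoTokenizer

# Force the slow (SentencePiece-based) tokenizer so <|im_start|> is kept as
# a single token.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat", use_fast=False)

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Yi."},
]
ids = tok.apply_chat_template(messages, tokenize=True)
print(tok.convert_ids_to_tokens(ids))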

Oh, I saw the example in your official README. Thank you for your help!
(So many pitfalls when working with various LLMs on the HF Hub.) Is there any way for your team to set use_fast=False explicitly in tokenizer_config.json?


An off-topic question:
Is there any way to check whether a fast tokenizer is supported for a specific model?
What I had in mind: if I can find tokenization_gemma_fast.py in the transformers library, that means Gemma supports a fast tokenizer. An example of what I would try is sketched below.
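For example, something like this (a sketch; the Gemma checkpoint is just an illustration from my question and may require access approval):

from transformers import AutoTokenizer

# Load without forcing use_fast and inspect what was actually returned.
tok = AutoTokenizer.from_pretrained("google/gemma-7b")
print(type(tok).__name__)  # a *TokenizerFast class name if a fast tokenizer exists
print(tok.is_fast)         # True if the backing implementation is a Rust (fast) tokenizer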
