[apply_chat_template] add_generation_prompt does not work as expected.
- add_generation_prompt=False has the same result as add_generation_prompt=True.
- It's interesting to see that <|im_end|> is added as a special token while <|im_start|> is not...
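A minimal sketch to reproduce the first point (checkpoint name and library versions as in the transcript further down; the exact output depends on the tokenizer config that ships with the model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")
messages = [{"role": "user", "content": "Who are you?"}]

with_flag = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_flag = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

# Reported behavior: both rendered prompts are identical, i.e. the flag is a no-op.
print(with_flag == without_flag)
```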
- You can find that `add_generation_prompt` is not present in the chat_template of the tokenizer_config file, which is different from v1, so this parameter has no effect. I changed the quickstart code and deleted it.
- You can use `tokenizer.decode(input_ids[0])` to decode and observe the prompt after `apply_chat_template`, which has im_start. For model generation, only `<|im_end|>` is a special token.
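A small sketch of that check (same checkpoint as in the transcript below; `chat_template` may be None for other models):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# If the Jinja template never references add_generation_prompt,
# flipping the flag cannot change the rendered prompt.
print("add_generation_prompt" in (tok.chat_template or ""))

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tok.apply_chat_template(messages, return_tensors="pt")
# Decode to inspect the rendered prompt, including the <|im_start|>/<|im_end|> markers.
print(tok.decode(input_ids[0]))
```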
About 2., that's what I did yesterday. But I used `tokenizer.convert_ids_to_tokens(input_ids)` to see the tokens (in text space).
`<|im_end|>` is a standalone token while `<|im_start|>` is not.
As long as this behavior is how the model was trained during the alignment stage, this is not an issue. I just want to confirm that it's an intended result.
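For reference, here is a quick way to confirm which chat markers the tokenizer treats as standalone special tokens (a sketch using the same checkpoint as below):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# Markers registered as added/special tokens are listed here.
print(tok.additional_special_tokens)

# A marker that is a single added token tokenizes to itself;
# one that is not gets split into ordinary sub-word pieces.
print(tok.tokenize("<|im_end|>"))
print(tok.tokenize("<|im_start|>"))
```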
Here are my results from the latest transformers v4.40.2 & tokenizers v0.19.1:
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
>>> messages = [
... {"role": "user", "content": "Who are you?"},
... {"role": "assistant", "content": "I am Yi."}
... ]
>>> out = tok.apply_chat_template(messages, tokenize=True, return_dict=True)
>>> tok.convert_ids_to_tokens(out['input_ids'])
['▁<', '|', 'im', '_', 'start', '|>', 'user', '\n', 'Who', '▁are', '▁you', '?', '<|im_end|>', '▁', '\n', '<', '|', 'im', '_', 'start', '|>', 'ass', 'istant', '\n', 'I', '▁am', '▁Y', 'i', '.', '<|im_end|>', '▁', '\n']
Thank you for pointing out that this is not a desired result. I have no idea how to get the desired result yet :(
For `AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")`, set `use_fast=False`; we do not support the fast tokenizer, just like llama.
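A sketch of the suggested loading pattern (only the `use_fast=False` flag comes from the answer above; the resulting token sequence is not reproduced here):

```python
from transformers import AutoTokenizer

# Force the slow (SentencePiece-based) tokenizer, as recommended above;
# the fast tokenizer is not supported for this model.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat", use_fast=False)

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Yi."},
]
out = tok.apply_chat_template(messages, tokenize=True, return_dict=True)
print(tok.convert_ids_to_tokens(out["input_ids"]))
```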
Oh, I saw the example in your official README. Thank you for your help!
(So many pitfalls when working with various LLMs on the HF Hub.) Is there any way for your team to set `use_fast=False` explicitly in tokenizer_config.json?
An off-topic question:
Is there any way to check whether a fast tokenizer is supported for a specific model?
What I had in mind was: if I can find tokenization_gemma_fast.py in the transformers library, that means gemma supports a fast tokenizer.
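One rough programmatic alternative (a sketch: it reports what transformers actually instantiated, not whether the model authors endorse the fast implementation):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# True if a fast (Rust-backed) tokenizer was instantiated by default.
print(tok.is_fast)
# Class names ending in "Fast" also indicate a fast implementation.
print(type(tok).__name__)
```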