[apply_chat_template] add_generation_prompt does not work as expected.
- add_generation_prompt=False has the same result as add_generation_prompt=True.
- It's interesting to see that <|im_end|> is added as a special token while <|im_start|> is not...
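A minimal sketch to reproduce the first point (checkpoint name and library versions as in the transcript further down; the exact output depends on the tokenizer config that ships with the model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")
messages = [{"role": "user", "content": "Who are you?"}]

with_flag = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_flag = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

# Reported behavior: both rendered prompts are identical, i.e. the flag is a no-op.
print(with_flag == without_flag)
```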
- You can find that `add_generation_prompt` is not present in the chat_template of the tokenizer_config file, which is different from v1, so this parameter has no effect. I changed the quickstart code and deleted it.
- You can use `tokenizer.decode(input_ids[0])` to decode and observe the prompt after `apply_chat_template`, which has im_start. For model generation, only `<|im_end|>` is a special token.
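A small sketch of that check (same checkpoint as in the transcript below; `chat_template` may be None for other models):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# If the Jinja template never references add_generation_prompt,
# flipping the flag cannot change the rendered prompt.
print("add_generation_prompt" in (tok.chat_template or ""))

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tok.apply_chat_template(messages, return_tensors="pt")
# Decode to inspect the rendered prompt, including the <|im_start|>/<|im_end|> markers.
print(tok.decode(input_ids[0]))
```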
About 2., that's what I did yesterday. But I used `tokenizer.convert_ids_to_tokens(input_ids)` to see the tokens (in text space).
`<|im_end|>` is a standalone token while `<|im_start|>` is not.
As long as this behavior is how the model was trained during the alignment stage, this is not an issue. I just want to confirm that it's an intended result.
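For reference, here is a quick way to confirm which chat markers the tokenizer treats as standalone special tokens (a sketch using the same checkpoint as below):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# Markers registered as added/special tokens are listed here.
print(tok.additional_special_tokens)

# A marker that is a single added token tokenizes to itself;
# one that is not gets split into ordinary sub-word pieces.
print(tok.tokenize("<|im_end|>"))
print(tok.tokenize("<|im_start|>"))
```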
Here are my results from the latest transformers v4.40.2 & tokenizers v0.19.1:
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
>>> messages = [
... {"role": "user", "content": "Who are you?"},
... {"role": "assistant", "content": "I am Yi."}
... ]
>>> out = tok.apply_chat_template(messages, tokenize=True, return_dict=True)
>>> tok.convert_ids_to_tokens(out['input_ids'])
['▁<', '|', 'im', '_', 'start', '|>', 'user', '\n', 'Who', '▁are', '▁you', '?', '<|im_end|>', '▁', '\n', '<', '|', 'im', '_', 'start', '|>', 'ass', 'istant', '\n', 'I', '▁am', '▁Y', 'i', '.', '<|im_end|>', '▁', '\n']
Thank you for pointing out that this is not a desired result. I have no idea how to get the desired result yet :(
For `AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-chat")`, set `use_fast=False`; we do not support the fast tokenizer, just like llama.
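A sketch of the suggested loading pattern (only the `use_fast=False` flag comes from the answer above; the resulting token sequence is not reproduced here):

```python
from transformers import AutoTokenizer

# Force the slow (SentencePiece-based) tokenizer, as recommended above;
# the fast tokenizer is not supported for this model.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat", use_fast=False)

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Yi."},
]
out = tok.apply_chat_template(messages, tokenize=True, return_dict=True)
print(tok.convert_ids_to_tokens(out["input_ids"]))
```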
Oh, I saw the example in your official README. Thank you for your help!
(So many pitfalls when working with various LLMs on the HF Hub.) Is there any way for your team to set `use_fast=False` explicitly in tokenizer_config.json?
An off-topic question:
Is there any way to check whether a fast tokenizer is supported for a specific model?
What I had in mind was: if I can find tokenization_gemma_fast.py in the transformers library, that means gemma supports a fast tokenizer.
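One rough programmatic alternative (a sketch: it reports what transformers actually instantiated, not whether the model authors endorse the fast implementation):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat")

# True if a fast (Rust-backed) tokenizer was instantiated by default.
print(tok.is_fast)
# Class names ending in "Fast" also indicate a fast implementation.
print(type(tok).__name__)
```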