Tokenizer overflow caused by problematic model_max_length
#16 opened by wanghaofan
To reproduce the issue, run the following lines:
from transformers import BertTokenizer
pretrained_model_name_or_path = "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese"
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
captions = ["一只猫", "测试"]  # "a cat", "test"
inputs = tokenizer(captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
With padding set to "max_length", this raises an overflow error. If you print tokenizer.model_max_length, you will see an absurdly large value (transformers falls back to int(1e30) when no maximum length is configured), which looks like a misconfiguration of this model. As a workaround, set tokenizer.model_max_length to a small number such as 77 before tokenizing.
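A minimal sketch of that workaround (77 here is an assumption borrowed from the text context length commonly used with CLIP; any reasonable limit works):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese")
# Override the broken default before padding/truncating to max_length.
tokenizer.model_max_length = 77
inputs = tokenizer(["一只猫", "测试"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
print(len(inputs["input_ids"][0]))  # 77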
We did not set model_max_length in the tokenizer config of "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese", so by default it falls back to a very large number. We fixed this when training Stable Diffusion and manually set it to 512. You can try this:
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", subfolder="tokenizer")
print(tokenizer.model_max_length)  # 512
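With that tokenizer, the original snippet runs without overflow; a quick sanity check (a sketch reusing the captions from above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", subfolder="tokenizer")
inputs = tokenizer(["一只猫", "测试"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
print(len(inputs["input_ids"][0]))  # 512, matching the configured model_max_length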
wuxiaojun changed discussion status to closed