Tokenizer overflow caused by problematic model_max_length
#16 opened by wanghaofan
To reproduce the issue, run the following lines:
from transformers import BertTokenizer
pretrained_model_name_or_path = "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese"
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
captions = ["一只猫", "测试"]  # "a cat", "test"
inputs = tokenizer(captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
With padding set to "max_length", this raises an overflow error. If you print tokenizer.model_max_length, you will see an absurdly large value (transformers falls back to int(1e30) when no maximum length is configured), which looks like a misconfiguration of this model. As a workaround, set tokenizer.model_max_length to a small number such as 77 before tokenizing.
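A minimal sketch of that workaround (77 here is an assumption borrowed from the text context length commonly used with CLIP; any reasonable limit works):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese")
# Override the broken default before padding/truncating to max_length.
tokenizer.model_max_length = 77
inputs = tokenizer(["一只猫", "测试"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
print(len(inputs["input_ids"][0]))  # 77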
We did not set model_max_length in the tokenizer config of "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese", so by default it falls back to a very large number. We fixed this when training Stable Diffusion and manually set it to 512. You can try this:
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", subfolder="tokenizer")
print(tokenizer.model_max_length)  # 512
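With that tokenizer, the original snippet runs without overflow; a quick sanity check (a sketch reusing the captions from above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", subfolder="tokenizer")
inputs = tokenizer(["一只猫", "测试"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
print(len(inputs["input_ids"][0]))  # 512, matching the configured model_max_length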
wuxiaojun changed discussion status to closed