Potential problematic behavior of truncation and/or padding?
#5
by Alchan - opened
Hi, I'm trying to load your model with the Hugging Face tokenizers and transformers libraries and run some experiments with it. My tokenization task covers texts whose lengths range from very short sentences up to 8k, so I don't want any truncation or padding.
I noticed that tokenizer.json in this repo contains additional truncation and padding configurations. Is this intentional? If so, how can I turn this logic off?
"truncation": {
"direction": "Right",
"max_length": 512,
"strategy": "LongestFirst",
"stride": 0
},
"padding": {
"strategy": {
"Fixed": 512
},
"direction": "Right",
"pad_to_multiple_of": null,
"pad_id": 128001,
"pad_type_id": 0,
"pad_token": "<|end_of_text|>"
}
This is not intentional; we simply copy the tokenizer from the unquantized model. It looks like the unquantized model updated these files after we ran the quantization.
To turn off this logic, please feel free to copy tokenizer.json from the unquantized model.
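Alternatively, you can strip these settings locally with the tokenizers API and re-save the file (a rough sketch, not tested against this exact repo; the local path is just an example):

from tokenizers import Tokenizer

# Load the local copy of this repo's tokenizer.json (path is an example).
tok = Tokenizer.from_file("tokenizer.json")

# Drop the baked-in 512-token truncation and fixed-length padding.
tok.no_truncation()
tok.no_padding()

# Re-save; the written tokenizer.json should no longer carry the
# "truncation"/"padding" blocks.
tok.save("tokenizer.json")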
ekurtic changed discussion status to closed