GGUF files have tokenizer issues
#1 opened by JohannesGaessler
The models in this repository seem to have tokenizer issues (see https://github.com/ggerganov/llama.cpp/pull/6936#issuecomment-2107368738 ), which cause degraded results. This is indicated by the following warning when running the models:
```
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
```
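This warning fires when the GGUF metadata lacks a pre-tokenizer type (the `tokenizer.ggml.pre` key that newer llama.cpp conversions write). As a quick sanity check before loading a model, you can scan the file header for that key; GGUF stores metadata key names as length-prefixed UTF-8 strings, so the name appears verbatim in the raw bytes. This is a crude sketch, not a full GGUF parser, and `tokenizer.ggml.pre` is assumed to be the relevant key:

```python
def has_pre_tokenizer(gguf_bytes: bytes) -> bool:
    # GGUF metadata keys are stored as length-prefixed UTF-8 strings,
    # so the key name shows up literally in the header bytes.
    # Crude substring scan; does not parse the GGUF structure.
    return b"tokenizer.ggml.pre" in gguf_bytes


def check_file(path: str, header_size: int = 1 << 20) -> bool:
    # Metadata sits at the start of the file, so reading the first
    # megabyte is normally enough for this heuristic.
    with open(path, "rb") as f:
        return has_pre_tokenizer(f.read(header_size))
```

If the check returns False, the model was likely converted before the BPE pre-tokenizer fix and should be regenerated with a current `convert-hf-to-gguf.py`.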
@JohannesGaessler I’ll look into this and get back to you. V1 was converted before the BPE fix; this one was done after it.