data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3

#17
by sigridjineth - opened

It seems there's no vocab.json in this repository.

When running my data module with PyTorch Lightning,

from transformers import AutoTokenizer

self.tokenizer = AutoTokenizer.from_pretrained(model, local_files_only=True) # model: /root/jina-reranker-v2-base-multilingual

I am getting this error:

  File "/root/venv/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 112, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3

On my local machine, I just cloned the latest main branch of the HF repo.

(venv) root@99074ab04cc2:~/FlagEmbedding/experiments/240710/jina# ls /root/jina-reranker-v2-base-multilingual
README.md                     embedding.py             pytorch_model.bin        xlm_padding.py
block.py                      mha.py                   special_tokens_map.json
config.json                   mlp.py                   tokenizer.json
configuration_xlm_roberta.py  modeling_xlm_roberta.py  tokenizer_config.json
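For what it's worth, the exception is raised while the Rust `tokenizers` backend deserializes tokenizer.json: an older `tokenizers` build does not recognize a newer variant or field in the file's `pre_tokenizer` section (the "line 69 column 3" in the message is a position inside that JSON file). A stdlib-only sketch of how to inspect that section yourself — the Metaspace entry below is an illustrative stand-in, not the actual contents of this repo's tokenizer.json:

```python
import json

# Stand-in for a tokenizer.json; in practice load the real file instead, e.g.
# data = json.load(open("/root/jina-reranker-v2-base-multilingual/tokenizer.json"))
data = {
    "version": "1.0",
    "pre_tokenizer": {
        "type": "Metaspace",
        "replacement": "\u2581",
        # Fields added in newer tokenizers releases (like prepend_scheme) are
        # exactly what an old build fails to match against its untagged enum.
        "prepend_scheme": "always",
    },
}

# Print the block the error message points at.
print(json.dumps(data["pre_tokenizer"], indent=2))
```

If the `pre_tokenizer` block contains a type or field your installed `tokenizers` predates, upgrading the package (rather than editing the JSON) is the safe fix.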

I get the same error "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 69 column 3"

Update tokenizers and transformers to the latest versions.

Indeed, I suspect that this should help:

pip install -U transformers tokenizers
  • Tom Aarsen
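After upgrading, it may be worth confirming which versions are actually active in the current environment; a small stdlib-only sketch (package names taken from the pip command above, printed versions will vary):

```python
from importlib.metadata import PackageNotFoundError, version

# Report the installed versions of the two packages the upgrade targets.
for pkg in ("transformers", "tokenizers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed in this environment")
```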
