Error when loading tokenizer
I get the following error when the tokenizer is loaded:
Traceback (most recent call last):
  File "/Users/dschnell/test/embeddings.py", line 66, in <module>
    model, tokenizer = load_model_and_tokenizer(args.model)
  File "/Users/dschnell/test/embeddings.py", line 7, in load_model_and_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 702, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 155, in __init__
    super().__init__(
  File "/Users/dschnell/test/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: expected value at line 1 column 1
This is with Python 3.10.
Do you need Python 3.10 in your project? I just tried it with 3.9.13 and it works.
I could probably go with 3.9 as well. I just wonder why the above error occurs at all. I have successfully loaded 3 other models, including xlm-roberta-base, and never had this problem with Python 3.10.
Yes, that's weird, thanks for bringing it up. I really don't have a good hypothesis as to why this would be the case. If you, by any chance, figure it out, please let us know. I might look into it later.
With Python 3.10.5 and a freshly installed virtual environment, this code works for me:
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained('MaCoCu/XLMR-MaCoCu-is', use_fast=True)
t.tokenize('Yes, this is a test')
Could you please post a piece of your code that triggers the error?
Here is the gist: https://gist.github.com/lumpidu/f9d068146564f9aea94e42ed2c04f68d
Here is the list of installed packages: https://gist.github.com/lumpidu/2de7636408148f6d39475d1074592b7d
The error happens right at line 8. I have now also installed it on a Linux machine with Python 3.8.10. Same error.
But if I execute your code snippet from above, it works ... hmm, it seems my git clone of your repo mixed something up. Will investigate.
That error usually happens when you try to parse something as JSON that is not JSON at all. Note that unlike other models, this model has the tokenizer.json
file also tracked by git LFS, because it's bigger than 10MB. I just did a git clone and the file is just a pointer to the git LFS object; you might need to do something else to fetch it completely.
$ cat tokenizer.json
version https://git-lfs.github.com/spec/v1
oid sha256:62c24cdc13d4c9952d63718d6c9fa4c287974249e16b7ade6d5a85e7bbb75626
size 17082660
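In case it helps anyone else: a quick way to confirm that tokenizer.json is only an LFS pointer, plus an alternative to git clone that fetches the real files without needing git-lfs, could look roughly like this (a minimal sketch; it assumes the huggingface_hub package is installed and uses the repo id MaCoCu/XLMR-MaCoCu-is from the snippet above):
import json
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Check whether the local tokenizer.json is real JSON or just a git LFS pointer.
try:
    with open("tokenizer.json") as f:
        json.load(f)
    print("tokenizer.json is valid JSON")
except json.JSONDecodeError:
    print("tokenizer.json is not JSON (probably an un-pulled LFS pointer)")

# Alternative to git clone: let the Hub client download the actual files
# (no git-lfs needed) and load the tokenizer from the local snapshot directory.
local_dir = snapshot_download("MaCoCu/XLMR-MaCoCu-is")
t = AutoTokenizer.from_pretrained(local_dir)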
Yep, that was the problem; I needed to pull it. Now it works. Thanks for your help!
Btw. (hijacking my own thread ...) I have some more questions:
- How long did it take to fine-tune the model, and with how much GPU power?
- Are you aware of the other Icelandic text resources available at https://clarin.is/en/resources ?
Notably from a size POV:
- Icelandic Gigaword Corpus (IGC, 8.2 GB)
- Icelandic Common Crawl Corpus (4.9 GB)
How does the corpus you used relate to these?
You can find training details here: https://www.dlsi.ua.es/~mespla/MaCoCu_evaluation_report.pdf
Thanks for the link. Closing the issue.