Broken tokenizer

#77
by anferico - opened

@patrickvonplaten I think you broke the tokenizer by deleting "tokenizer.model". Now this throws an error:

 tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
File ".../lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
  with open(self.vocab_file, "rb") as f:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
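For anyone wondering where this error comes from: when `tokenizer.model` is absent from the repo, the slow tokenizer's `vocab_file` never gets resolved and stays `None`, so `open(self.vocab_file, "rb")` is effectively `open(None, "rb")`. A minimal standalone reproduction of that failure mode (not using transformers at all, just the same `open` call):

```python
# Simulate what transformers resolves when tokenizer.model is missing
# from the repository: the vocab file path is simply None.
vocab_file = None

try:
    # Same call as in tokenization_llama.py's get_spm_processor
    with open(vocab_file, "rb") as f:
        f.read()
except TypeError as e:
    # Python raises the exact TypeError seen in the traceback above
    print(type(e).__name__)
```

So the error is raised by Python's built-in `open`, not by any tokenizer logic, which is why the traceback points at `get_spm_processor`.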

as a temporary fix I set the tokenizer revision to f67d0f47df7707eddf3fb61000e3e8713074f45c

Glad to hear I'm not the only one bitten by this error lol. Here is a more complete code snippet for easier paste and run:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load with a specific commit hash (the one before `tokenizer.model` was deleted)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    revision="f67d0f47df7707eddf3fb61000e3e8713074f45c",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Mistral AI org

Sorry about that - I confused an HF tokenizer file with a mistral-common one. Reverted it - should work again :-)

patrickvonplaten changed discussion status to closed
