vocab.txt

by jzhang86 - opened Aug 2, 2023

Discussion

jzhang86

Aug 2, 2023

where can I find the vocab.txt for this multilingual model?

intfloat

Owner Aug 3, 2023

The vocabulary is based on SentencePiece rather than WordPiece, which is what BERT uses.

You can use the following code to print the vocab:

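from transformers import AutoTokenizer
# Load the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
# get_vocab() returns a dict mapping each token string to its integer id
print(tokenizer.get_vocab())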

jzhang86

Aug 3, 2023

@intfloat Thank you. So you are saying I can write vocab.txt from the tokenizer.vocab value? I don't understand why the multilingual e5 models don't come with a vocab.txt the way the English e5 model does.
The reason I am asking is I am trying to convert this model to ggml format using bert.cpp, which requires vocab.txt.

intfloat

Owner Aug 4, 2023

As far as I know, only models based on BERT have vocab.txt. Models like T5 and XLM-RoBERTa do not have this file.

The multilingual e5 models are based on XLM-RoBERTa instead of BERT.

So I would not try to run this model with a BERT codebase.
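
If a plain token listing is still wanted for inspection, here is a minimal sketch of writing a token-to-id mapping to a text file, one token per line in id order. The helper name dump_vocab, the output file name, and the toy mapping are made up for illustration. Note that the result is only a dump: SentencePiece tokens mark word starts with "▁" while WordPiece marks continuations with "##", so a WordPiece-based loader should not be expected to tokenize correctly from it.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def dump_vocab(vocab: Dict[str, int], path: str) -> int:
    """Write tokens to `path`, one per line, ordered by token id.

    Hypothetical helper for illustration; returns the number of tokens written.
    """
    # Sort by id so that the 0-based line number matches the token id
    # (this assumes the ids are contiguous and start at 0).
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")
    return len(tokens)


# Toy mapping standing in for the real tokenizer vocabulary (assumption:
# the real one would come from tokenizer.get_vocab()).
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "▁hello": 4}
out_path = os.path.join(tempfile.gettempdir(), "vocab_dump.txt")
n_written = dump_vocab(toy_vocab, out_path)
```

With the real model, the same function could be called as dump_vocab(tokenizer.get_vocab(), "vocab_dump.txt") after loading the tokenizer as shown earlier in the thread.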

sakthivel-radhakrishnan

Apr 23

@intfloat This model supports 94 languages. Is there a way to select only specific languages from the list? I only need 40 of them.
