vocab.txt
Where can I find the vocab.txt for this multilingual model?
The vocabulary is based on SentencePiece rather than WordPiece, which is what BERT uses, so there is no vocab.txt.
You can use the following code to print the vocab:
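from transformers import AutoTokenizer
# Load the SentencePiece-based tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
# tokenizer.vocab is a dict mapping each token string to its integer id
print(tokenizer.vocab)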
@intfloat
Thank you. So you are saying I can write vocab.txt myself from the tokenizer.vocab value? I don't understand why the multilingual E5 models don't come with a vocab.txt the way the English E5 model does.
The reason I am asking is that I am trying to convert this model to ggml format using bert.cpp, which requires vocab.txt.
As far as I know, only models based on BERT have a vocab.txt; models like T5 and XLM-RoBERTa do not have this file.
The multilingual E5 models are based on XLM-RoBERTa, not BERT.
So I don't think you should try to run this model with a BERT codebase.
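If you only want to inspect the vocabulary offline, the token-to-id mapping can still be dumped to a text file. The following is a minimal sketch, not an official export path: dump_vocab and the output file name are hypothetical names made up for illustration, and it assumes the ids are contiguous starting at 0 so that the line number matches the token id. Note that the resulting file is a SentencePiece token list, not a WordPiece vocab, so it would not by itself make a BERT-style tokenizer produce the right ids.

```python
from pathlib import Path


def dump_vocab(vocab, path):
    """Write tokens one per line, ordered by token id (hypothetical helper)."""
    # vocab maps token string -> integer id, e.g. the dict from tokenizer.get_vocab()
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    tokens = [token for token, _ in ordered]
    # Assumption: ids are contiguous from 0, so line i holds the token with id i
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")
    return len(tokens)


# Usage with the real tokenizer (downloads the tokenizer files, so left commented out):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
# dump_vocab(tokenizer.get_vocab(), 'vocab_dump.txt')
```

The helper takes a plain dict so it can be tried on a small hand-written mapping before pointing it at the full tokenizer.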
@intfloat This model supports 94 languages. Is there a way to restrict it to specific languages from the list? I only need 40 of them.