Vocabulary seems to be mostly English
#3 by jforan - opened
The tokenizer.json seems to be the same as for the original GPT-NeoX model.
Is there a reason you didn't retrain the vocabulary so as to have more Japanese subtokens? I would have guessed that this would give even better performance in Japanese.
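For what it's worth, one quick way to see the effect of the English-centric vocabulary is to tokenize the same Japanese sentence with both models and compare the token counts. A minimal sketch (the example sentence is arbitrary, and I'm assuming both checkpoints load with plain AutoTokenizer.from_pretrained):

```python
from transformers import AutoTokenizer

# Arbitrary Japanese example sentence; token counts are illustrative only.
text = "日本語のトークン化の効率を比較します。"

for name in ["matsuo-lab/weblab-10b", "rinna/bilingual-gpt-neox-4b"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)
    print(name, len(tokens), "tokens")
```

An English-heavy byte-level vocabulary typically splits Japanese text into many small byte-level pieces, while a vocabulary trained on Japanese data groups the same text into far fewer, word-like tokens.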
I checked how many tokens in the vocabulary contain characters that take 3 or more bytes in UTF-8 (most Japanese characters are 3 bytes), and for this model the result was 0. I also want to know how they trained the tokenizer.
Tokens containing characters of 3 or more bytes:
matsuo-lab/weblab-10b: 0 / 50254
rinna/bilingual-gpt-neox-4b: 41599 / 65536
The code used for the counts above is as follows.

```python
from transformers import AutoTokenizer

modelnames = ["matsuo-lab/weblab-10b", "rinna/bilingual-gpt-neox-4b"]

# Collect the full vocabulary of each tokenizer as token strings.
model_dict = {}
for name in modelnames:
    tokenizer = AutoTokenizer.from_pretrained(name)
    vocab = tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
    model_dict[name] = vocab

def has_multibyte_chars(input_str):
    # True if any character in the token needs 3 or more bytes in UTF-8
    # (most Japanese characters are 3 bytes).
    return any(len(char.encode("utf-8")) > 2 for char in input_str)

print("Tokens containing characters of 3 or more bytes")
for modelname in modelnames:
    cnt = 0  # reset the count for each model
    for t in model_dict[modelname]:
        if has_multibyte_chars(t):
            cnt += 1
    print(" " + modelname, cnt, "/", len(model_dict[modelname]))
```