Vocabulary seems to be mostly English
#3 by jforan - opened
The tokenizer.json seems to be the same as for the original GPT-NeoX model.
Is there a reason you didn't retrain the vocabulary so as to have more Japanese subtokens? I would have guessed that this would give even better performance in Japanese.
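For what it's worth, one quick way to see the effect of the English-centric vocabulary is to tokenize the same Japanese sentence with both models and compare the token counts. A minimal sketch (the example sentence is arbitrary, and I'm assuming both checkpoints load with plain AutoTokenizer.from_pretrained):

```python
from transformers import AutoTokenizer

# Arbitrary Japanese example sentence; token counts are illustrative only.
text = "日本語のトークン化の効率を比較します。"

for name in ["matsuo-lab/weblab-10b", "rinna/bilingual-gpt-neox-4b"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)
    print(name, len(tokens), "tokens")
```

An English-heavy byte-level vocabulary typically splits Japanese text into many small byte-level pieces, while a vocabulary trained on Japanese data groups the same text into far fewer, word-like tokens.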
I checked how many tokens in the vocabulary contain characters that take 3 or more bytes in UTF-8 (most Japanese characters are 3 bytes), and for this model the result was 0. I also want to know how they trained the tokenizer.
Tokens containing characters of 3 or more bytes:
matsuo-lab/weblab-10b: 0 / 50254
rinna/bilingual-gpt-neox-4b: 41599 / 65536
The code used for the counts above is as follows.

```python
from transformers import AutoTokenizer

modelnames = ["matsuo-lab/weblab-10b", "rinna/bilingual-gpt-neox-4b"]

# Collect the full vocabulary of each tokenizer as token strings.
model_dict = {}
for name in modelnames:
    tokenizer = AutoTokenizer.from_pretrained(name)
    vocab = tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
    model_dict[name] = vocab

def has_multibyte_chars(input_str):
    # True if any character in the token needs 3 or more bytes in UTF-8
    # (most Japanese characters are 3 bytes).
    return any(len(char.encode("utf-8")) > 2 for char in input_str)

print("Tokens containing characters of 3 or more bytes")
for modelname in modelnames:
    cnt = 0  # reset the count for each model
    for t in model_dict[modelname]:
        if has_multibyte_chars(t):
            cnt += 1
    print(" " + modelname, cnt, "/", len(model_dict[modelname]))
```