Discrepancy in vocab size
In config.json, it states that the vocab size is 61952; however, if we access the vocab_size attribute on the tokenizer object, it states that the vocab size is 61873. What is the reason for this discrepancy? Is it okay if I change config.json to match the tokenizer's vocab size?
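For reference, this is roughly how I'm reading the two values (the checkpoint name below is just a placeholder for the actual model):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(config.vocab_size)     # 61952
print(tokenizer.vocab_size)  # 61873
```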
Hi Richard,
The reason is that the vocabulary is padded to meet tensor parallelism requirements. The last rows of the embedding matrix are unused, and the effective vocab size is 61873. You can change config.json, but I believe you would also need to change the model's embedding matrix to match.
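If you do go that route, the simplest approach is probably resize_token_embeddings, which trims the embedding matrix (and any tied output head) and should also update config.vocab_size for you. A rough, untested sketch, with a placeholder checkpoint name and assuming a causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shrink the embedding matrix to the tokenizer's size (61873 here).
# Note that len(tokenizer) also counts any added special tokens.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("model-61873")
tokenizer.save_pretrained("model-61873")
```

Double-check the saved config.json afterwards to confirm that vocab_size was updated.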
Best,
Jeff
Thanks for the reply!
I'm trying to get the .vocab_size attribute to match config.json, so I want to add extra padding tokens. The reason I'm asking is that NeMo's conversion script, which transforms a Transformers model into their .nemo format, checks that the vocab sizes are consistent, and I'd rather not disable any of the checks that the script makes.
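Here is roughly what I had in mind for padding the tokenizer out to the config's size (the dummy token strings and save path below are made up). One thing I'm not sure about: tokens added this way are counted by len(tokenizer) but not necessarily by the .vocab_size property on fast tokenizers, so this may or may not be enough to satisfy NeMo's check:

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add dummy tokens until the tokenizer covers the padded vocab size.
# The token strings are arbitrary; pick ones that can't occur in real text.
n_missing = config.vocab_size - len(tokenizer)
if n_missing > 0:
    tokenizer.add_tokens([f"<vocab_pad_{i}>" for i in range(n_missing)])

print(len(tokenizer))        # 61952, matching config.vocab_size
print(tokenizer.vocab_size)  # may still report 61873
tokenizer.save_pretrained("tokenizer-padded")
```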