Discrepancy in vocab size
In config.json, it states that the vocab size is 61952; however, if we access the vocab_size attribute on the tokenizer object, it states that the vocab size is 61873. What is the reason for this discrepancy? Is it okay if I change config.json to match the tokenizer's vocab size?
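For reference, this is roughly how I'm reading the two values (the checkpoint name below is just a placeholder for the actual model):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(config.vocab_size)     # 61952
print(tokenizer.vocab_size)  # 61873
```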
Hi Richard,
The reason is that the vocabulary is padded to meet tensor parallelism requirements. The last rows of the embedding matrix are unused, and the effective vocab size is 61873. You can change config.json, but I believe you would also need to change the model's embedding matrix to match.
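If you do go that route, the simplest approach is probably resize_token_embeddings, which trims the embedding matrix (and any tied output head) and should also update config.vocab_size for you. A rough, untested sketch, with a placeholder checkpoint name and assuming a causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shrink the embedding matrix to the tokenizer's size (61873 here).
# Note that len(tokenizer) also counts any added special tokens.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("model-61873")
tokenizer.save_pretrained("model-61873")
```

Double-check the saved config.json afterwards to confirm that vocab_size was updated.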
Best,
Jeff
Thanks for the reply!
I'm trying to get the .vocab_size attribute to match config.json, so I want to add extra padding tokens. The reason I'm asking is that NeMo's conversion script, which transforms a Transformers model into their .nemo format, checks that the vocab sizes are consistent, and I'd rather not disable any of the checks that the script makes.
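Here is roughly what I had in mind for padding the tokenizer out to the config's size (the dummy token strings and save path below are made up). One thing I'm not sure about: tokens added this way are counted by len(tokenizer) but not necessarily by the .vocab_size property on fast tokenizers, so this may or may not be enough to satisfy NeMo's check:

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "my-org/my-model"  # placeholder for the actual checkpoint

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add dummy tokens until the tokenizer covers the padded vocab size.
# The token strings are arbitrary; pick ones that can't occur in real text.
n_missing = config.vocab_size - len(tokenizer)
if n_missing > 0:
    tokenizer.add_tokens([f"<vocab_pad_{i}>" for i in range(n_missing)])

print(len(tokenizer))        # 61952, matching config.vocab_size
print(tokenizer.vocab_size)  # may still report 61873
tokenizer.save_pretrained("tokenizer-padded")
```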