The tokenizer's vocab_size is 39408, but the vocab_size in config.json is 39424. What is wrong?

by tanguofu - opened Jun 20, 2023

Jun 20, 2023

Any guidance would be appreciated.

qiyang

Fengshenbang-LM org Sep 4, 2023

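Please see this discussion. To support model parallelism, the training framework pads the vocabulary with dummy tokens up to 39424, because the vocab embedding has to be split into a size that is an integer multiple of the model-parallel (mp) degree.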
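A minimal sketch of that rounding in Python. The helper name `pad_vocab_size` and the alignment multiple of 128 are assumptions made for illustration: the thread does not state the actual multiple or mp degree, and 128 is picked only because it reproduces 39424 from 39408.

```python
def pad_vocab_size(vocab_size: int, multiple: int) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


tokenizer_vocab_size = 39408  # size reported by the tokenizer
assumed_multiple = 128        # hypothetical alignment; not stated in the thread

padded_vocab_size = pad_vocab_size(tokenizer_vocab_size, assumed_multiple)
num_dummy_tokens = padded_vocab_size - tokenizer_vocab_size

print(padded_vocab_size)  # 39424, matching config.json
print(num_dummy_tokens)   # 16 padding rows in the embedding
```

Under this assumption, the last 16 rows of the embedding matrix would be dummy entries that the tokenizer never emits, which is why the two numbers can differ without anything being broken.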

