Actual number of parameters?

#5
by keenanpepper

Why is this model called -1M when it appears the actual number of parameters is 3745984?

from transformers import AutoModelForCausalLM

# Load the checkpoint and count every parameter tensor in it.
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M")
sum(p.numel() for p in model.parameters())

3745984

Or if you exclude the 3216448-parameter token embedding matrix (by far the bulk of the total parameters), the number of other parameters is 529536. But that's more like 500k than 1M. So why isn't this named either TinyStories-4M or TinyStories-500k? What does the -1M refer to?
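For what it's worth, here's a quick sketch reproducing that split. It assumes the checkpoint follows the usual transformers GPT-Neo layout, with the token embedding at model.transformer.wte:

# Sketch: split the total into the token embedding matrix and everything else.
# Assumes the standard GPT-Neo layout (token embedding at model.transformer.wte).
total = sum(p.numel() for p in model.parameters())    # 3745984
embedding = model.transformer.wte.weight.numel()      # 3216448 = 50257 vocab * 64 dims
print(total - embedding)                              # 529536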

Oh, I think I may have figured it out!

From the TinyStories paper:

"Our models are available on Huggingface named TinyStories-1M/3M/9M/28M/33M/1Layer/2Layer and TinyStories-Instruct-βˆ—. We
use GPT-Neo architecture with window size 256 and context length 512. We use GPT-Neo tokenizer but only keep the top 10K most
common tokens."

So they only kept the top 10K most common tokens for training, but the models here have the full vocabulary size of 50257 in their embedding matrices. I guess for distribution the trained embeddings were padded out (with what, zeros? garbage?) so the models work plug-and-play with the much more common full tokenizer?

The math works out: instead of 3216448 (embedding matrix) + 529536 = 3745984, the trained embedding would have been 10000 × 64 = 640000, giving 640000 + 529536 = 1169536. That makes a lot more sense as a "-1M" model, so I bet this is how it was trained.
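A quick sanity check on that arithmetic (the hidden size of 64 is inferred from 3216448 / 50257, not stated anywhere in the thread):

# Rough check; hidden size 64 is an inference from 3216448 / 50257 = 64.
hidden = 64
shipped_embedding = 50257 * hidden   # 3216448, embedding matrix as distributed
trained_embedding = 10000 * hidden   #  640000, embedding with the 10K-token vocab
non_embedding = 529536               # everything outside the token embedding
print(trained_embedding + non_embedding)   # 1169536, i.e. roughly "1M"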
