Why does the model need so much CPU memory?
I am trying to load the model on multiple GPUs using device_map="auto". I am aware that it has to be loaded into CPU memory first. However, since the model is in float16, I expect it to take around 70 GB of CPU memory, and with 470 GB free that should be more than enough. Yet when I load the model and watch htop, all 32 cores run at 100% and memory usage grows rapidly until it fills the entire 470 GB, at which point the Python process loading the model is killed by the OS.
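For what it's worth, my understanding is that the device_map="auto" path is supposed to avoid exactly this blow-up: passing device_map implies low_cpu_mem_usage=True, which builds the model skeleton on PyTorch's meta device (no real allocation) and only materializes weights as checkpoint shards are read in. Here is a minimal, self-contained sketch of that meta-device mechanism, using a plain nn.Linear as a stand-in for the actual model (this is an illustration of the mechanism, not the transformers loading code itself):

```python
import torch
import torch.nn as nn

# Construct a layer on the "meta" device: parameter shapes and dtypes exist,
# but no CPU memory is allocated for the actual weight values.
with torch.device("meta"):
    layer = nn.Linear(8192, 8192)

print(layer.weight.is_meta)  # True: nothing has been materialized yet
```

If loading were going through this path correctly, peak CPU usage should stay close to the size of the shards being copied in, not many multiples of the model size.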
Do you experience such high CPU memory usage? Do you know what could be the issue here and how to mitigate it?
PS: I used the exact same code to load cognitivecomputations/dolphin-2_2-yi-34b, which has the exact same architecture as this model, and it loaded successfully without this excessive CPU memory usage. The only difference I can see between the two models (apart from the actual weight values) is that cognitivecomputations/dolphin-2_2-yi-34b ships .bin files, whereas this model ships .safetensors files. Could that somehow be the cause of the problem?
Thanks in advance.
Upgrading transformers from 4.38.1 to 4.38.2 solved the problem. I am not sure, however, what changed between the two versions: the release notes (https://github.com/huggingface/transformers/releases) for v4.38.2 say the changes mostly fixed backward-compatibility issues with Llama (the architecture underlying this model) and Gemma.
The most intriguing thing for me is that the changes reduced CPU memory usage during model loading by at least 400/65 ≈ 6 times the model size on disk. I would be grateful if someone could explain why the changes are so effective.
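To spell out that estimate (all figures are the approximate values reported above, not new measurements):

```python
# Rough arithmetic behind the "at least 6x" claim.
free_gb = 470      # CPU memory available before loading
expected_gb = 70   # ~34B params x 2 bytes (float16)
disk_gb = 65       # approximate checkpoint size on disk

# The broken version consumed all free memory, so it used at least this
# much more than the expected footprint:
excess_gb = free_gb - expected_gb
print(round(excess_gb / disk_gb, 1))  # prints 6.2
```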
Thanks in advance.