Model loading taking too much GPU memory
Hey, when I try to load the model using the exact code given in the repo card, it keeps giving me a CUDA out-of-memory error. I am using an NVIDIA V100 with 16 GB of VRAM. Given that I have run LLMs with more parameters, as well as speech-to-text models, on this GPU, this doesn't make sense to me. Am I doing something wrong?
Hello Tehreem, and thanks for trying the model!
Our model will only run on 16 GB GPUs in quantized mode; you can find the sample code here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#quantized-versions-through-bitsandbytes
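For reference, a quantized load with bitsandbytes usually looks something like the sketch below. This is a minimal approximation using the standard transformers APIs; the model card link above has the exact recommended code.

```python
# Minimal sketch of 4-bit quantized loading via bitsandbytes
# (approximate; see the model card for the exact recommended code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "silma-ai/SILMA-9B-Instruct-v1.0"

# 4-bit weights take roughly a quarter of the BF16 footprint,
# which is what lets a 9B model fit on a 16 GB card.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```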
You can also find our recommended GPU requirements here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#gpu-requirements
Finally, here is a probable technical explanation of why you got the OOM error:
- Our model has 9B parameters, each stored in BF16/FP16 (a 16-bit floating-point format)
- Each parameter therefore takes 2 bytes (16 bits), so 9 billion parameters occupy 18 billion bytes
- To convert that to GB, divide 18 billion bytes by 1,073,741,824 (since 1 GB = 1,073,741,824 bytes)
- Therefore, you need about 16.76 GB of GPU memory just to load the weights (see the quick check below)
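You can sanity-check the arithmetic with a couple of lines of Python:

```python
# Quick check of the numbers above
params = 9e9                # 9B parameters
bytes_per_param = 2         # BF16/FP16 = 16 bits = 2 bytes
total_bytes = params * bytes_per_param   # 1.8e10 bytes
print(total_bytes / 1_073_741_824)       # ~16.76 GB, more than a 16 GB V100
```

Note that this covers the weights only; activations and the KV cache during generation require additional memory on top of that.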
Thanks for your reply @karimouda! I was able to run it using a multi-GPU setup.
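In case it helps anyone else who lands here, a multi-GPU load can look something like the sketch below, using `device_map="auto"` (my exact code may have differed slightly):

```python
# Minimal sketch: shard the unquantized model across all visible GPUs
# (requires the accelerate package; exact setups vary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "silma-ai/SILMA-9B-Instruct-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantized BF16 weights (~16.76 GB total)
    device_map="auto",           # splits layers across available GPUs
)
```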