CUDA error when initialising model with text-generation-inference
Hi everyone, I was trying to deploy the model using the text-generation-inference toolkit on an AWS EC2 g5.24xlarge instance with 96 GB of GPU memory (4 GPUs with 24 GB each), but while the model is initialising I receive the following message (once per GPU):
--torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacty of 21.99 GiB of which 77.00 MiB is free. Process 27730 has 21.90 GiB memory in use. Of the allocated memory 21.47 GiB is allocated by PyTorch, and 42.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF--
I have used this command to launch the service:
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v /dev/data:/data ghcr.io/huggingface/text-generation-inference:1.3.0 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --max-batch-total-tokens 1024000 --max-total-tokens 32000
It seems that PyTorch is reserving GPU memory, which causes the model load to fail, but I don't know how to address this. Can somebody help me understand the problem or how to fix it?
Thanks in advance.
I would try quantizing the model as described here: https://huggingface.co/docs/text-generation-inference/conceptual/quantization or running it in float16. I'm not very familiar with TGI, but you might need more memory for the --max-batch-total-tokens value you are using.
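To put a rough number on the token-budget concern, here is a minimal back-of-the-envelope sketch of how much key/value cache 1,024,000 tokens would need in 16-bit precision. The model dimensions (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Mixtral-8x7B configuration, not from this thread, and whether TGI actually tries to reserve the full budget depends on the version.

```python
# Rough KV-cache sizing for a token budget.
# ASSUMPTION: the dimensions below come from the published Mixtral-8x7B
# config (32 layers, 8 KV heads, head_dim 128), not from this thread.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache keys and values for a single token."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(total_tokens: int, bytes_per_token: int) -> float:
    """Total cache size in GiB for a given number of cached tokens."""
    return total_tokens * bytes_per_token / 2**30


per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
budget_gib = kv_cache_gib(1_024_000, per_token)  # value passed to --max-batch-total-tokens

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for the full budget: {budget_gib:.1f} GiB")
```

Under these assumptions the cache for the requested budget alone would be larger than the 96 GB available across the four GPUs, before counting any model weights.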
What worked for me was to enable device_map="auto".
So in the line where you load the model, change it to: model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
This makes the model use all 4 GPUs
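For context, a minimal sketch of that suggestion as a complete snippet. It applies when loading the model directly with the transformers library (device_map="auto" also needs the accelerate package installed), not when launching the TGI container, where the commands in this thread use --num-shard instead. The helper name load_sharded_model is hypothetical.

```python
def load_sharded_model(model_id: str):
    """Load a causal LM with its layers spread over the visible devices.

    load_sharded_model is a hypothetical helper name used for illustration.
    """
    # Imported lazily so that defining the helper does not require the
    # transformers package to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" lets accelerate decide which device holds each layer.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


# Example usage (downloads the full model, so it is left commented out):
# tokenizer, model = load_sharded_model("mistralai/Mixtral-8x7B-Instruct-v0.1")
# print(model.hf_device_map)  # shows which device each module was placed on
```

Spreading the layers over several GPUs does not reduce the total memory the weights need, so this alone would not help if the model does not fit in the combined GPU memory.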
Hi, thanks for the response. How can I apply this change when using TGI?
Thanks in advance.
Hi, I was finally able to run the model with TGI by using in-place quantisation (I assume my current setup does not have enough memory to run the unquantised model). I also used the default value for the --max-total-tokens flag.
Here is the command I used in case it is useful for someone else:
sudo docker run -d --gpus all --shm-size 1g -p $port:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --quantize eetq
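A rough arithmetic check of why quantisation makes the difference on this hardware, as a sketch only. It assumes about 46.7 billion total parameters for Mixtral-8x7B (the figure from Mistral's announcement, not from this thread), assumes EETQ stores weights in 8 bits, and ignores the KV cache, activations and CUDA overhead.

```python
# Estimate the weight memory each GPU must hold when the model is split
# evenly across shards.
# ASSUMPTIONS: ~46.7e9 total parameters; 2 bytes per weight for float16,
# 1 byte per weight for 8-bit quantisation; everything except the
# weights is ignored.

ASSUMED_PARAMS = 46.7e9
NUM_GPUS = 4  # matches --num-shard 4 in the command above


def weights_gib_per_gpu(num_params: float, bytes_per_param: float,
                        num_gpus: int) -> float:
    """GiB of weights per GPU for an even split across num_gpus."""
    return num_params * bytes_per_param / num_gpus / 2**30


fp16 = weights_gib_per_gpu(ASSUMED_PARAMS, 2, NUM_GPUS)
int8 = weights_gib_per_gpu(ASSUMED_PARAMS, 1, NUM_GPUS)

print(f"float16: {fp16:.1f} GiB of weights per GPU")
print(f"8-bit:   {int8:.1f} GiB of weights per GPU")
```

Under these assumptions the float16 weights take roughly 21.7 GiB per GPU, which is close to the 21.99 GiB total capacity reported in the error message, while 8-bit weights need about half of that and leave room for the cache.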
Glad you made it work