How to use this on GPUs?
I modified the model loading line as follows:
```python
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained('joaoalvarenga/bloom-8bit', low_cpu_mem_usage=True, device_map="auto")
```
Using `device_map="auto"` automatically moves everything to the GPU, but when I try generating, I get the following error:
```
RuntimeError: CUDA out of memory. Tried to allocate 13.40 GiB (GPU 0; 31.75 GiB total capacity; 19.45 GiB already allocated; 11.20 GiB free; 19.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON
```
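
For reference, this is roughly the generation call that triggers the error; a minimal sketch, assuming a tokenizer loaded from the same checkpoint (the prompt is just an example, not my exact script):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('joaoalvarenga/bloom-8bit')

# Inputs go to the device of the first model shard (cuda:0 under device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```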
I have 8× 32 GB GPUs, so everything should fit. Am I doing this right, or is there a specific way to do decoding on a GPU?
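
In case it matters, this is the kind of per-device cap I was expecting `device_map="auto"` to honor via `max_memory`; the limits below are guesses for my setup, not tested values:

```python
from transformers import BloomForCausalLM

# Cap what device_map="auto" may place on each of the 8 GPUs,
# leaving headroom below the 32 GB per-card capacity (guessed values)
max_memory = {i: "30GiB" for i in range(8)}
max_memory["cpu"] = "100GiB"  # optional CPU spillover, also a guess

model = BloomForCausalLM.from_pretrained(
    'joaoalvarenga/bloom-8bit',
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory=max_memory,
)
```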
Thanks in advance.