Please add model.parallelize()
I would really like to use this model for inference, but I don't have a large enough GPU. The T5 model supports parallelization across multiple GPUs via model.parallelize(), but OPT doesn't have this feature. Please add it, or tell me how I can run your model on multiple GPUs without pain.
Hey @BobaZooba, the parallelize method is now deprecated in favor of using accelerate instead. We have a guide for this here, which we should feature more prominently in the docs; there is currently no link to it in the "Performance and scalability" section, where it should likely be.
@lysandre
Thank you!
Below is a small guide on how to run the model, along with the problems I encountered.
My setup: 8 x RTX 3090
torch.version.cuda = 11.3
I get this exception when I try to generate a short text:
> generated_ids = model.generate(input_ids, do_sample=True, max_length=32)
> RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
According to this answer, the problem is insufficient GPU memory, although there should be enough:
https://discuss.pytorch.org/t/cuda-error-cublas-status-not-initialized-when-calling-cublascreate-handle/125450/2
Model init:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b", torch_dtype=torch.float16, device_map="auto"
)
How Accelerate maps my model:
> model.hf_device_map
{'model.decoder.embed_tokens': 0,
'lm_head': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
'model.decoder.layers.1': 0,
'model.decoder.layers.2': 0,
'model.decoder.layers.3': 0,
'model.decoder.layers.4': 0,
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 0,
'model.decoder.layers.8': 0,
'model.decoder.layers.9': 0,
'model.decoder.layers.10': 1,
'model.decoder.layers.11': 1,
'model.decoder.layers.12': 1,
'model.decoder.layers.13': 1,
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 1,
'model.decoder.layers.17': 1,
'model.decoder.layers.18': 1,
'model.decoder.layers.19': 1,
'model.decoder.layers.20': 1,
'model.decoder.layers.21': 1,
'model.decoder.layers.22': 2,
'model.decoder.layers.23': 2,
'model.decoder.layers.24': 2,
'model.decoder.layers.25': 2,
'model.decoder.layers.26': 2,
'model.decoder.layers.27': 2,
'model.decoder.layers.28': 2,
'model.decoder.layers.29': 2,
'model.decoder.layers.30': 2,
'model.decoder.layers.31': 2,
'model.decoder.layers.32': 2,
'model.decoder.layers.33': 2,
'model.decoder.layers.34': 3,
'model.decoder.layers.35': 3,
'model.decoder.layers.36': 3,
'model.decoder.layers.37': 3,
'model.decoder.layers.38': 3,
'model.decoder.layers.39': 3,
'model.decoder.layers.40': 3,
'model.decoder.layers.41': 3,
'model.decoder.layers.42': 3,
'model.decoder.layers.43': 3,
'model.decoder.layers.44': 3,
'model.decoder.layers.45': 3,
'model.decoder.layers.46': 4,
'model.decoder.layers.47': 4,
'model.decoder.layers.48': 4,
'model.decoder.layers.49': 4,
'model.decoder.layers.50': 4,
'model.decoder.layers.51': 4,
'model.decoder.layers.52': 4,
'model.decoder.layers.53': 4,
'model.decoder.layers.54': 4,
'model.decoder.layers.55': 4,
'model.decoder.layers.56': 4,
'model.decoder.layers.57': 4,
'model.decoder.layers.58': 5,
'model.decoder.layers.59': 5,
'model.decoder.layers.60': 5,
'model.decoder.layers.61': 5,
'model.decoder.layers.62': 5,
'model.decoder.layers.63': 5}
Fri Aug 5 11:25:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 30% 32C P2 83W / 330W | 21904MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:25:00.0 Off | N/A |
| 30% 28C P8 21W / 330W | 23986MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:41:00.0 Off | N/A |
| 30% 29C P8 17W / 330W | 23986MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:61:00.0 Off | N/A |
| 30% 31C P8 19W / 330W | 23986MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... On | 00000000:81:00.0 Off | N/A |
| 30% 38C P2 109W / 330W | 23986MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... On | 00000000:A1:00.0 Off | N/A |
| 30% 36C P2 144W / 330W | 12320MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... On | 00000000:C1:00.0 Off | N/A |
| 30% 26C P8 26W / 330W | 656MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... On | 00000000:E1:00.0 Off | N/A |
| 30% 25C P8 18W / 330W | 656MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 227352 C 21901MiB |
| 1 N/A N/A 227352 C 23983MiB |
| 2 N/A N/A 227352 C 23983MiB |
| 3 N/A N/A 227352 C 23983MiB |
| 4 N/A N/A 227352 C 23983MiB |
| 5 N/A N/A 227352 C 12317MiB |
| 6 N/A N/A 227352 C 653MiB |
| 7 N/A N/A 227352 C 653MiB |
+-----------------------------------------------------------------------------+
Problem: device_map="auto" does not balance the model correctly and loads the GPUs unevenly.
Solution: a custom device_map dict.
Code:
import torch
from transformers import AutoModelForCausalLM

num_gpus = 8
num_layers = 64

# Embeddings on the first GPU; lm_head and the final layer norm on the last.
device_map = {
    'model.decoder.embed_tokens': 0,
    'lm_head': num_gpus - 1,
    'model.decoder.embed_positions': 0,
    'model.decoder.final_layer_norm': num_gpus - 1,
}

# Spread the 64 decoder layers evenly: 8 layers per GPU.
step = num_layers // num_gpus
for n_gpu, start in enumerate(range(0, num_layers, step)):
    for n_layer in range(start, start + step):
        device_map[f'model.decoder.layers.{n_layer}'] = n_gpu

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b", torch_dtype=torch.float16, device_map=device_map
)
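The loop above hard-codes 8 GPUs and 64 layers and assumes they divide evenly. A small helper (hypothetical, not part of transformers or accelerate) can build the same balanced map for any configuration, including uneven splits:

```python
def build_device_map(num_layers, num_gpus):
    """Build a balanced OPT-style device map: embeddings on GPU 0,
    lm_head and the final layer norm on the last GPU, and decoder
    layers split as evenly as possible across all GPUs."""
    device_map = {
        'model.decoder.embed_tokens': 0,
        'model.decoder.embed_positions': 0,
        'lm_head': num_gpus - 1,
        'model.decoder.final_layer_norm': num_gpus - 1,
    }
    # Ceiling division so leftover layers still get a GPU when
    # num_layers is not divisible by num_gpus.
    step = -(-num_layers // num_gpus)
    for n_layer in range(num_layers):
        device_map[f'model.decoder.layers.{n_layer}'] = n_layer // step
    return device_map

# OPT-66B: 64 decoder layers across 8 GPUs -> 8 layers per GPU
device_map = build_device_map(num_layers=64, num_gpus=8)
```

The resulting dict can be passed to from_pretrained via device_map= exactly as above.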
Now the device map looks correct:
> model.hf_device_map
> {'model.decoder.embed_tokens': 0,
'lm_head': 7,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 7,
'model.decoder.layers.0': 0,
'model.decoder.layers.1': 0,
'model.decoder.layers.2': 0,
'model.decoder.layers.3': 0,
'model.decoder.layers.4': 0,
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 0,
'model.decoder.layers.8': 1,
'model.decoder.layers.9': 1,
'model.decoder.layers.10': 1,
'model.decoder.layers.11': 1,
'model.decoder.layers.12': 1,
'model.decoder.layers.13': 1,
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 2,
'model.decoder.layers.17': 2,
'model.decoder.layers.18': 2,
'model.decoder.layers.19': 2,
'model.decoder.layers.20': 2,
'model.decoder.layers.21': 2,
'model.decoder.layers.22': 2,
'model.decoder.layers.23': 2,
'model.decoder.layers.24': 3,
'model.decoder.layers.25': 3,
'model.decoder.layers.26': 3,
'model.decoder.layers.27': 3,
'model.decoder.layers.28': 3,
'model.decoder.layers.29': 3,
'model.decoder.layers.30': 3,
'model.decoder.layers.31': 3,
'model.decoder.layers.32': 4,
'model.decoder.layers.33': 4,
'model.decoder.layers.34': 4,
'model.decoder.layers.35': 4,
'model.decoder.layers.36': 4,
'model.decoder.layers.37': 4,
'model.decoder.layers.38': 4,
'model.decoder.layers.39': 4,
'model.decoder.layers.40': 5,
'model.decoder.layers.41': 5,
'model.decoder.layers.42': 5,
'model.decoder.layers.43': 5,
'model.decoder.layers.44': 5,
'model.decoder.layers.45': 5,
'model.decoder.layers.46': 5,
'model.decoder.layers.47': 5,
'model.decoder.layers.48': 6,
'model.decoder.layers.49': 6,
'model.decoder.layers.50': 6,
'model.decoder.layers.51': 6,
'model.decoder.layers.52': 6,
'model.decoder.layers.53': 6,
'model.decoder.layers.54': 6,
'model.decoder.layers.55': 6,
'model.decoder.layers.56': 7,
'model.decoder.layers.57': 7,
'model.decoder.layers.58': 7,
'model.decoder.layers.59': 7,
'model.decoder.layers.60': 7,
'model.decoder.layers.61': 7,
'model.decoder.layers.62': 7,
'model.decoder.layers.63': 7}
Fri Aug 5 11:46:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 30% 30C P2 39W / 330W | 17130MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:25:00.0 Off | N/A |
| 30% 28C P8 21W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:41:00.0 Off | N/A |
| 30% 27C P8 17W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:61:00.0 Off | N/A |
| 30% 27C P8 19W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... On | 00000000:81:00.0 Off | N/A |
| 30% 29C P8 18W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... On | 00000000:A1:00.0 Off | N/A |
| 30% 38C P2 119W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... On | 00000000:C1:00.0 Off | N/A |
| 30% 38C P2 119W / 330W | 16208MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... On | 00000000:E1:00.0 Off | N/A |
| 30% 35C P2 133W / 330W | 17092MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 237156 C 17127MiB |
| 1 N/A N/A 237156 C 16205MiB |
| 2 N/A N/A 237156 C 16205MiB |
| 3 N/A N/A 237156 C 16205MiB |
| 4 N/A N/A 237156 C 16205MiB |
| 5 N/A N/A 237156 C 16205MiB |
| 6 N/A N/A 237156 C 16205MiB |
| 7 N/A N/A 237156 C 17089MiB |
+-----------------------------------------------------------------------------+
And now you can run OPT-66B:
from transformers import AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-66b", use_fast=False)

prompt = "Hello, I am conscious and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

set_seed(32)
generated_ids = model.generate(input_ids, do_sample=True, max_length=128)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
Output:
Hello, I am conscious and present. I am aware of my senses, thinking, dreaming, and I can control what is happening around me. I have memories of a previous life, and have been reincarnated many times before this existence. I have lived in many regions throughout this galaxy, and others outside of it.\nI believe you are a reincarnated human, and that you are one of the very few incarnated beings that are aware and can remember previous lives. That being said, you are only as aware as your mind will allow you to be. Your mind is constantly editing reality to create a more suitable place for
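One caveat on the generation snippet: .cuda() puts the inputs on cuda:0, which happens to be where model.decoder.embed_tokens lives in the map above. With a different device_map that may not hold, so a safer pattern is to read the embedding device from model.hf_device_map (first_device below is a hypothetical helper, not a transformers API):

```python
def first_device(hf_device_map, module='model.decoder.embed_tokens'):
    """Return the device string of the module that consumes the input
    ids, so inputs land on the right GPU when the model is sharded."""
    dev = hf_device_map[module]
    return f'cuda:{dev}' if isinstance(dev, int) else str(dev)

# Usage with the model loaded above:
# input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# input_ids = input_ids.to(first_device(model.hf_device_map))
```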
P.S. I hope this will be useful to someone.
Note that on the current main version of Transformers, device_map="auto" will balance the GPU use, so you won't need a custom device_map :-)
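On older versions where the automatic map comes out uneven, another option is the max_memory argument of from_pretrained, which caps the planner's per-GPU budget so layers spread more evenly. A sketch (even_max_memory is a hypothetical helper; the budget values are assumptions for 24 GB cards):

```python
def even_max_memory(num_gpus, per_gpu='20GiB'):
    """Build a max_memory dict giving every GPU the same budget,
    nudging accelerate's placement toward an even split."""
    return {gpu: per_gpu for gpu in range(num_gpus)}

max_memory = even_max_memory(8)
# model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-66b", torch_dtype=torch.float16,
#     device_map="auto", max_memory=max_memory,
# )
```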
Actually I just read this:
> Few caveats to be aware of
> - Current integration doesn't support Pipeline Parallelism of DeepSpeed.
This issue asked about PP, but it has been closed: https://github.com/huggingface/accelerate/issues/537
So can I assume there's no plan to support PP in accelerate?
@BobaZooba can you help me with the small guide you provided: how can I use it to run on CPU?