General tips around inference speed?
Maybe I just haven't run a model this large before, but I'm blown away by how much slower inference is for this 16B model compared to a 12B-parameter model like Pythia. Are there any speed-up tips? Things I'm doing so far:
- On PyTorch 2.0
- Confirmed it's running on CUDA (`device=0` and VRAM is soaked)
- `torch_dtype=torch.bfloat16`
- Even tried loading in 8-bit, with no noticeable speed-up (but I guess that one is more about alleviating memory than speeding things up).
Any other ideas?
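For reference, here's roughly how I'm loading and running the model. The model id is just the one this thread is on, and the prompt and generation settings are placeholders, not my exact script:

```python
# Rough sketch of my current setup (placeholders, not my exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/starchat-alpha"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights instead of fp32
    device_map={"": 0},          # keep the whole model on GPU 0
)

# The 8-bit variant I also tried (requires bitsandbytes):
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, load_in_8bit=True, device_map={"": 0}
# )

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```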
Did PyTorch 2.0 give you a noticeable improvement?
I actually didn't try PyTorch 1.x, so I don't know! The last thing I was going to try was DeepSpeed inference (per https://www.deepspeed.ai/tutorials/inference-tutorial/), but I don't know how much improvement I'll see on a single-GPU machine.
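For anyone curious, this is roughly the single-GPU setup I was going to try, adapted from that tutorial. It's untested on this model; the model id, dtype, and kernel injection are my assumptions:

```python
# Single-GPU DeepSpeed inference sketch, adapted from the tutorial linked above.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/starchat-alpha"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Wrap the model with DeepSpeed's inference engine (mp_size=1 on a single GPU).
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,  # swap in fused kernels where supported
)
model = ds_engine.module

inputs = tokenizer("def hello_world():", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```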
Try editing the config.json file to say `use_cache: true`. That will help.
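If you'd rather not hand-edit the file, I believe the same flag can also be set from Python when loading the model or calling `generate` (minimal sketch, untested here; the model id is just the one from this thread):

```python
# Alternative to hand-editing config.json: pass use_cache when loading,
# or override it per generate() call.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-alpha",
    use_cache=True,  # reuse past key/value states instead of recomputing them each step
)

# ...or per call:
# outputs = model.generate(**inputs, use_cache=True, max_new_tokens=64)
```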
Thanks.
I've noticed 40% faster inference by using `use_cache: true`.
Edit: I'm peeling this out into its own thread.
https://huggingface.co/HuggingFaceH4/starchat-alpha/discussions/6