General tips around inference speed?
Maybe I just haven't run a model this large before, but I'm blown away by how much slower inference is for this 16B model compared to a 12B-parameter model like Pythia. Are there any speed-up tips? Things I'm doing so far:
- On PyTorch 2.0
- Confirmed it's running on CUDA (`device=0` and VRAM is soaked)
- `torch_dtype=torch.bfloat16`
- Even tried loading in 8-bit, with no noticeable speed-up (but I guess that one is more about alleviating memory than speeding things up).
Any other ideas?
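For reference, here's roughly how I'm loading and running the model. The model id is just the one this thread is on, and the prompt and generation settings are placeholders, not my exact script:

```python
# Rough sketch of my current setup (placeholders, not my exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/starchat-alpha"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights instead of fp32
    device_map={"": 0},          # keep the whole model on GPU 0
)

# The 8-bit variant I also tried (requires bitsandbytes):
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, load_in_8bit=True, device_map={"": 0}
# )

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```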
Did PyTorch 2.0 give you a noticeable improvement?
I actually didn't try PyTorch 1.x, so I don't know! The last thing I was going to try was DeepSpeed inference (per https://www.deepspeed.ai/tutorials/inference-tutorial/), but I don't know how much improvement I'll see on a single-GPU machine.
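For anyone curious, this is roughly the single-GPU setup I was going to try, adapted from that tutorial. It's untested on this model; the model id, dtype, and kernel injection are my assumptions:

```python
# Single-GPU DeepSpeed inference sketch, adapted from the tutorial linked above.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/starchat-alpha"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Wrap the model with DeepSpeed's inference engine (mp_size=1 on a single GPU).
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,  # swap in fused kernels where supported
)
model = ds_engine.module

inputs = tokenizer("def hello_world():", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```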
Try editing the config.json file to say `use_cache: true`. That will help.
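If you'd rather not hand-edit the file, I believe the same flag can also be set from Python when loading the model or calling `generate` (minimal sketch, untested here; the model id is just the one from this thread):

```python
# Alternative to hand-editing config.json: pass use_cache when loading,
# or override it per generate() call.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-alpha",
    use_cache=True,  # reuse past key/value states instead of recomputing them each step
)

# ...or per call:
# outputs = model.generate(**inputs, use_cache=True, max_new_tokens=64)
```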
Thanks.
I've noticed 40% faster inference by using `use_cache: true`.
Edit: I'm peeling this out into its own thread.
https://huggingface.co/HuggingFaceH4/starchat-alpha/discussions/6