I just wanna thank everyone that worked on this.
Amazing job, this is really uncanny. I'm running this on an RTX 2070 and an i5 from years ago on my second computer. This is nuts: I'm getting 6 tokens per second. This opens up a whole other world of possibilities for running local applications for me. Of course it's not 100% perfect, but what do you expect from a model this size? How incredibly far we've gotten.
Hey Ryann, would you mind sharing how you did that?
I am trying on an RTX 3070 but it fails with this error:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
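For anyone hitting the same wall: here's a minimal sketch of the fix the error (and the linked docs) points at, assuming you're loading through the bitsandbytes 8-bit path in transformers. `"your-model-id"` is a placeholder for whatever model you're loading, and note that depending on your transformers version the flag is either the `from_pretrained` kwarg named in the error or `llm_int8_enable_fp32_cpu_offload` on `BitsAndBytesConfig`, which is what's shown here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-model-id"  # placeholder; substitute the model you're loading

# Keep whatever doesn't fit in the 3070's 8 GB of VRAM on the CPU in fp32,
# instead of erroring out when modules get dispatched off the GPU.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate split layers across GPU and CPU
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Expect the CPU-offloaded layers to be noticeably slower than the ones on the GPU, so token rates will drop the more you offload.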