Extremely slow on 16GB MacBook M1 Pro
Hi,
First, thank you for the effort to do the 4-bit quantization so the models can be run with llama.cpp on local computers.
I'm using it with the latest llama.cpp code, on a 16GB RAM MacBook with an M1 Pro chip. It was able to run the Alpaca-7b-4bit model easily, using about 2GB of RAM and 2-3 cores, and the token generation speed is about 6 tokens per second.
However, running the vicuna-13b-4bit model, the token generation speed is almost 1 token per minute (not a typo, and not per second), which renders it almost useless. I checked macOS's Activity Monitor; surprisingly, RAM usage still seems to be around 2GB, while CPU usage increased to about 4 cores. I'm not sure whether you have tested on Mac as well, or what the issue could be.
Thanks
Bruce
This works fine on my M1 Mac Studio. My guess is that you're loading past your RAM capacity and using your SSD as swap / a page file, which causes a dramatic drop in speed. You could try adding --mlock to your ./main command, which will attempt to keep the model in RAM only. My guess is you'll probably see segmentation faults and other crashes of the model instead, though.
Another modifier you can use, though maybe only useful on your 7B models, is -t <n>, which sets how many threads you'd like the model to use. If you have an 8-core M1 MacBook Pro I'd say 6 is the highest you should go; if you have a 10-core chip you could use -t 8.
I don't think -t will speed up this 13B model, though, as your most likely issue is the model (or parts of it) being sent to "virtual RAM", a.k.a. swap, a.k.a. the page file on the SSD.
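For reference, a command combining both suggestions would look roughly like this (the model path and prompt below are just placeholders; double-check the flag names against ./main --help for your build):

    ./main -m ./models/vicuna-13b-4bit/ggml-model-q4_0.bin --mlock -t 6 -n 128 -p "Hello, how are you?"

Here --mlock asks the OS to keep the weights pinned in RAM rather than paged out to disk, and -t 6 caps generation at 6 threads.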
Thank you for the answer. I'm also keeping an eye on the recent mmap change in llama.cpp.
I will also try another Linux machine with 32GB of physical RAM to isolate the issue.
Relevant issue on GitHub: https://github.com/ggerganov/llama.cpp/issues/767
Indeed it is due to the recent mmap change. Here is what I found:
On the 16GB MacBook M1 Pro, using --mlock solves the problem. It is generating tokens quite fast, and is stable (for now).
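One way to double-check that swapping (rather than CPU) is the bottleneck is to watch paging activity while tokens are being generated; on macOS, for example:

    vm_stat 5    # print paging statistics every 5 seconds; steadily rising pageout/swapout counts during generation mean data is being pushed to the SSD

If your build of llama.cpp has it, --no-mmap is another flag worth trying, since it disables the memory-mapped loading path that the linked issue is about.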
These are some great tips for those of us traveling about or yet to fully invest in larger, more capable machines :)
What would you say is the most efficient model to run locally with only 8-16 GB of RAM and a 256-512 GB SSD? Also, do you think adding an external SSD would help at all?
You can start with a good 7B model (e.g. Zephyr) with 4-bit quantization in GGUF format. That should fit into your RAM fine. An SSD does not really help, since the model needs to fit into RAM to run fast.
If you close your other apps and free up enough RAM, then maybe 13B will run as well. But it might be quite a bit slower.
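If it helps, a minimal run with a 4-bit GGUF file looks roughly like this (the filename is just an example of a typical quantized download, not a specific recommendation):

    ./main -m ./models/zephyr-7b-beta.Q4_K_M.gguf -t 6 -n 256 -p "Explain what quantization does to a language model."

A 7B model at Q4_K_M quantization is roughly 4-5 GB on disk, so it sits comfortably in 8-16 GB of RAM; a 13B at the same quantization is around 8 GB, which is why it only fits once you free up memory from other apps.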
Okay, very cool! Will give that model a try! What do you think are some of the cooler things I could do with that model?