Speed

#1
by ramzeez88 - opened

I am wondering what kind of speed I can expect from this model if I split the layers between CPU and GPU. I have a 4-core/8-thread CPU, 16 GB of RAM, and an NVIDIA 1070 Ti with 8 GB of VRAM.

It would be amazing if 16 GB of RAM were enough for this model. Even if your PC doesn't hang, the output speed will still be very slow because of swap file usage.

Yeah, 16 GB is going to be tight, but it should be possible: offload 8 GB of layers to the GPU, and then the smaller quants, e.g. Q3_K_M, will use around 10 GB of RAM, so it should just fit.

Speed is not going to be great on account of your CPU and GPU both being weak and old (the 1070 Ti is very old now) - expect it to be very slow. But it hopefully won't swap.

Maybe 1 token a second, or 2 tokens a second? Something like that.
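
If it helps, here is a minimal sketch of splitting layers between CPU and GPU with the llama-cpp-python bindings. The GGUF filename, layer count, and thread count below are placeholders, not values taken from this repo - tune n_gpu_layers until VRAM is nearly full without overflowing.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path and n_gpu_layers value are assumptions; adjust them
# so the offloaded layers just fit in the 1070 Ti's 8 GB of VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q3_K_M.gguf",  # hypothetical path to the Q3_K_M quant
    n_gpu_layers=30,                 # rough guess; lower it if you run out of VRAM
    n_ctx=2048,                      # smaller context keeps RAM/VRAM usage down
    n_threads=4,                     # physical cores usually work best; try 4 or 8
)

out = llm("Hello, how are you?", max_tokens=64)
print(out["choices"][0]["text"])
```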

What is really crazy is that I tested various 34B models with 20 GB of RAM and an RTX 2060 Super, with various layer-offloading percentages. My CPU is just a Ryzen 5 2600X with 12 threads, and I get about 4-5 tokens per second, which is not fast but also not so bad on such hardware. I'm also using the 4-bit Q4_K_S quant. What do you think?
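
For comparing setups like this, a rough way to measure tokens per second with the same bindings (this assumes the `llm` object from the sketch above; the prompt and token cap are arbitrary):

```python
# Rough tokens-per-second measurement; reuses the `llm` object from above.
import time

start = time.time()
out = llm("Write a short story about a robot.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]  # actual number of tokens produced
print(f"~{generated / elapsed:.1f} tokens/sec")
```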
