How many tokens per second?
Could someone please share how many tokens per second they get when running this model on CPU and RAM only, without a GPU?
7 t/s on CPU/RAM only (Ryzen 5 3600), 10 t/s with 10 layers offloaded to the GPU, 12 t/s with 15 layers offloaded to the GPU.
7 t/s on CPU/RAM seems pretty good. How much RAM do you have on your computer? And what interface do you use: text-generation-webui, koboldcpp, or something else?
llama.cpp
On my RTX 3090, around 40 t/s.
Q4_K_M version (30 layers on GPU).
Thank you for your replies. If anyone else has numbers, please share them with us.
Hi, can you please share the Python code used to access the model? I am struggling to find any.
I'm using llama.cpp (one small binary file) to run the model.
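If you specifically want Python, the llama-cpp-python bindings load the same GGUF files. A minimal sketch, assuming the file is already downloaded locally (the model path, prompt, and layer count below are placeholders; lower n_gpu_layers if you run out of VRAM):

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder -- point it at your local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=30,  # layers offloaded to the GPU; lower this if VRAM runs out
    n_ctx=4096,       # context window
)

out = llm("Q: What is the capital of France? A:", max_tokens=64)
print(out["choices"][0]["text"])
```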
On an RTX 4090 & i9-14900K, benchmarked with llama-bench from llama.cpp (pp 512 = prompt processing of 512 tokens, tg 128 = generation of 128 tokens).
model | size | params | backend | ngl | threads | pp 512 (t/s) | tg 128 (t/s) |
---|---|---|---|---|---|---|---|
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 8 | 205.07 | 83.16 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 16 | 204.48 | 83.21 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 24 | 204.28 | 83.22 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 32 | 203.82 | 83.17 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 8 | 145.54 | 27.75 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 16 | 121.58 | 25.57 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 24 | 147.14 | 26.41 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 32 | 145.23 | 9.36 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 8 | 58.18 | 15.12 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 16 | 49.28 | 13.8 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 24 | 64.25 | 15.07 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 32 | 73.69 | 12.02 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 8 | 33.86 | 10.5 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 16 | 31.75 | 9.5 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 24 | 40.37 | 10.58 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 32 | 45.39 | 8.8 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 8 | 18.02 | 7.1 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 16 | 19.74 | 5.9 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 24 | 24.81 | 6.74 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 32 | 28.31 | 5.62 |
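For anyone who doesn't want to build llama-bench, a quick-and-dirty way to estimate generation speed from Python is sketched below (using llama-cpp-python; the path and settings are placeholders, and the result won't match llama-bench exactly since the timing also includes prompt processing):

```python
# Rough tokens-per-second check with llama-cpp-python -- a sanity test,
# not a replacement for llama-bench. The model path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", n_gpu_layers=27, n_ctx=2048)

start = time.perf_counter()
out = llm("Write a short paragraph about the history of the bicycle.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} t/s")
```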
Hello, on my machine (Ryzen 7 1700; 40 GB RAM; RTX 4090 with 24 GB VRAM) I get a mere 0.96 tokens/s. Oobabooga, mixtral-8x7b-v0.1.Q5_K_M.gguf, model loader = llama.cpp, n-gpu-layers=30, n_ctx=16384.
I tried the 6-bit model, but it does not run (CUDA error, out of memory).
I'll try other configs; I'll post any improvement here later.
That is far worse than I get on an old Xeon, CPU only, and I'm using the Q6 model (the latest ooba isn't using my GPU at all; I need to look into it later this week to see what is up).
That seems far lower than it should be. I'm almost sure something is wrong.
That is extremely slow... I have a Ryzen 7950X3D and an RTX 3090 and get 30+ tokens/s with Q4_K_M, and 10+ tokens/s with Q5 (fewer layers on the GPU).
Too many layers on the GPU, especially with Q5. Try 18 to 20 layers instead.
With 30 GPU layers, the excess likely spills into shared video memory (system RAM), which is not advisable at all, since the GPU then has to work out of that very slow memory (a rough layer-budget estimate is sketched below).
Also, don't forget that the first response after the model has been loaded into memory can take much longer...
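A rough layer budget, assuming the ~30 GiB of Q5_K_M weights are spread evenly over the ~33 offloadable layers shown in the table above, and keeping a few GiB free for the KV cache and CUDA overhead (both figures are approximations, not measurements):

```python
# Back-of-the-envelope layer budget. Assumptions: the Q5_K_M weights
# (~30 GiB, see the table above) are spread evenly over ~33 offloadable
# layers, and a few GiB are reserved for the KV cache and CUDA overhead.
model_size_gib = 30.02  # Q5_K_M file size
n_layers = 33           # offloadable layers (ngl for full offload)
vram_gib = 24.0         # RTX 4090 / RTX 3090
reserve_gib = 4.0       # rough margin for KV cache, scratch buffers, display

per_layer = model_size_gib / n_layers
max_layers = int((vram_gib - reserve_gib) / per_layer)
print(f"~{per_layer:.2f} GiB per layer -> about {max_layers} layers fit")
# prints roughly: ~0.91 GiB per layer -> about 21 layers fit
```

That lines up with the 18 to 20 layers suggested above, and with the ngl 22 used for Q5_K_M in the benchmark table.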