How many tokens per second?
Could someone please share how many tokens per second they get when running this model on CPU and RAM only, without a GPU?
7 t/s on CPU/RAM only (Ryzen 5 3600), 10 t/s with 10 layers offloaded to the GPU, 12 t/s with 15 layers offloaded to the GPU.
7 t/s on CPU/RAM seems pretty good. How much RAM do you have on your computer? And what interface do you use: text-generation-webui, koboldcpp, or something else?
llama.cpp
On my RTX 3090, around 40 t/s.
Q4_K_M version (30 layers on GPU).
Thank you for your replies. If anyone else has numbers, please share them with us.
Hi, can you please share the Python code used to access the model? I am struggling to find any.
I'm using llama.cpp (one small binary file) to run the model.
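If you specifically want Python, the llama-cpp-python bindings load the same GGUF files. A minimal sketch, assuming the file is already downloaded locally (the model path, prompt, and layer count below are placeholders; lower n_gpu_layers if you run out of VRAM):

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder -- point it at your local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=30,  # layers offloaded to the GPU; lower this if VRAM runs out
    n_ctx=4096,       # context window
)

out = llm("Q: What is the capital of France? A:", max_tokens=64)
print(out["choices"][0]["text"])
```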
On an RTX 4090 & i9-14900K, benchmarked with llama-bench from llama.cpp (pp 512 = prompt processing of 512 tokens, tg 128 = generation of 128 tokens).
model | size | params | backend | ngl | threads | pp 512 (t/s) | tg 128 (t/s) |
---|---|---|---|---|---|---|---|
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 8 | 205.07 | 83.16 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 16 | 204.48 | 83.21 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 24 | 204.28 | 83.22 |
llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 32 | 203.82 | 83.17 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 8 | 145.54 | 27.75 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 16 | 121.58 | 25.57 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 24 | 147.14 | 26.41 |
llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 32 | 145.23 | 9.36 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 8 | 58.18 | 15.12 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 16 | 49.28 | 13.8 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 24 | 64.25 | 15.07 |
llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 32 | 73.69 | 12.02 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 8 | 33.86 | 10.5 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 16 | 31.75 | 9.5 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 24 | 40.37 | 10.58 |
llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 32 | 45.39 | 8.8 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 8 | 18.02 | 7.1 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 16 | 19.74 | 5.9 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 24 | 24.81 | 6.74 |
llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 32 | 28.31 | 5.62 |
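For anyone who doesn't want to build llama-bench, a quick-and-dirty way to estimate generation speed from Python is sketched below (using llama-cpp-python; the path and settings are placeholders, and the result won't match llama-bench exactly since the timing also includes prompt processing):

```python
# Rough tokens-per-second check with llama-cpp-python -- a sanity test,
# not a replacement for llama-bench. The model path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", n_gpu_layers=27, n_ctx=2048)

start = time.perf_counter()
out = llm("Write a short paragraph about the history of the bicycle.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} t/s")
```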
Hello, on my machine (Ryzen 7 1700; 40 GB RAM; RTX 4090 with 24 GB VRAM) I get a mere 0.96 tokens/s. Oobabooga, mixtral-8x7b-v0.1.Q5_K_M.gguf, model loader = llama.cpp, n-gpu-layers=30, n_ctx=16384.
I tried the 6-bit model, but it does not run (CUDA error, out of memory).
I'll try other configs; I'll post any improvement here later.
That is far worse than I get on an old Xeon, CPU only, and I'm using the Q6 model (the latest ooba isn't using my GPU at all; I need to look into it later this week to see what is up).
That seems far lower than it should be. I'm almost sure something is wrong.
That is extremely slow... I have a Ryzen 7950X3D and an RTX 3090 and get 30+ tokens/s with Q4_K_M, and 10+ tokens/s with Q5 (fewer layers on the GPU).
Too many layers on the GPU, especially with Q5. Try 18 to 20 layers instead.
With 30 GPU layers, the excess likely spills into shared video memory (system RAM), which is not advisable at all, since the GPU then has to work out of that very slow memory (a rough layer-budget estimate is sketched below).
Also, don't forget that the first response after the model has been loaded into memory can take much longer...
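A rough layer budget, assuming the ~30 GiB of Q5_K_M weights are spread evenly over the ~33 offloadable layers shown in the table above, and keeping a few GiB free for the KV cache and CUDA overhead (both figures are approximations, not measurements):

```python
# Back-of-the-envelope layer budget. Assumptions: the Q5_K_M weights
# (~30 GiB, see the table above) are spread evenly over ~33 offloadable
# layers, and a few GiB are reserved for the KV cache and CUDA overhead.
model_size_gib = 30.02  # Q5_K_M file size
n_layers = 33           # offloadable layers (ngl for full offload)
vram_gib = 24.0         # RTX 4090 / RTX 3090
reserve_gib = 4.0       # rough margin for KV cache, scratch buffers, display

per_layer = model_size_gib / n_layers
max_layers = int((vram_gib - reserve_gib) / per_layer)
print(f"~{per_layer:.2f} GiB per layer -> about {max_layers} layers fit")
# prints roughly: ~0.91 GiB per layer -> about 21 layers fit
```

That lines up with the 18 to 20 layers suggested above, and with the ngl 22 used for Q5_K_M in the benchmark table.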