My CPU is only hitting about 50% usage on each of its cores.
I am running llama.cpp with the following command for this model. The CPU is a 12-core Ryzen 9.
main -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf --color -c 30000 --temp 0.0 --repeat_penalty 1.1 -n -12 --instruct --override-kv llama.expert_used_count=int:3 --reverse-prompt "### Human:"
I only see about 50% usage on each of the 12 cores. Has anyone else noticed the same?
I set the number of threads with this:
-t N, --threads N number of threads to use during generation (default: 8)
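With the default of 8 threads, a 12-core chip will never be fully loaded. A minimal tweak, assuming the same binary and model from your post, is to append -t to match the physical core count, e.g.:

# assumption: 12 physical cores; SMT sibling threads rarely help llama.cpp generation
main -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf --color -c 30000 --temp 0.0 --repeat_penalty 1.1 -n -12 --instruct --override-kv llama.expert_used_count=int:3 --reverse-prompt "### Human:" -t 12

Also note that many monitors count logical CPUs, so 12 busy threads on a 24-thread part can read as roughly 50% overall even when every physical core is working.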
That just means llama.cpp is not running at optimal speed on your machine: the CPU is sitting idle while it waits for DRAM to deliver data. This is called being memory bound, and it is quite common for memory-intensive workloads like LLM inference.
There are a few methods to mitigate the memory-bound effect, such as pre-computing sparsity in the model weights and streaming only a selected subset of the weights per token. I am not sure how well any of that is implemented in llama.cpp, but it is mostly an issue on the llama.cpp side.
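To see why more threads won't fix it, here is a back-of-envelope upper bound: tokens/sec cannot exceed memory bandwidth divided by bytes streamed per token. All numbers below are assumptions for illustration, not measurements (roughly 18.5B active parameters for Mixtral with 3 experts used, Q8_0 at about 1.06 bytes per weight, about 60 GB/s dual-channel DDR5):

# rough ceiling: bandwidth / bytes touched per token (all figures assumed)
awk 'BEGIN { params=18.5e9; bytes_per_w=1.0625; bw=60e9; printf "~%.1f tok/s max\n", bw/(params*bytes_per_w) }'

That comes out around 3 tok/s, and extra cores cannot push past it; they just idle waiting on RAM, which is exactly the ~50% usage you are seeing.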