The difference between Q8 and F16 models
Is there any big difference between the Q8 and F16 models? Like will F16 model perform 2 times better than Q8?
No, the difference is so small it can barely be measured. I would say i1-Q5_K_M and larger have no meaningful difference from the unquantized model. Instead of Q8 I recommend using i1-Q6 from https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-i1-GGUF. Here are some plots I created a week ago for some other models. The small "i"-prefix on the plot means that weighted/imatrix quants are used.
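To put the "2 times better" question in perspective: F16 roughly doubles the file size of Q8 while the quality gain is barely measurable. A rough back-of-the-envelope sketch (the bits-per-weight figures are approximate values for llama.cpp quant formats, and the 3.2B parameter count is an assumption for a Llama-3.2-3B-class model):

```python
# Rough file-size comparison for a ~3B-parameter model at different
# GGUF quantization levels. Bits-per-weight values are approximate:
# block-based quants carry per-block scales, so e.g. Q8_0 is ~8.5 bpw.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.5,
}

def approx_size_gb(n_params: float, bpw: float) -> float:
    """Approximate model file size in GB for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 1e9

n_params = 3.2e9  # assumed parameter count
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{approx_size_gb(n_params, bpw):.1f} GB")
```

So F16 costs roughly twice the disk space and memory bandwidth of Q8 for a difference in output quality that is almost unmeasurable, which is why the smaller i1-Q6 quant is usually the better trade-off.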
Thanks for all those graphs. I'm not very familiar with LLMs, but I think I understood.