Q6_K appears to be broken (And maybe other K-Quants as well)
@TheBloke, just wanted to bring your attention to this GitHub comment in case you have not seen it yet. It strongly suggests that there is something seriously wrong with the Q6_K quant, as it has far higher perplexity than even Q4_0 when using 2 experts.
The test does not include Q4_K_M or Q5_K_M; I think it would be a good idea to run a perplexity test on those two as well to make sure this issue does not affect all of the K variants. It's also pretty odd that the K variants have exactly the same file size as the non-K variants. I don't know if that is in any way connected, but it's certainly an oddity, and it's present in all models based on Mixtral.
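For anyone who wants to sanity-check the other K-quants themselves, here is a rough single-window sketch using llama-cpp-python (the model filename and eval file are just placeholders; the numbers won't exactly match llama.cpp's ./perplexity tool, which strides over the whole file in chunks, but it should be close enough to spot a badly broken quant):

```python
# Rough single-window perplexity check with llama-cpp-python.
# MODEL_PATH and EVAL_FILE are placeholders; point them at your own files.
import numpy as np
from llama_cpp import Llama

MODEL_PATH = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"  # placeholder filename
EVAL_FILE = "wiki.test.raw"                            # placeholder eval text

llm = Llama(model_path=MODEL_PATH, n_ctx=512, logits_all=True, verbose=False)

text = open(EVAL_FILE, encoding="utf-8").read()
tokens = llm.tokenize(text.encode("utf-8"))[: llm.n_ctx()]

llm.reset()
llm.eval(tokens)

# Logits at position i predict token i+1, so rows 0..n-2 score tokens 1..n-1.
logits = np.asarray(llm.scores[: len(tokens) - 1], dtype=np.float64)
m = logits.max(axis=-1, keepdims=True)
logprobs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
nll = -logprobs[np.arange(len(tokens) - 1), tokens[1:]]
print(f"perplexity over {len(tokens)} tokens: {np.exp(nll.mean()):.4f}")
```

Running the same script against the Q4_0, Q4_K_M, Q5_K_M and Q6_K files on the same eval text should make any regression in the K variants obvious.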
Check the frequency base. For me, on Q8_K, loaded using oobabooga, I get llama_new_context_with_model: freq_base = 10000.0 in the console log by default, which is AFAIK not right. And indeed the model falls apart with increasing context length. If I override it with rope_freq_base set to 1000000, the model works much better. I didn't try measuring the perplexity, though.
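If anyone wants to force the same override outside of oobabooga, a minimal llama-cpp-python sketch like this should work (the GGUF filename is just a placeholder):

```python
# Minimal sketch: override the RoPE frequency base when loading the model.
# The filename is a placeholder. Leaving rope_freq_base unset/0.0 uses the
# value from the GGUF metadata; 1000000.0 forces the override described above.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # placeholder
    n_ctx=4096,
    rope_freq_base=1000000.0,  # instead of the 10000.0 I was getting by default
)

out = llm.create_completion("The quick brown fox", max_tokens=16)
print(out["choices"][0]["text"])
```

The same override should also be available as --rope-freq-base on the llama.cpp command line.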
"Just wanted to bring your attention to this GitHub comment in case you have not seen it yet. It strongly suggests that there is something seriously wrong with the Q6_K quant, as it has far higher perplexity than even Q4_0 when using 2 experts."
I didn't follow that discussion from the start, but just read it and saw this:
"Re-tested Q6_K with the latest build, and the results are looking good now. Not sure about the source of the previous issue."
Ah, that's good to see. In the original test the Q6_K had a perplexity of 4.86, which was way above what it should have been. The new result of 3.90 is much more in line with expectations, so it seems to have been a false alarm in that case.
It would have been pretty bad if all of the K-quants had been underperforming this whole time, as they're the main type people use these days.
Given those results, I think it's safe to close this issue.