3-bit version
How about adding a 3-bit version for quality testing?
I'd like to see how much worse it performs compared to the 4-bit version, because the 4-bit one hits OOM somewhere above 1000 tokens of context on a 24 GB GPU.
Otherwise, this 4-bit version works pretty well.
On KoboldAI, I can run the 4-bit non-groupsize model at full context on Windows on my 3090. Ooba takes up more VRAM for some reason; I'm guessing that's what you're using.
I am uploading a 3bit-128g quant of this model. It might take a couple of hours since HF seems to be having some trouble right now and is refusing to let me create a model card. The WikiText-2 perplexity is 12% worse than the 4-bit non-groupsize version, which is a substantial loss in coherence, but the file is 17% smaller, which should translate into roughly similar VRAM savings. You will find it here: https://huggingface.co/tsumeone/llama-30b-supercot-3bit-128g-cuda
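For anyone who wants to produce a quant like this themselves, here is a rough sketch using the AutoGPTQ Python API. This is not necessarily the tooling used for the upload above, and the model path, calibration text, and output directory are placeholders; a real run would also use a proper calibration set rather than a single sentence.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model_dir = "path/to/llama-30b-supercot"   # placeholder: full-precision model
quantized_dir = "llama-30b-supercot-3bit-128g"  # placeholder: output directory

tokenizer = AutoTokenizer.from_pretrained(base_model_dir, use_fast=True)

# A real run would use a few hundred calibration samples (e.g. from C4 or WikiText-2);
# a single toy example is shown here only to keep the sketch short.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

quantize_config = BaseQuantizeConfig(
    bits=3,          # 3-bit weights, as in the upload above
    group_size=128,  # one scale/zero per 128 weights; smaller groups add storage overhead
    desc_act=False,
)

# Load the full-precision model, run GPTQ layer by layer, then save the quantized weights.
model = AutoGPTQForCausalLM.from_pretrained(base_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir)
```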
Just want to add that I also tried quantizing a 3bit-32g version to see if the perplexity could be improved, but the file ended up 2% larger than the 4-bit non-groupsize version while still having 5% worse perplexity. There is basically no reason to even consider that one, since it will use more VRAM and also be less coherent.
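For reference, here is a minimal sketch of how WikiText-2 perplexity numbers like the ones above can be compared. It assumes the quantized checkpoint loads via AutoGPTQ's `from_quantized` and that the wrapper forwards `labels` to the underlying model; the chunking is simplified, so absolute numbers will differ slightly from a full eval script, but relative comparisons between quants should hold.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "llama-30b-supercot-3bit-128g"  # placeholder: local quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")
model.eval()

# Concatenate the WikiText-2 test split and score it in fixed-size windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048
nlls = []
for start in range(0, enc.input_ids.size(1) - window, window):
    ids = enc.input_ids[:, start:start + window].to("cuda:0")
    with torch.no_grad():
        # With labels == input_ids the model returns the mean token cross-entropy.
        nlls.append(model(ids, labels=ids).loss.float())

print(f"wikitext-2 ppl: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```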