Can we get a 4-bit quantized version?

#5
by yehiaserag - opened

It would help a lot if we could get a 4-bit version, since all the 4-bit versions out there are either based on the LoRA or not working as expected.

I quantized the model in Google Colab and tested it with alpaca.cpp. The quality is a bit improved compared to the LoRA-merged version.
I made a magnet link for the quantized version (the file type is .bin). @chavinlo, can I share the link on GitHub?

The format is ggml.

Sure, or you can share it here and I can link it in the README.

Thanks! Here's the link (sorry it's so long, I used an online generator):

magnet:?xt=urn:btih:69fb9b4c1e0888336f5253ae75d3e10a9299ab7d&dn=ggml-alpaca-7b-native-q4.bin&tr=http%3A%2F%2F125.227.35.196%3A6969%2Fannounce&tr=http%3A%2F%2F210.244.71.25%3A6969%2Fannounce&tr=http%3A%2F%2F210.244.71.26%3A6969%2Fannounce&tr=http%3A%2F%2F213.159.215.198%3A6970%2Fannounce&tr=http%3A%2F%2F37.19.5.139%3A6969%2Fannounce&tr=http%3A%2F%2F37.19.5.155%3A6881%2Fannounce&tr=http%3A%2F%2F46.4.109.148%3A6969%2Fannounce&tr=http%3A%2F%2F87.248.186.252%3A8080%2Fannounce&tr=http%3A%2F%2Fasmlocator.ru%3A34000%2F1hfZS1k4jh%2Fannounce&tr=http%3A%2F%2Fbt.evrl.to%2Fannounce&tr=http%3A%2F%2Fbt.rutracker.org%2Fann&tr=https%3A%2F%2Fwww.artikelplanet.nl&tr=http%3A%2F%2Fmgtracker.org%3A6969%2Fannounce&tr=http%3A%2F%2Fpubt.net%3A2710%2Fannounce&tr=http%3A%2F%2Ftracker.baravik.org%3A6970%2Fannounce&tr=http%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.filetracker.pl%3A8089%2Fannounce&tr=http%3A%2F%2Ftracker.grepler.com%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.mg64.net%3A6881%2Fannounce&tr=http%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.torrentyorg.pl%2Fannounce&tr=https%3A%2F%2Finternet.sitelio.me%2F&tr=https%3A%2F%2Fcomputer1.sitelio.me%2F&tr=udp%3A%2F%2F168.235.67.63%3A6969&tr=udp%3A%2F%2F182.176.139.129%3A6969&tr=udp%3A%2F%2F37.19.5.155%3A2710&tr=udp%3A%2F%2F46.148.18.250%3A2710&tr=udp%3A%2F%2F46.4.109.148%3A6969&tr=udp%3A%2F%2Fcomputerbedrijven.bestelinks.nl%2F&tr=udp%3A%2F%2Fcomputerbedrijven.startsuper.nl%2F&tr=udp%3A%2F%2Fcomputershop.goedbegin.nl%2F&tr=udp%3A%2F%2Fc3t.org&tr=udp%3A%2F%2Fallerhandelenlaag.nl&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969

Thank you so much for this. I can confirm that the quantized native model from Taiyouillusion's magnet link is legit. Running on alpaca.cpp, it's a big leap forward in response quality compared to the 7B or 13B alpaca-lora models. What a time to be alive!

Can you share how you converted the post-trained HF weights back into the standard llama format for conversion to ggml? Or did you go directly from HF to ggml somehow? I got hung up on a few things. One is that convert-pth-to-ggml.py (from llama.cpp) calls numpy().squeeze() on the data, and numpy does not support bfloat16, which Alpaca uses. That was a quick fix (not sure if my hack affects anything), but the quantize step then fails. From some sleuthing around, it seems like there needs to be a conversion step after the fine-tuning to get the weights back into the standard llama format.
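
For reference, the quick fix I mean is roughly the following (a sketch only, with a made-up helper name): cast bfloat16 tensors to float32 before the numpy() call, since numpy has no bfloat16 dtype.

import torch

def tensor_to_numpy(t: torch.Tensor):
    # numpy has no bfloat16 dtype, so cast first, then mirror the
    # .numpy().squeeze() call that convert-pth-to-ggml.py makes
    if t.dtype == torch.bfloat16:
        t = t.to(torch.float32)
    return t.numpy().squeeze()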

I uploaded the script I used in Colab to convert the HF model to GitHub: https://github.com/taiyou2000/alpaca-convert-colab/blob/main/alpaca-convert-colab-fixed.ipynb

When I try running this script, I first get an error about accelerate being missing, and after installing that, I get:

NameError                                 Traceback (most recent call last)

<ipython-input-4-bd7436545f55> in <module>
      8 tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
      9 
---> 10 base_model = LLaMAForCausalLM.from_pretrained(
     11     "chavinlo/alpaca-native",
     12     load_in_8bit=False,

/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2488             init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
   2489         elif load_in_8bit or low_cpu_mem_usage:
-> 2490             init_contexts.append(init_empty_weights())
   2491 
   2492         with ContextManagers(init_contexts):

NameError: name 'init_empty_weights' is not defined

Any hints on fixing this?

Because the upstream llama.cpp repository recently changed the quantized ggml format, old q4.bin files will stop working, so I had to requantize this. I did manage to get it working: I had to remove the "accelerate" pip3 package and use a Colab runtime with a lot of RAM. I was constantly close to running out of disk space during the conversion, but just managed to finish it.

Here's the magnet link: magnet:?xt=urn:btih:0e51003c8a5610aa713f675891f0a7f87051be1a&dn=ggml-alpaca-7b-native-q4.bin&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

Sometimes I find that a magnet link won't work unless a few people have downloaded it through the actual torrent file. You can find the torrent at "suricrasia dot online slash stuff slash ggml-alpaca-7b-native-q4 dot bin dot torrent dot txt"; just replace "dot" with "." and "slash" with "/".
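
If you're unsure whether an existing q4.bin predates the format change, one rough way to check is to look at the 4-byte magic at the start of the file and compare it against the magic your llama.cpp or alpaca.cpp build expects. The sketch below only prints the bytes; which values correspond to the old vs. new format depends on the release you are running, so check that against the source.

import sys

# Print the 4-byte magic at the start of a ggml .bin file so it can be
# compared with the magic expected by your llama.cpp / alpaca.cpp build.
def read_magic(path: str) -> str:
    with open(path, "rb") as f:
        return f.read(4).hex()

if __name__ == "__main__":
    print(read_magic(sys.argv[1]))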

Can you post what you changed in Google Colab?

I actually didn't need to change anything; I just had to run it with Google Colab Pro. If you don't, it will ask you to install the "accelerate" package, and that's where the error comes from.
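
For context on that error: init_empty_weights is provided by the accelerate package, and transformers only reaches for it on the load_in_8bit / low_cpu_mem_usage code path shown in the traceback above. So the two ways around the NameError are to have accelerate installed, or to stay on the plain loading path, which loads the whole model into RAM (hence Colab Pro). A rough sketch, with the model and class names taken from the traceback and the remaining arguments as assumptions, not copied from the notebook:

import torch
from transformers import LLaMAForCausalLM  # class name as it appears in the traceback

# Stay on the plain (non-accelerate) loading path; this loads the full
# model into RAM, so the runtime needs plenty of memory.
base_model = LLaMAForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=False,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16,  # assumption, not from the traceback
)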
