GGML Quantize?
So, I'm able to run the sample code (although, frustratingly, it wants a token even with a fully local copy, and even if I force the config file to hardcode my local path).
I'm trying to use GGML's GPT-NeoX support to convert the model. In the example convert script I swapped the tokenizer for `LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])` and swapped the `AutoModelForCausalLM()` call as well, and I'm able to generate a `ggml-model-f16.bin` (roughly as sketched below).
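For reference, a minimal sketch of the two swapped calls (not the full convert script; `model_dir` below is just my local checkpoint directory, and `trust_remote_code` follows the model card's sample code, if I'm remembering it right):

```python
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Local directory holding the downloaded checkpoint.
model_dir = "/models/llm/jp-stablelm/stabilityai_japanese-stablelm-instruct-alpha-7b"

# The model uses the NovelAI nerdstash tokenizer rather than the stock
# GPT-NeoX tokenizer, so the convert script's tokenizer line gets replaced.
tokenizer = LlamaTokenizer.from_pretrained(
    "novelai/nerdstash-tokenizer-v1",
    additional_special_tokens=['▁▁'],
)

# The checkpoint ships a custom modeling file, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
```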
However, if I try to run `gpt-neox` on the model, here's the error I get:
bin/gpt-neox -m /models/llm/jp-stablelm/stabilityai_japanese-stablelm-instruct-alpha-7b/ggml-model-f16.bin
main: seed = 1691765429
gpt_neox_model_load: loading model from '/models/llm/jp-stablelm/stabilityai_japanese-stablelm-instruct-alpha-7b/ggml-model-f16.bin' - please wait ...
gpt_neox_model_load: n_vocab = 65535
gpt_neox_model_load: n_ctx = 1024
gpt_neox_model_load: n_embd = 4096
gpt_neox_model_load: n_head = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot = 32
gpt_neox_model_load: par_res = 1
gpt_neox_model_load: ftype = 1
gpt_neox_model_load: qntvr = 0
gpt_neox_model_load: ggml ctx size = 16390.52 MB
gpt_neox_model_load: memory_size = 512.00 MB, n_mem = 32768
gpt_neox_model_load: unknown tensor 'transformer.embed_in.weight' in model file
main: failed to load model from '/models/llm/jp-stablelm/stabilityai_japanese-stablelm-instruct-alpha-7b/ggml-model-f16.bin'
Just checking in to see if anyone has had better luck with GGML quantization support, or knows how different this model is from other GPT-NeoX or StableLM models that have already been quantized?
Me too...
I'm not entirely certain, but I believe the reason is that in the `main.cpp` file they are currently hard-coding the layer names with the prefix `gpt_neox`. This approach works well with `stablelm-base-alpha-7b` (you can check it in its `pytorch_model.bin.index.json` file). However, in the case of `japanese-stablelm-base-alpha-7b`, they are using the prefix `transformer` (according to its `pytorch_model.bin.index.json` file).
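If it helps, here's a quick way to double-check that (assuming both checkpoints are downloaded locally; the two paths below are placeholders):

```python
import json

# Placeholder paths to the shard indexes of the two downloaded checkpoints.
for index_path in (
    "stablelm-base-alpha-7b/pytorch_model.bin.index.json",
    "japanese-stablelm-base-alpha-7b/pytorch_model.bin.index.json",
):
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    # Print the leading component of every tensor name listed in the index.
    prefixes = sorted({name.split(".", 1)[0] for name in weight_map})
    print(index_path, "->", prefixes)
# The first index should list gpt_neox among the prefixes, while the
# japanese-stablelm one shows transformer instead.
```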
I think the simplest solution would be to modify the prefix in the `main.cpp` file to `transformer`.
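An alternative I've been wondering about (purely hypothetical, I haven't tried it): leave `main.cpp` alone and rename the tensors on the conversion side so they carry the `gpt_neox.` prefix the example expects, something like:

```python
def rename_to_gpt_neox_prefix(state_dict):
    """Swap the transformer.* prefix for the gpt_neox.* prefix before the
    convert script writes the tensors out; other names are left untouched."""
    return {
        ("gpt_neox." + name[len("transformer."):] if name.startswith("transformer.") else name): tensor
        for name, tensor in state_dict.items()
    }
```

Whether the remaining tensor names then match what the GPT-NeoX example looks for is something I haven't verified.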